bfloat16 for Cooper Lake Xeon Scalable only?


Intel has recently released a new version of its software developer documentation, revealing additional details about its upcoming Xeon Scalable 'Cooper Lake-SP' processors. As it turns out, the new CPUs will support the AVX512_BF16 instructions and thus the bfloat16 format. The notable twist here is that, at this point, AVX512_BF16 appears to be supported only by the Cooper Lake-SP microarchitecture, and not by its direct successor, the Ice Lake-SP microarchitecture.

Bfloat16 is a truncated 16-bit version of the 32-bit IEEE 754 single-precision floating-point format. It preserves the 8-bit exponent, but reduces the precision of the significand from 24 bits to 8 bits to save memory, bandwidth, and processing resources, while maintaining the same dynamic range. The bfloat16 format was designed primarily for machine-learning and near-sensor computing applications, where precision is needed near zero but not so much at the extremes of the range. The number format is already supported by Intel's upcoming FPGAs as well as its Nervana Neural Network Processors, and by Google's TPUs. Since Intel supports the bfloat16 format across two of its product lines, it makes sense to support it elsewhere as well, which is what the company plans to do by adding AVX512_BF16 instructions to its upcoming Cooper Lake-SP platform.
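To make the layout concrete, here is a minimal C sketch of the conversion: a bfloat16 value is simply the upper 16 bits of a float's bit pattern. The helper names are hypothetical, and the truncation shown is a simplification; the AVX512_BF16 conversion instructions round to nearest even instead.

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float -> bfloat16 by truncation. bfloat16 keeps the
   sign bit, the 8 exponent bits, and the top 7 bits of the 23-bit fraction,
   i.e. the upper 16 bits of the float's bit pattern. */
static uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret bits without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: bfloat16 -> float. The low 16 fraction bits become
   zero, so every bfloat16 value is exactly representable as a float and the
   exponent range is unchanged. */
static float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}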

Support for AVX-512 by various Intel CPUs
(a newer microarchitecture also supports the extensions of the older ones)

Xeon (general-purpose):
- Skylake-SP: AVX512F, AVX512CD, AVX512BW, AVX512DQ, AVX512VL
- Cannon Lake adds: AVX512VBMI, AVX512IFMA
- Cascade Lake-SP adds: AVX512_VNNI
- Cooper Lake adds: AVX512_BF16
- Ice Lake adds: AVX512_VNNI, AVX512_VBMI2, AVX512_BITALG, AVX512_VPOPCNTDQ, AVX512 + VAES, AVX512 + GFNI, AVX512 + VPCLMULQDQ (but not AVX512_BF16)

Xeon Phi:
- Knights Landing: AVX512F, AVX512CD, AVX512ER, AVX512PF
- Knights Mill adds: AVX512_4FMAPS, AVX512_4VNNIW

Source: Intel Architecture Instruction Set Extensions Programming Reference (page 16)
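For software that needs to pick a code path at run time, support can be detected through CPUID rather than inferred from the model name. Per Intel's programming reference, AVX512_BF16 is reported in CPUID leaf 7, sub-leaf 1, EAX bit 5. A minimal sketch, assuming GCC or Clang and their <cpuid.h> header:

#include <cpuid.h>
#include <stdio.h>

/* Returns 1 if the CPU reports AVX512_BF16 support:
   CPUID.(EAX=07H, ECX=1):EAX[bit 5] per Intel's documentation. */
static int has_avx512_bf16(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7, sub-leaf 1 not available */
    return (eax >> 5) & 1;
}

int main(void)
{
    printf("AVX512_BF16: %s\n", has_avx512_bf16() ? "supported" : "not supported");
    return 0;
}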

The list of Intel's AVX512_BF16 vector neural network instructions includes VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. All of them can be executed in 128-bit, 256-bit, or 512-bit form, so software developers can pick one of nine versions based on their needs.

Intel AVX512_BF16 Instructions
VCVTNE2PS2BF16: Convert two packed single-precision values to one packed BF16 value

Intel C/C++ Compiler intrinsic equivalents:
VCVTNE2PS2BF16 __m128bh _mm_cvtne2ps_pbh (__m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_mask_cvtne2ps_pbh (__m128bh, __mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_maskz_cvtne2ps_pbh (__mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m256bh _mm256_cvtne2ps_pbh (__m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_mask_cvtne2ps_pbh (__m256bh, __mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m512bh _mm512_cvtne2ps_pbh (__m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_mask_cvtne2ps_pbh (__m512bh, __mmask32, __m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32, __m512, __m512);

VCVTNEPS2BF16: Convert packed single-precision values to packed BF16 values

Intel C/C++ Compiler intrinsic equivalents:
VCVTNEPS2BF16 __m128bh _mm_cvtneps_pbh (__m128);
VCVTNEPS2BF16 __m128bh _mm_mask_cvtneps_pbh (__m128bh, __mmask8, __m128);
VCVTNEPS2BF16 __m128bh _mm_maskz_cvtneps_pbh (__mmask8, __m128);
VCVTNEPS2BF16 __m128bh _mm256_cvtneps_pbh (__m256);
VCVTNEPS2BF16 __m128bh _mm256_mask_cvtneps_pbh (__m128bh, __mmask8, __m256);
VCVTNEPS2BF16 __m128bh _mm256_maskz_cvtneps_pbh (__mmask8, __m256);
VCVTNEPS2BF16 __m256bh _mm512_cvtneps_pbh (__m512);
VCVTNEPS2BF16 __m256bh _mm512_mask_cvtneps_pbh (__m256bh, __mmask16, __m512);
VCVTNEPS2BF16 __m256bh _mm512_maskz_cvtneps_pbh (__mmask16, __m512);

VDPBF16PS: Dot product of BF16 pairs accumulated into packed single precision

Intel C/C++ Compiler intrinsic equivalents:
VDPBF16PS __m128 _mm_dpbf16_ps (__m128, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_mask_dpbf16_ps (__m128, __mmask8, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_maskz_dpbf16_ps (__mmask8, __m128, __m128bh, __m128bh);
VDPBF16PS __m256 _mm256_dpbf16_ps (__m256, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_mask_dpbf16_ps (__m256, __mmask8, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_maskz_dpbf16_ps (__mmask8, __m256, __m256bh, __m256bh);
VDPBF16PS __m512 _mm512_dpbf16_ps (__m512, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_mask_dpbf16_ps (__m512, __mmask16, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_maskz_dpbf16_ps (__mmask16, __m512, __m512bh, __m512bh);
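To illustrate how these instructions compose, below is a minimal sketch of a dot product that converts FP32 inputs to BF16 on the fly and accumulates in FP32. It assumes a compiler invoked with -mavx512bf16 (e.g. recent GCC or Clang) and a CPU reporting AVX512_BF16; dot_bf16 is a hypothetical helper, not an Intel API, and it handles only lengths that are multiples of 32 to keep the example short.

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: dot product of two float arrays via BF16.
   n is assumed to be a multiple of 32 in this sketch. */
static float dot_bf16(const float *a, const float *b, size_t n)
{
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        /* VCVTNE2PS2BF16: pack two 16-float vectors into one 32-element BF16 vector */
        __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                          _mm512_loadu_ps(a + i));
        __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                          _mm512_loadu_ps(b + i));
        /* VDPBF16PS: multiply BF16 pairs and accumulate into 16 FP32 partial sums */
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc);   /* horizontal sum of the partial sums */
}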

Only for Cooper Lake?

When Intel mentions an instruction in its Intel Architecture Instruction Set Extensions and Future Features Programming Reference, the company typically names the first microarchitecture to support it and indicates that its successors also support it (or are set to support it) by stating "and later", omitting the word "microarchitecture". For example, Intel's original AVX is listed as supported by "Sandy Bridge and later".

This is not the case with AVX512_BF16: the document lists it as supported by "Future Cooper Lake" only. It would be a bit strange for the long-awaited 10 nm Ice Lake-SP server platform, which follows Cooper Lake-SP, not to support something its predecessor does. However, this is not an entirely impossible scenario: Intel wants to offer differentiated solutions these days, so it may tailor Cooper Lake-SP for some workloads while focusing Ice Lake-SP on others.

We have contacted Intel for additional information and will update this story if we receive more details.


Source: Intel Architecture Instruction Set Extensions Programming Reference (via InstLatX64/Twitter)
