intel-intrinsics 1.2.8
The most practical D SIMD solution! Use SIMD intrinsics with Intel syntax in D.
intel-intrinsics
The DUB package intel-intrinsics implements Intel intrinsics for D.
intel-intrinsics lets you use x86 SIMD in D, with support for LDC / DMD / GDC through a single syntax and API.
It can also target AArch64, for full speed on Apple Silicon.
"dependencies":
{
"intel-intrinsics": "~>1.0"
}
Features
SIMD intrinsics with _mm_ prefix
| Instruction set | DMD | LDC x86 | LDC AArch64 | GDC |
|---|---|---|---|---|
| MMX | Yes but slow (#16) | Yes | Yes but some slow (#45) | Yes (slow in 32-bit) |
| SSE | Yes but slow (#16) | Yes | Yes but some slow (#45) | Yes (slow in 32-bit) |
| SSE2 | Yes but slow (#16) | Yes | Yes but some slow (#45) | Yes (slow in 32-bit) |
| SSE3 | Yes but slow (#16) | Yes (use -mattr=+sse3) | Yes but some slow (#45) | Yes but slow (#39) |
| SSSE3 | No | No | No | No |
| ... | No | No | No | No |
The intrinsics implemented follow the syntax and semantics at: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
The philosophy (and guarantee) of intel-intrinsics is:
- When using LDC, intel-intrinsics should generate optimal code; otherwise it's a bug.
- No promise that the exact instruction is generated, because it's not always the fastest thing to do.
- Guarantee that the semantics of the intrinsic are preserved, above all other considerations.
SIMD types
intel-intrinsics defines the following types regardless of the compiler:
long1, float2, int2, short4, byte8, float4, int4, double2
though most of the time you would deal with:
alias __m128 = float4;
alias __m128i = int4; // and you can rely on __m128i being int4
alias __m128d = double2;
alias __m64 = long1;
Vector Operators for all
intel-intrinsics implements Vector Operators for compilers that don't have __vector support (DMD with 32-bit x86 target).
Example:
__m128 add_4x_floats(__m128 a, __m128 b)
{
return a + b;
}
is the same as:
__m128 add_4x_floats(__m128 a, __m128 b)
{
return _mm_add_ps(a, b);
}
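As a quick check of that equivalence, here is a minimal sketch of a complete program. It assumes the package is set up as a DUB dependency and that the SSE intrinsics live in the inteli.xmmintrin module; the function names addWithOperator and addWithIntrinsic are made up for this illustration.

```d
import inteli.xmmintrin; // assumed module for SSE intrinsics such as _mm_add_ps

// Hypothetical helper: addition through the vector operator.
__m128 addWithOperator(__m128 a, __m128 b)
{
    return a + b;
}

// Hypothetical helper: the same addition through the intrinsic.
__m128 addWithIntrinsic(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

void main()
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f); // lanes in memory order
    __m128 b = _mm_set1_ps(10.0f);                  // same value in all 4 lanes

    __m128 x = addWithOperator(a, b);
    __m128 y = addWithIntrinsic(a, b);

    // Both forms should produce identical lanes.
    foreach (i; 0 .. 4)
        assert(x.array[i] == y.array[i]);
    assert(x.array[0] == 11.0f);
    assert(x.array[3] == 14.0f);
}
```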
Individual element access
For maximum portability, access individual elements this way:
__m128i A;
// recommended portable way to set a single SIMD element
A.ptr[0] = 42;
// recommended portable way to get a single SIMD element
int elem = A.array[0];
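The same two accessors can be combined in a small helper. The sketch below is only an illustration: horizontalSum is a hypothetical name, and the import path inteli.emmintrin is assumed to be where the SSE2 integer intrinsics are declared.

```d
import inteli.emmintrin; // assumed module for SSE2 integer intrinsics

// Hypothetical helper: adds the four 32-bit lanes using portable reads.
int horizontalSum(__m128i v)
{
    int sum = 0;
    foreach (i; 0 .. 4)
        sum += v.array[i]; // portable per-element read
    return sum;
}

void main()
{
    __m128i A = _mm_setzero_si128(); // all lanes start at 0
    A.ptr[0] = 42;                   // portable per-element write
    A.ptr[3] = 8;

    assert(A.array[0] == 42);
    assert(horizontalSum(A) == 50);
}
```

Per-element access like this is convenient for setup and debugging; inside hot loops it is usually better to stay with whole-vector intrinsics.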
Why intel-intrinsics?
- Portability
It just works the same for DMD, LDC, and GDC.
When using LDC, intel-intrinsics lets you target AArch64 with the same semantics.
- Capabilities
Some instructions just aren't accessible using core.simd and ldc.simd capabilities. For example: pmaddwd, which is so important in digital video. Some instructions need an almost exact sequence of LLVM IR to get generated. ldc.intrinsics is a moving target and you need a layer on top of it.
- Familiarity
Intel intrinsic syntax is more familiar to C and C++ programmers. The Intel intrinsics names aren't good, but they are known identifiers. The problem with introducing new names is that you need hundreds of new identifiers.
- Documentation
There is a convenient online guide provided by Intel: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ Without this Intel documentation, it's much more difficult to write sizeable SIMD code.
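To make the pmaddwd point concrete, the sketch below uses _mm_madd_epi16, the Intel intrinsic that corresponds to that instruction. It is an illustration under assumptions: the helper name dot8 is invented here, the import path inteli.emmintrin is assumed, and, per the philosophy above, the exact instruction emitted is not guaranteed.

```d
import inteli.emmintrin; // assumed module for SSE2 intrinsics

// Hypothetical helper: dot product of two arrays of eight 16-bit integers.
int dot8(ref const short[8] a, ref const short[8] b)
{
    // Unaligned loads of 8 x 16-bit lanes each.
    __m128i va = _mm_loadu_si128(cast(const(__m128i)*) a.ptr);
    __m128i vb = _mm_loadu_si128(cast(const(__m128i)*) b.ptr);

    // Multiply 16-bit lanes pairwise, then add adjacent products
    // into four 32-bit lanes (the pmaddwd semantics).
    __m128i sums = _mm_madd_epi16(va, vb);

    // Finish with a portable per-element reduction.
    return sums.array[0] + sums.array[1] + sums.array[2] + sums.array[3];
}

void main()
{
    short[8] a = [1, 2, 3, 4, 5, 6, 7, 8];
    short[8] b = [2, 2, 2, 2, 2, 2, 2, 2];
    assert(dot8(a, b) == 72); // 2 * (1 + 2 + ... + 8)
}
```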
Notable differences between x86 and AArch64 targets
AArch64 respects floating-point rounding through MXCSR emulation. This works by using FPCR as thread-local storage for the rounding mode.
Some features of MXCSR are absent:
- Getting floating-point exception status
- Setting floating-point exception masks
- Separate control for denormals-are-zero and flush-to-zero (ARM has one bit for both)
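The part of MXCSR that is emulated, the rounding mode, can be exercised as in the sketch below. This is an illustration under assumptions rather than a statement of the package's exact API surface: it assumes _MM_GET_ROUNDING_MODE, _MM_SET_ROUNDING_MODE and the _MM_ROUND_* constants are available from inteli.xmmintrin as they are in the C headers, and roundFirstLane is a hypothetical helper name.

```d
import inteli.xmmintrin; // assumed home of the rounding-mode helpers
import inteli.emmintrin; // assumed home of _mm_cvtps_epi32
import std.stdio : writeln;

// Hypothetical helper: converts x to int under the given rounding mode,
// then restores the previous mode.
int roundFirstLane(float x, int roundingMode)
{
    uint saved = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(roundingMode);
    scope(exit) _MM_SET_ROUNDING_MODE(saved); // always restore the caller's mode

    // This conversion follows the current rounding mode.
    __m128i r = _mm_cvtps_epi32(_mm_set1_ps(x));
    return r.array[0];
}

void main()
{
    writeln("round down: ", roundFirstLane(1.7f, _MM_ROUND_DOWN));
    writeln("round up:   ", roundFirstLane(1.2f, _MM_ROUND_UP));
}
```

Restoring the saved mode matters because the rounding mode is per-thread state that other code in the same thread also observes.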
Notable difference vs C/C++ or core.simd
When using intel-intrinsics, every implicit conversion of similarly-sized vectors should be done with a cast instead.
__m128i b = _mm_set1_epi32(42);
__m128 a = b; // NO, only works in LDC
__m128 a = cast(__m128)b; // YES, works in all D compilers
This is because D does not allow user-defined implicit conversions, and core.simd might be emulated (DMD). Use this cast, or your code won't work with every D compiler.
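It may help to see that such a cast reinterprets the bits of the vector rather than converting values. The sketch below assumes the import path inteli.emmintrin; bitsToFloats is a hypothetical helper name.

```d
import inteli.emmintrin; // assumed module for SSE2 intrinsics

// Hypothetical helper: views the 128 bits of an integer vector as 4 floats.
__m128 bitsToFloats(__m128i v)
{
    return cast(__m128) v; // explicit cast, no value conversion
}

void main()
{
    // 0x3F800000 is the IEEE-754 bit pattern of 1.0f.
    __m128i bits = _mm_set1_epi32(0x3F800000);
    __m128 f = bitsToFloats(bits);

    assert(f.array[0] == 1.0f);
    assert(f.array[3] == 1.0f);
}
```

For an actual numeric conversion from integers to floats, a conversion intrinsic such as _mm_cvtepi32_ps is the tool to reach for instead.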
Who is using it?
- dg2d is a very fast 2D renderer
- Auburn Sounds audio products
- Cut Through Recordings audio products
Video introduction
In this DConf 2019 talk, Auburn Sounds:
- introduces how intel-intrinsics came to be,
- demonstrates a 3.5x speed-up for some particular loops,
- reminds viewers that normal D code can be really fast and that intrinsics might harm performance.
See the talk: intel-intrinsics: Not intrinsically about intrinsics
- Registered by ponce
- 1.2.8 released 4 years ago
- AuburnSounds/intel-intrinsics
- BSL-1.0
- Auburn Sounds 2016-2018
- Dependencies:
- none