Single Instruction Multiple Data (SIMD)
Single Instruction Multiple Data (SIMD) is a parallel computing technology where a single instruction operates on multiple values at the same time. It is used to accelerate artificial intelligence (AI), machine learning (ML), high performance graphics, and other data science workloads, enabling efficient data processing and increased computational performance.
The Harlinn.Common.Core library allows developers to create highly optimized SIMD code without the complexity usually associated with using the SIMD intrinsic operations:
using Vector = Math::Vector<float, 4>;
Vector v1( 1.0f, 2.0f, 3.0f, 1.0f );
Vector v2( 5.0f, 3.0f, 3.0f, 2.0f );
Vector v3 = (v2 + Dot(v1,v2)) / Sin(v1);
The above example uses the Fast Linear Algebra Classes for Games and Graphics, which were created using the Traits<T,N> specializations discussed below.
Single Instruction Multiple Data
The interface to SIMD operations in Harlinn.Common.Core is provided in HCCSIMD.h, inside the Harlinn::Common::Core::SIMD namespace, through a set of specializations of the template:
template<typename T, size_t N>
struct Traits : public std::false_type
{
};
The motivation for creating the Traits<T,N> specializations is that they make it possible to implement C++ templates that provide a common implementation usable with multiple data types, such as this multiplication of quaternions:
template<typename T>
    requires IsFloatingPoint<T>
QuaternionSimd<T> QuaternionMultiply( const QuaternionSimd<T>& q1,
                                      const QuaternionSimd<T>& q2 )
{
    using QuaternionSimd = QuaternionSimd<T>;
    using Traits = typename QuaternionSimd::Traits;
    using FloatT = typename Traits::Type;
    using Constants = Constants<FloatT>;

    constexpr typename Traits::SIMDType controlWZYX =
        { { Constants::One, Constants::MinusOne,
            Constants::One, Constants::MinusOne } };
    constexpr typename Traits::SIMDType controlZWXY =
        { { Constants::One, Constants::One,
            Constants::MinusOne, Constants::MinusOne } };
    constexpr typename Traits::SIMDType controlYXWZ =
        { { Constants::MinusOne, Constants::One,
            Constants::One, Constants::MinusOne } };

    auto q2X = Traits::At<0>( q2.simd );
    auto q2Y = Traits::At<1>( q2.simd );
    auto q2Z = Traits::At<2>( q2.simd );
    auto result = Traits::At<3>( q2.simd );
    result = Traits::Mul( result, q1.simd );
    auto q1Swizzle = Traits::Swizzle<0, 1, 2, 3>( q1.simd );
    q2X = Traits::Mul( q2X, q1Swizzle );
    q1Swizzle = Traits::Swizzle<2, 3, 0, 1>( q1Swizzle );
    result = Traits::FMAdd( q2X, controlWZYX, result );
    q2Y = Traits::Mul( q2Y, q1Swizzle );
    q1Swizzle = Traits::Swizzle<0, 1, 2, 3>( q1Swizzle );
    q2Y = Traits::Mul( q2Y, controlZWXY );
    q2Z = Traits::Mul( q2Z, q1Swizzle );
    q2Y = Traits::FMAdd( q2Z, controlYXWZ, q2Y );
    result = Traits::Add( result, q2Y );
    return result;
}
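For reference, a scalar sketch of the Hamilton product that the SIMD sequence above implements, with components stored as (x, y, z, w). Argument-order conventions vary between libraries, so this is given for the plain product a times b:

```cpp
// Scalar reference for the Hamilton product r = a * b, with the
// components stored as (x, y, z, w). Illustration only; the library's
// argument order convention may differ.
struct Quat { float x, y, z, w; };

Quat Multiply( const Quat& a, const Quat& b )
{
    return Quat{
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z };
}
```

As a sanity check, multiplying the unit quaternions i and j yields k, and multiplying by the identity (0, 0, 0, 1) leaves the other operand unchanged.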
The Traits<T,N> specializations provide access to the SIMD intrinsic functions, and strive to maintain a uniform set of names for the implemented functions. The second template argument is sometimes used to select function implementations optimized for N elements in the SIMD type. T is the type of the values to use in the computations, and N is the required number of valid elements in the SIMD type. The SIMD type is a representation of a CPU register used for SIMD operations. The Intel/AMD x64 architecture provides several SIMD types:
- __m128 types: used with the Streaming SIMD Extensions (SSE) and SSE2 instruction intrinsics, and map to the XMM[0-15] registers.
  - __m128: holds four 32-bit floating point values.
  - __m128i: holds sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.
  - __m128d: holds two 64-bit double precision floating point values.
- __m256 types: used with the AVX/AVX2 extensions, and map to the YMM[0-15] registers.
  - __m256: holds eight 32-bit floating point values.
  - __m256i: holds thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values.
  - __m256d: holds four 64-bit double precision floating point values.
- __m512 types: used with the AVX-512 extensions, and map to the ZMM[0-31] registers.
  - __m512: holds sixteen 32-bit floating point values.
  - __m512i: holds sixty-four 8-bit, thirty-two 16-bit, sixteen 32-bit, or eight 64-bit integer values.
  - __m512d: holds eight 64-bit double precision floating point values.
The above data types are not basic C/C++ data types, and have several usage restrictions:
- They can only be used on either side of an assignment, as a return value, or as a parameter; they cannot be used in regular arithmetic expressions such as +, -, *, or /.
- They can be used as objects in aggregates, such as unions and structures, to access their byte elements.
- They can only be operated on using the intrinsic functions implemented by the compiler.
Before doing any operations on SIMD type variables, they must be loaded using special intrinsic functions:
auto v = _mm256_load_ps( ptrToFloats );
where, in this case, ptrToFloats is a 32-byte aligned pointer to 8 single precision floating point values.
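The 32-byte alignment requirement applies to the 256-bit load; the 128-bit SSE variants require 16-byte alignment, and unaligned variants (_mm_loadu_ps, _mm_storeu_ps) exist for pointers whose alignment is unknown. A minimal load/operate/store sketch using the 128-bit variants:

```cpp
#include <immintrin.h>

// Load four floats from a 16-byte aligned pointer, double them, and
// store the result back through another 16-byte aligned pointer.
// _mm_load_ps/_mm_store_ps require 16-byte alignment; the unaligned
// counterparts are _mm_loadu_ps/_mm_storeu_ps.
inline void Double4( const float* alignedIn, float* alignedOut )
{
    __m128 v = _mm_load_ps( alignedIn );   // 16-byte aligned load
    v = _mm_add_ps( v, v );                // element-wise v + v
    _mm_store_ps( alignedOut, v );         // 16-byte aligned store
}
```

On the stack, alignas(16) on the array guarantees the required alignment.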
The Intel Intrinsics Guide
The Intel Intrinsics Guide contains a brief description of each available intrinsic function. It is a bit short on details, but when needed, the relevant details can be found in the Intel 64 and IA-32 Architectures Software Developer Manuals.
SIMD::Traits<T,N>
All the intrinsic functions have unique, C style names, making it awkward to use them directly inside templates. The intrinsic for adding two __m128 variables is called _mm_add_ps:
auto v3 = _mm_add_ps(v1,v2);
_mm_add_ps operates on single precision floating point values, adding the four values held by v1 to the corresponding four values held by v2, and storing the result in v3.
The intrinsic for adding two __m128d variables is called _mm_add_pd:
auto v3 = _mm_add_pd(v1,v2);
_mm_add_pd does the same job as _mm_add_ps, but operates on double precision floating point values, and only two of those fit inside a __m128d variable.
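A small sketch showing _mm_add_pd at work on its two double precision elements (using the unaligned load/store variants for simplicity):

```cpp
#include <immintrin.h>

// _mm_add_pd adds the two double precision values held by each __m128d
// operand; a 128-bit register only has room for two doubles.
inline void AddPd2( const double* a, const double* b, double* out )
{
    __m128d va = _mm_loadu_pd( a );
    __m128d vb = _mm_loadu_pd( b );
    _mm_storeu_pd( out, _mm_add_pd( va, vb ) );
}
```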
The intrinsic that actually performs the same operation as _mm_add_ps for double precision floating point values is called _mm256_add_pd:
auto v3 = _mm256_add_pd(v1,v2);
_mm256_add_pd works with two __m256d variables, each holding four double precision floating point values.
The Traits<T,N> specializations are a practical solution, wrapping the various intrinsic functions and making them accessible using the same name:
using Traits = Traits<float,4>;
auto v3 = Traits::Add(v1,v2);
Traits<float,4> and Traits<double,4> hold the same number of elements:
using Traits = Traits<double,4>;
auto v3 = Traits::Add(v1,v2);
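A hypothetical, minimal sketch of the idea: a specialization that wraps the SSE intrinsics behind the uniform names Load, Store, and Add (the real Traits<float,4> in HCCSIMD.h is far more extensive):

```cpp
#include <immintrin.h>
#include <cstddef>

namespace Sketch
{
    template<typename T, std::size_t N>
    struct Traits;

    // Minimal illustration of a specialization for four floats: each
    // member forwards to the matching SSE intrinsic, so generic code
    // can use the same names for every element type.
    template<>
    struct Traits<float, 4>
    {
        using Type = float;
        using SIMDType = __m128;
        static SIMDType Load( const float* p ) { return _mm_loadu_ps( p ); }
        static void Store( float* p, SIMDType v ) { _mm_storeu_ps( p, v ); }
        static SIMDType Add( SIMDType a, SIMDType b ) { return _mm_add_ps( a, b ); }
    };
}
```

A Traits<double,4> specialization would expose the same names while forwarding to the __m256d intrinsics, which is what lets a single template body serve both types.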
This can be used to implement a template that adds the corresponding elements of two vectors:
template<typename ValueT>
    requires std::is_same_v<ValueT, char> || std::is_same_v<ValueT, Byte> ||
        std::is_same_v<ValueT, SByte> || std::is_same_v<ValueT, Int16> ||
        std::is_same_v<ValueT, UInt16> || std::is_same_v<ValueT, Int32> ||
        std::is_same_v<ValueT, UInt32> || std::is_same_v<ValueT, Int64> ||
        std::is_same_v<ValueT, UInt64> || std::is_same_v<ValueT, float> ||
        std::is_same_v<ValueT, double>
std::vector<ValueT> Add( const std::vector<ValueT>& v1,
                         const std::vector<ValueT>& v2 )
{
    using Limits = SIMD::TraitLimits<ValueT>;
    using Traits = SIMD::Traits<ValueT, Limits::Size>;

    size_t resultSize = std::max( v1.size( ), v2.size( ) );
    size_t operationCount = std::min( v1.size( ), v2.size( ) );
    size_t iterationCount = operationCount / Limits::Size;
    size_t remainingCount = operationCount % Limits::Size;

    std::vector<ValueT> result;
    result.resize( resultSize );
    const auto* p1 = v1.data( );
    const auto* p2 = v2.data( );
    auto* pR = result.data( );
    // Add Limits::Size elements per iteration using SIMD.
    for ( size_t i = 0; i < iterationCount; ++i )
    {
        Traits::Store( pR, Traits::Add( Traits::Load( p1 ), Traits::Load( p2 ) ) );
        p1 += Limits::Size;
        p2 += Limits::Size;
        pR += Limits::Size;
    }
    // Add the remaining elements using scalar arithmetic.
    for ( size_t i = 0; i < remainingCount; ++i )
    {
        *pR = *p1 + *p2;
        p1++;
        p2++;
        pR++;
    }
    // Copy the tail of the longer vector unchanged.
    if ( v1.size( ) > v2.size( ) )
    {
        std::copy( p1, p1 + ( resultSize - operationCount ), pR );
    }
    else if ( v1.size( ) < v2.size( ) )
    {
        std::copy( p2, p2 + ( resultSize - operationCount ), pR );
    }
    return result;
}
The above example works for any of the supported data types, and uses SIMD::TraitLimits<ValueT>::Size to get the optimal value of N for the type ValueT.
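The size handling can be illustrated with a plain scalar sketch, without any SIMD: corresponding elements are added over the common prefix, and the tail of the longer input is copied through unchanged:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Scalar reference for the size handling in the SIMD Add above:
// element-wise addition over the common prefix, then the tail of the
// longer vector is copied into the result unchanged.
std::vector<float> AddScalar( const std::vector<float>& v1,
                              const std::vector<float>& v2 )
{
    const std::size_t common = std::min( v1.size( ), v2.size( ) );
    std::vector<float> result( std::max( v1.size( ), v2.size( ) ) );
    for ( std::size_t i = 0; i < common; ++i )
    {
        result[ i ] = v1[ i ] + v2[ i ];
    }
    const auto& longer = v1.size( ) > v2.size( ) ? v1 : v2;
    std::copy( longer.begin( ) + common, longer.end( ),
               result.begin( ) + common );
    return result;
}
```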
The number of available intrinsic functions can be a bit overwhelming, and the Traits<T,N> specializations group the intrinsic functions according to the type they are designed to work with. The various Traits<T,N> specializations do not support the same set of functions, but share a common subset of operations.
Traits<float,4>::Swizzle<0,1,2,3>( v ) and Traits<double,4>::Swizzle<0,1,2,3>( v ) will both reverse the order of the 4 elements in v. The first invokes _mm_permute_ps, while the second invokes _mm256_permute4x64_pd; not _mm256_permute_pd, which can only swap values inside a 128-bit lane, and not _mm_permute_pd, which operates on the __m128d SIMD type that can only hold two double precision values.
The template arguments mimic the _MM_SHUFFLE macro: the first argument selects the value that goes into the highest position in the resulting SIMD type, the second argument selects the value that goes into the second highest position, the third selects the value that goes into the second lowest position, and the fourth selects the value that goes into the lowest position. 0 selects the value from the lowest position in v, while 3 selects the value from the highest position in v.
There are more than 80 intrinsic functions with permute in their name, and just separating them by the type they operate on makes SIMD development more manageable.
The C/C++ compiler handles register allocation and instruction scheduling, but using the SIMD extensions can be complex:
auto rmm1 = _mm_shuffle_ps( a, a, _MM_SHUFFLE( 3, 0, 2, 1 ) );
auto rmm2 = _mm_shuffle_ps( b, b, _MM_SHUFFLE( 3, 1, 0, 2 ) );
auto rmm3 = _mm_mul_ps( rmm1, b );
auto rmm4 = _mm_mul_ps( rmm1, rmm2 );
auto rmm5 = _mm_shuffle_ps( rmm3, rmm3, _MM_SHUFFLE( 3, 0, 2, 1 ) );
return _mm_sub_ps( rmm4, rmm5 );
The above computes the cross product between two vectors, a and b, each containing three 32-bit floating point values.
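The shuffle sequence can be wrapped in a function and checked against the scalar formula a x b = (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x). This sketch assumes both inputs are stored in four-element arrays so that full 128-bit loads are safe:

```cpp
#include <immintrin.h>

// The shuffle-based cross product from the text, as a function. The
// inputs hold (x, y, z, _); element 3 of the result is not meaningful.
inline void Cross3( const float* a4, const float* b4, float* out4 )
{
    __m128 a = _mm_loadu_ps( a4 );
    __m128 b = _mm_loadu_ps( b4 );
    __m128 rmm1 = _mm_shuffle_ps( a, a, _MM_SHUFFLE( 3, 0, 2, 1 ) ); // a.yzx
    __m128 rmm2 = _mm_shuffle_ps( b, b, _MM_SHUFFLE( 3, 1, 0, 2 ) ); // b.zxy
    __m128 rmm3 = _mm_mul_ps( rmm1, b );                             // a.yzx * b.xyz
    __m128 rmm4 = _mm_mul_ps( rmm1, rmm2 );                          // a.yzx * b.zxy
    __m128 rmm5 = _mm_shuffle_ps( rmm3, rmm3,
                                  _MM_SHUFFLE( 3, 0, 2, 1 ) );       // a.zxy * b.yzx
    _mm_storeu_ps( out4, _mm_sub_ps( rmm4, rmm5 ) );
}
```

For example, the cross product of the x and y unit vectors is the z unit vector.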
Using SIMD::Traits<float,3> reduces the above to:
return SIMD::Traits<float,3>::Cross( a, b );
While:
return SIMD::Traits<float,2>::Cross( a, b );
calculates the cross product for vectors with only two elements, expanding to:
auto rmm1 = _mm_permute_ps( b, _MM_SHUFFLE( 0, 1, 0, 1 ) );
rmm1 = _mm_mul_ps( rmm1, a );
auto rmm2 = _mm_permute_ps( rmm1, _MM_SHUFFLE( 1, 1, 1, 1 ) );
rmm1 = _mm_sub_ss( rmm1, rmm2 );
return _mm_permute_ps( rmm1, _MM_SHUFFLE( 0, 0, 0, 0 ) );
Both operate on a __m128 holding four 32-bit floating point values, but the last one calculates the cross product using only the values contained in the two lowest order positions of the __m128.