Crate packed_simd
Portable packed SIMD vectors
This crate is proposed for stabilization as `std::packed_simd` in RFC 2366: `std::simd`.
The examples available in the `examples/` sub-directory of the crate showcase how to use the library in practice.
Introduction
This crate exports `Simd<[T; N]>`: a packed vector of `N` elements of type `T`, as well as many type aliases for this type: for example, `f32x4`, which is just an alias for `Simd<[f32; 4]>`.
The operations on packed vectors are, by default, “vertical”, that is, they are applied to each vector lane in isolation of the others:
```rust
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));
```
Many “horizontal” operations are also provided:
```rust
assert_eq!(a.wrapping_sum(), 10);
```
On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. That is, the most portably-efficient way of performing a reduction over a slice is to collect the results into a vector using vertical operations, then perform a single horizontal operation at the end:
```rust
fn reduce(x: &[i32]) -> i32 {
    assert_eq!(x.len() % 4, 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}

let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
```
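The `reduce` above requires the slice length to be a multiple of four. A variant that handles an arbitrary length can process full 4-element chunks vertically and fold the leftover tail in scalar code. The sketch below is illustrative only: it simulates the 4-lane accumulator with a plain `[i32; 4]` array instead of `i32x4`, so it runs without the crate.

```rust
// Illustrative sketch: a reduction over a slice of any length,
// modeling the i32x4 accumulator with a plain [i32; 4] array.
fn reduce(x: &[i32]) -> i32 {
    let mut sum = [0i32; 4]; // plays the role of i32x4::splat(0)
    let mut chunks = x.chunks_exact(4);
    for chunk in chunks.by_ref() {
        // vertical add, lane by lane
        for lane in 0..4 {
            sum[lane] = sum[lane].wrapping_add(chunk[lane]);
        }
    }
    // single horizontal reduction at the end, plus the scalar tail
    let mut total = sum.iter().copied().fold(0i32, i32::wrapping_add);
    for &v in chunks.remainder() {
        total = total.wrapping_add(v);
    }
    total
}

fn main() {
    assert_eq!(reduce(&[0, 1, 2, 3, 4, 5, 6, 7]), 28);
    assert_eq!(reduce(&[1, 2, 3, 4, 5]), 15); // length not a multiple of 4
}
```

With `packed_simd` itself, the chunked loop would use `i32x4::from_slice_unaligned` for the full chunks and the same scalar loop for the remainder.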
Vector types
The vector type aliases are named according to the following scheme:

```
{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>
```

where the following element types are supported:

- `i{element_width}`: signed integer
- `u{element_width}`: unsigned integer
- `f{element_width}`: float
- `m{element_width}`: mask (see below)
- `*{const,mut} T`: `const` and `mut` pointers
Basic operations
```rust
// Sets all elements to `0`:
let a = i32x4::splat(0);
// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);
// Reads the 4th element of a vector:
assert_eq!(b.extract(3), 1);
// Returns a new vector where the 4th element is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);
// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
```
Conditional operations
One often needs to perform an operation on some lanes of the vector. Vector masks, like `m32x4`, allow selecting on which vector lanes an operation is to be performed:
```rust
let a = i32x4::new(1, 1, 2, 2);
// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));
```
The elements of a vector mask are either `true` or `false`. Here `true` means that a lane is "selected", while `false` means that a lane is not selected.

All vector masks implement a `mask.select(a: T, b: T) -> T` method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of `a` for those lanes for which the mask is `true`, and the elements of `b` otherwise.

The example constructs a mask with the first two lanes set to `true` and the last two lanes set to `false`. This selects the first two lanes of `a + 1` and the last two lanes of `a`, producing a vector where the first two lanes have been incremented by `1`.
note: `mask.select` can be used on vector types that have the same number of lanes as the mask. The example shows this by using `m16x4` instead of `m32x4`. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
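The semantics of `select` can be written out in scalar Rust, modeling a 4-lane mask as `[bool; 4]` and the vectors as plain arrays. This is an illustrative sketch of the behavior described above, not the crate's implementation (which compiles to branch-free blend instructions where available):

```rust
// Illustrative scalar model of mask.select(a, b): each lane is taken
// from `a` where the mask is true, and from `b` otherwise.
fn select(mask: [bool; 4], a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for lane in 0..4 {
        out[lane] = if mask[lane] { a[lane] } else { b[lane] };
    }
    out
}

fn main() {
    let a = [1, 1, 2, 2];
    let incremented = a.map(|v| v + 1); // models `a + 1`
    let m = [true, true, false, false];
    // First two lanes come from `a + 1`, last two from `a`:
    assert_eq!(select(m, incremented, a), [2, 2, 2, 2]);
}
```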
All vertical comparison operations return masks:
```rust
let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);
// ge: >= (greater than or equal to; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));
if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
```
Conversions
- lossless widening conversions: `From`/`Into` are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in `std`).
- safe bitwise conversions: the cargo feature `into_bits` provides the `IntoBits`/`FromBits` traits (`x.into_bits()`). These perform safe bitwise `transmute`s when all bit patterns of the source type are valid bit patterns of the target type, and are also implemented for the architecture-specific vector types of `std::arch`. For example, `let x: u8x8 = m8x8::splat(true).into_bits();` is provided because all `m8x8` bit patterns are valid `u8x8` bit patterns. However, the opposite is not true: not all `u8x8` bit patterns are valid `m8x8` bit patterns, so this operation cannot be performed safely using `x.into_bits()`; one needs to use `unsafe { crate::mem::transmute(x) }` for that, making sure that the value in the `u8x8` is a valid bit pattern of `m8x8`.
- numeric casts (`as`): are performed using `FromCast`/`Cast` (`x.cast()`), just like `as`:
  - casting integer vectors whose lane types have the same size (e.g. `i32xN` -> `u32xN`) is a no-op,
  - casting from a larger integer to a smaller integer (e.g. `u32xN` -> `u8xN`) will truncate,
  - casting from a smaller integer to a larger integer (e.g. `u8xN` -> `u32xN`) will:
    - zero-extend if the source is unsigned, or
    - sign-extend if the source is signed,
  - casting from a float to an integer will round the float towards zero,
  - casting from an integer to a float will produce the floating-point representation of the integer, rounding to nearest, ties to even,
  - casting from an `f32` to an `f64` is perfect and lossless,
  - casting from an `f64` to an `f32` rounds to nearest, ties to even.

  Numeric casts are not very "precise": sometimes lossy, sometimes value preserving, etc.
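Since vector casts behave "just like `as`", the cast rules above can be checked lane-by-lane with scalar `as` casts, which run without the crate:

```rust
fn main() {
    // same-size integer cast: bitwise reinterpretation
    assert_eq!(-1i32 as u32, u32::MAX);
    // larger -> smaller integer: truncation
    assert_eq!(0x1234u32 as u8, 0x34);
    // smaller -> larger: zero-extend (unsigned) vs. sign-extend (signed)
    assert_eq!(0xFFu8 as u32, 255);
    assert_eq!(-1i8 as i32, -1);
    // float -> integer: round towards zero
    assert_eq!(2.9f32 as i32, 2);
    assert_eq!(-2.9f32 as i32, -2);
    // integer -> float: round to nearest, ties to even
    // (2^24 + 1 is not representable in f32, so it rounds to 2^24)
    assert_eq!(16_777_217i32 as f32, 16_777_216.0);
}
```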
Hardware Features
This crate can use different hardware features based on your configured `RUSTFLAGS`. For example, with no configured `RUSTFLAGS`, `u64x8` on x86_64 will use SSE2 operations like `PCMPEQD`. If you configure `RUSTFLAGS='-C target-feature=+avx2,+avx'` on supported x86_64 hardware, the same `u64x8` may use wider AVX2 operations like `VPCMPEQQ`. It is important for performance and for hardware support requirements that you choose an appropriate set of `target-feature` and `target-cpu` options during builds. For more information, see the Performance guide.
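For example, builds might be invoked as follows (a sketch; the exact feature set depends on your target hardware):

```shell
# Enable AVX and AVX2 explicitly via target-feature:
RUSTFLAGS='-C target-feature=+avx,+avx2' cargo build --release

# Or let rustc use every feature of the build machine's CPU:
RUSTFLAGS='-C target-cpu=native' cargo build --release
```

Note that binaries built with `target-cpu=native` may not run on older CPUs lacking those features.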
Re-exports
pub use crate::sealed::Shuffle;
Macros
- Shuffles vector elements.
Structs
- Wrapper over `T` implementing a lexicographical order via the `PartialOrd` and/or `Ord` traits.
- Packed SIMD vector type.
- 8-bit wide mask.
- 16-bit wide mask.
- 32-bit wide mask.
- 64-bit wide mask.
- 128-bit wide mask.
- isize-wide mask.
Traits
- Numeric cast from `Self` to `T`.
- `FromBits` (`into_bits`): safe lossless bitwise conversion from `T` to `Self`.
- Numeric cast from `T` to `Self`.
- `IntoBits` (`into_bits`): safe lossless bitwise conversion from `Self` to `T`.
- This trait is implemented by all mask types.
- Trait implemented by arrays that can be SIMD types.
- This trait is implemented by all SIMD vector types.
Type Definitions
- A vector with 2 `*const T` lanes.
- A vector with 4 `*const T` lanes.
- A vector with 8 `*const T` lanes.
- A 64-bit vector with 2 `f32` lanes.
- A 128-bit vector with 4 `f32` lanes.
- A 256-bit vector with 8 `f32` lanes.
- A 512-bit vector with 16 `f32` lanes.
- A 128-bit vector with 2 `f64` lanes.
- A 256-bit vector with 4 `f64` lanes.
- A 512-bit vector with 8 `f64` lanes.
- A 16-bit vector with 2 `i8` lanes.
- A 32-bit vector with 4 `i8` lanes.
- A 64-bit vector with 8 `i8` lanes.
- A 128-bit vector with 16 `i8` lanes.
- A 256-bit vector with 32 `i8` lanes.
- A 512-bit vector with 64 `i8` lanes.
- A 32-bit vector with 2 `i16` lanes.
- A 64-bit vector with 4 `i16` lanes.
- A 128-bit vector with 8 `i16` lanes.
- A 256-bit vector with 16 `i16` lanes.
- A 512-bit vector with 32 `i16` lanes.
- A 64-bit vector with 2 `i32` lanes.
- A 128-bit vector with 4 `i32` lanes.
- A 256-bit vector with 8 `i32` lanes.
- A 512-bit vector with 16 `i32` lanes.
- A 128-bit vector with 2 `i64` lanes.
- A 256-bit vector with 4 `i64` lanes.
- A 512-bit vector with 8 `i64` lanes.
- A 128-bit vector with 1 `i128` lane.
- A 256-bit vector with 2 `i128` lanes.
- A 512-bit vector with 4 `i128` lanes.
- A vector with 2 `isize` lanes.
- A vector with 4 `isize` lanes.
- A vector with 8 `isize` lanes.
- A 16-bit vector mask with 2 `m8` lanes.
- A 32-bit vector mask with 4 `m8` lanes.
- A 64-bit vector mask with 8 `m8` lanes.
- A 128-bit vector mask with 16 `m8` lanes.
- A 256-bit vector mask with 32 `m8` lanes.
- A 512-bit vector mask with 64 `m8` lanes.
- A 32-bit vector mask with 2 `m16` lanes.
- A 64-bit vector mask with 4 `m16` lanes.
- A 128-bit vector mask with 8 `m16` lanes.
- A 256-bit vector mask with 16 `m16` lanes.
- A 512-bit vector mask with 32 `m16` lanes.
- A 64-bit vector mask with 2 `m32` lanes.
- A 128-bit vector mask with 4 `m32` lanes.
- A 256-bit vector mask with 8 `m32` lanes.
- A 512-bit vector mask with 16 `m32` lanes.
- A 128-bit vector mask with 2 `m64` lanes.
- A 256-bit vector mask with 4 `m64` lanes.
- A 512-bit vector mask with 8 `m64` lanes.
- A 128-bit vector mask with 1 `m128` lane.
- A 256-bit vector mask with 2 `m128` lanes.
- A 512-bit vector mask with 4 `m128` lanes.
- A vector with 2 `*mut T` lanes.
- A vector with 4 `*mut T` lanes.
- A vector with 8 `*mut T` lanes.
- A vector mask with 2 `msize` lanes.
- A vector mask with 4 `msize` lanes.
- A vector mask with 8 `msize` lanes.
- A 16-bit vector with 2 `u8` lanes.
- A 32-bit vector with 4 `u8` lanes.
- A 64-bit vector with 8 `u8` lanes.
- A 128-bit vector with 16 `u8` lanes.
- A 256-bit vector with 32 `u8` lanes.
- A 512-bit vector with 64 `u8` lanes.
- A 32-bit vector with 2 `u16` lanes.
- A 64-bit vector with 4 `u16` lanes.
- A 128-bit vector with 8 `u16` lanes.
- A 256-bit vector with 16 `u16` lanes.
- A 512-bit vector with 32 `u16` lanes.
- A 64-bit vector with 2 `u32` lanes.
- A 128-bit vector with 4 `u32` lanes.
- A 256-bit vector with 8 `u32` lanes.
- A 512-bit vector with 16 `u32` lanes.
- A 128-bit vector with 2 `u64` lanes.
- A 256-bit vector with 4 `u64` lanes.
- A 512-bit vector with 8 `u64` lanes.
- A 128-bit vector with 1 `u128` lane.
- A 256-bit vector with 2 `u128` lanes.
- A 512-bit vector with 4 `u128` lanes.
- A vector with 2 `usize` lanes.
- A vector with 4 `usize` lanes.
- A vector with 8 `usize` lanes.