# simd 0.1.1

`simd` offers limited cross-platform access to SIMD instructions on CPUs, as well as raw interfaces to platform-specific instructions.
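
As a rough illustration of what a 4-lane SIMD operation does compared with scalar code, the sketch below applies `y = a * x + y` once element by element and once four elements at a time. `F32x4`, `axpy_naive`, and `axpy_lanes` are hypothetical names made up for this example; `F32x4` is a plain-Rust stand-in built on `[f32; 4]` and is not this crate's vector type or API. It only models the lane-wise semantics and does not by itself guarantee that SIMD instructions are emitted.

```rust
use std::ops::{Add, Mul};

/// Hypothetical stand-in for a 4-lane f32 vector (NOT the `simd` crate's type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

impl F32x4 {
    /// Broadcast one scalar into all four lanes.
    pub fn splat(x: f32) -> Self {
        F32x4([x; 4])
    }
}

impl Add for F32x4 {
    type Output = F32x4;
    /// Lane-wise addition: lane i of the result is a[i] + b[i].
    fn add(self, rhs: F32x4) -> F32x4 {
        let (a, b) = (self.0, rhs.0);
        F32x4([a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]])
    }
}

impl Mul for F32x4 {
    type Output = F32x4;
    /// Lane-wise multiplication: lane i of the result is a[i] * b[i].
    fn mul(self, rhs: F32x4) -> F32x4 {
        let (a, b) = (self.0, rhs.0);
        F32x4([a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]])
    }
}

/// Scalar baseline: y[i] = a * x[i] + y[i], one element per step.
pub fn axpy_naive(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = a * *xi + *yi;
    }
}

/// Lane-wise variant: four elements per step, scalar loop for the remainder.
pub fn axpy_lanes(a: f32, x: &[f32], y: &mut [f32]) {
    let n = x.len().min(y.len());
    let main = n - n % 4; // largest multiple of 4 that fits
    let va = F32x4::splat(a);
    for i in (0..main).step_by(4) {
        let vx = F32x4([x[i], x[i + 1], x[i + 2], x[i + 3]]);
        let vy = F32x4([y[i], y[i + 1], y[i + 2], y[i + 3]]);
        let r = va * vx + vy;
        y[i..i + 4].copy_from_slice(&r.0);
    }
    // Leftover elements that do not fill a whole 4-lane vector.
    for i in main..n {
        y[i] = a * x[i] + y[i];
    }
}
```

Both functions are meant to produce the same values for the same inputs; the lane-wise version simply groups the work into blocks of four, which is the shape of computation that SIMD hardware can execute in a single instruction.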

## x86-64

nbody 4.86 6.48 5.43 5.40
spectral-norm 2.51 5.04 2.56 2.51
spectral-norm (5500) 3.04 6.07 3.10 3.06
fannkuch-redux 1.47 3.27 1.54 1.14
fannkuch-redux (12) 19.3 43.9 20.1 14.8

4x4 f32 matrix:
inverse 1872 4629
multiply 876 2654
transpose 291 770

test mandel_naive ... bench:     948,074 ns/iter (+/- 2,595)
test mandel_simd4 ... bench:     295,489 ns/iter (+/- 1,388)

## aarch64

test inverse_naive   ... bench:       7,349 ns/iter (+/- 755)
test inverse_simd4   ... bench:       2,215 ns/iter (+/- 221)
test multiply_naive  ... bench:       3,375 ns/iter (+/- 98)
test multiply_simd4  ... bench:       1,233 ns/iter (+/- 7)
test transpose_naive ... bench:         504 ns/iter (+/- 132)
test transpose_simd4 ... bench:         300 ns/iter (+/- 8)

test mandel_naive ... bench:   3,116,045 ns/iter (+/- 12,387)
test mandel_simd4 ... bench:     953,249 ns/iter (+/- 3,220)

fannkuch-redux 5.11 9.45
fannkuch-redux (12) 65.3 130
spectral-norm 7.85 7.91
spectral-norm (5500) 26.3 26.6
nbody 29.5 61.8

## arm

test inverse_naive   ... bench:      13,788 ns/iter (+/- 4,540)
test inverse_simd4   ... bench:       6,906 ns/iter (+/- 1,027)
test multiply_naive  ... bench:       8,178 ns/iter (+/- 11,187)
test multiply_simd4  ... bench:       2,033 ns/iter (+/- 22)
test transpose_naive ... bench:       1,023 ns/iter (+/- 24)
test transpose_simd4 ... bench:         545 ns/iter (+/- 742)

test mandel_naive ... bench:   2,639,999 ns/iter (+/- 7,036)
test mandel_simd4 ... bench:     885,036 ns/iter (+/- 2,876)

fannkuch-redux 3.36 4.50
fannkuch-redux (12) 42.8 70