matrixmultiply 0.1.7


General matrix multiplication for f32, f64 matrices.

Allows arbitrary row, column strided matrices.

Uses the same microkernel algorithm as BLIS_, but in a much simpler
and less featureful implementation.
See their multithreading_ page for a very good diagram over how
the algorithm partitions the matrix (*Note:* this crate does not implement
multithreading).

.. _BLIS: https://github.com/flame/blis

.. _multithreading: https://github.com/flame/blis/wiki/Multithreading

Please read the `API documentation here`__

__ https://bluss.github.io/matrixmultiply/

|build_status|_ |crates|_

.. |build_status| image:: https://travis-ci.org/bluss/matrixmultiply.svg?branch=master
.. _build_status: https://travis-ci.org/bluss/matrixmultiply

.. |crates| image:: https://meritbadge.herokuapp.com/matrixmultiply
.. _crates: https://crates.io/crates/matrixmultiply

**NOTE: Compile this crate using** ``RUSTFLAGS="-C target-cpu=native"`` **so
that the compiler can produce the best output.**

Recent Changes
--------------

- 0.1.7

  - Reduce overhead slightly for small matrix multiplication problems by using
    only one allocation call for both packing buffers.

- 0.1.6

  - Disable manual loop unrolling in debug mode (quicker debug builds)

- 0.1.5

  - Update sgemm to use a 4x8 microkernel (“still in simplistic rust”),
    which improves throughput by 10%.

- 0.1.4

  - Prepare support for aligned packed buffers
  - Update dgemm to use a 8x4 microkernel, still in simplistic rust,
    which improves throughput by 10-20% when using AVX.

- 0.1.3

  - Silence some debug prints

- 0.1.2

  - Major performance improvement for sgemm and dgemm (20-30% when using AVX).
    Since it all depends on what the optimizer does, I'd love to get
    issue reports that report good or bad performance.
  - Made the kernel masking generic, which is a cleaner design

- 0.1.1

  - Minor improvement in the kernel