polars_arrow::io

Module ipc

Source
Available on crate feature io_ipc only.
Expand description

APIs to read from and write to Arrow’s IPC format.

Inter-process communication is a method through which different processes share and pass data between them. Its use-cases include parallel processing of chunks of data across different CPU cores, transferring data between different Apache Arrow implementations in other languages and more. Under the hood Apache Arrow uses FlatBuffers as its binary protocol, so every Arrow-centered streaming or serialiation problem that could be solved using FlatBuffers could probably be solved using the more integrated approach that is exposed in this module.

Arrow’s IPC protocol allows only batch or dictionary columns to be passed around due to its reliance on a pre-defined data scheme. This constraint provides a large performance gain because serialized data will always have a known structutre, i.e. the same fields and datatypes, with the only variance being the number of rows and the actual data inside the Batch. This dramatically increases the deserialization rate, as the bytes in the file or stream are already structured “correctly”.

Reading and writing IPC messages is done using one of two variants - either FileReader <-> FileWriter or StreamReader <-> StreamWriter. These two variants wrap a type T that implements Read, and in the case of the File variant it also implements Seek. In practice it means that Files can be arbitrarily accessed while Streams are only read in certain order - the one they were written in (first in, first out).

Modules§

  • A struct adapter of Read+Seek+Write to append to IPC files
  • APIs to read Arrow’s IPC format.
  • APIs to write to Arrow’s IPC format.

Structs§

  • Struct containing dictionary_id and nested IpcField, allowing users to specify the dictionary ids of the IPC fields when writing to IPC.
  • Struct containing fields and whether the file is written in little or big endian.