io_ipc
only.Expand description
APIs to read from and write to Arrow’s IPC format.
Inter-process communication is a method through which different processes share and pass data between them. Its use-cases include parallel processing of chunks of data across different CPU cores, transferring data between different Apache Arrow implementations in other languages and more. Under the hood Apache Arrow uses FlatBuffers as its binary protocol, so every Arrow-centered streaming or serialiation problem that could be solved using FlatBuffers could probably be solved using the more integrated approach that is exposed in this module.
Arrow’s IPC protocol allows only batch or dictionary columns to be passed around due to its reliance on a pre-defined data scheme. This constraint provides a large performance gain because serialized data will always have a known structutre, i.e. the same fields and datatypes, with the only variance being the number of rows and the actual data inside the Batch. This dramatically increases the deserialization rate, as the bytes in the file or stream are already structured “correctly”.
Reading and writing IPC messages is done using one of two variants - either
FileReader
<-> FileWriter
or
StreamReader
<-> StreamWriter
.
These two variants wrap a type T
that implements Read
, and in
the case of the File
variant it also implements Seek
. In
practice it means that File
s can be arbitrarily accessed while Stream
s are only
read in certain order - the one they were written in (first in, first out).
Modules§
- A struct adapter of Read+Seek+Write to append to IPC files
- APIs to read Arrow’s IPC format.
- APIs to write to Arrow’s IPC format.
Structs§
- Struct containing
dictionary_id
and nestedIpcField
, allowing users to specify the dictionary ids of the IPC fields when writing to IPC. - Struct containing fields and whether the file is written in little or big endian.