The central type in Apache Arrow is the array, a known-length sequence of values
that all have the same type. This crate provides concrete implementations of each array
type, as well as an [`Array`] trait that can be used for type erasure.
Building an Array
Most [`Array`] implementations can be constructed directly from iterators or [`Vec`]:
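# use arrow_array::{Int32Array, ListArray, StringArray};
# use arrow_array::types::Int32Type;
#
Int32Array::from(vec![1, 2]);
Int32Array::from(vec![Some(1), None]);
Int32Array::from_iter([1, 2, 3, 4]);
Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);
StringArray::from(vec!["foo", "bar"]);
StringArray::from(vec![Some("foo"), None]);
StringArray::from_iter([Some("foo"), None]);
StringArray::from_iter_values(["foo", "bar"]);
ListArray::from_iter_primitive::<Int32Type, _, _>([
    Some(vec![Some(1), None, Some(3)]),
    None,
    Some(vec![])
]);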
Additionally, `ArrayBuilder` implementations can be
used to construct arrays with a push-based interface:
# use arrow_array::Int16Array;
#
// Create a new builder with a capacity of 100
let mut builder = Int16Array::builder(100);
// Append a single primitive value
builder.append_value(1);
// Append a null value
builder.append_null();
// Append a slice of primitive values
builder.append_slice(&[2, 3, 4]);
// Build the array
let array = builder.finish();
assert_eq!(5, array.len());
assert_eq!(2, array.value(2));
assert_eq!(&array.values()[3..5], &[3, 4])
Low-level API
Internally, arrays consist of one or more shared memory regions backed by a `Buffer`,
the number and meaning of which depend on the array’s data type, as documented in
the Arrow specification.
For example, the type [`Int16Array`] represents an array of 16-bit integers and consists of:
- An optional `NullBuffer` identifying any null values
- A contiguous `ScalarBuffer<i16>` of values
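To make this layout concrete, here is a simplified model of a nullable 16-bit integer array that uses only the standard library. `ToyInt16Array` and its methods are hypothetical names for illustration and are not part of this crate; the real `NullBuffer` and `ScalarBuffer<i16>` are reference-counted and alignment-aware, which this sketch ignores. It assumes the Arrow convention that the validity bitmap is packed least-significant bit first and that a set bit marks a valid (non-null) slot.

```rust
/// Hypothetical toy model of a nullable i16 array (not the crate's implementation).
struct ToyInt16Array {
    /// Packed validity bitmap, least-significant bit first; `None` means "no nulls".
    validity: Option<Vec<u8>>,
    /// Contiguous values; null slots still occupy one element each.
    values: Vec<i16>,
}

impl ToyInt16Array {
    /// Builds the two buffers from optional values, storing a placeholder 0 in null slots.
    fn from_options(items: &[Option<i16>]) -> Self {
        let mut bitmap = vec![0u8; (items.len() + 7) / 8];
        let mut values = Vec::with_capacity(items.len());
        let mut any_null = false;
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    // Set bit `i` to mark the slot as valid
                    bitmap[i / 8] |= 1u8 << (i % 8);
                    values.push(*v);
                }
                None => {
                    any_null = true;
                    values.push(0);
                }
            }
        }
        // Omit the bitmap entirely when every slot is valid
        let validity = if any_null { Some(bitmap) } else { None };
        Self { validity, values }
    }

    /// Returns true if slot `i` is non-null.
    fn is_valid(&self, i: usize) -> bool {
        match &self.validity {
            Some(bits) => bits[i / 8] & (1u8 << (i % 8)) != 0,
            None => true,
        }
    }

    /// Returns the value at slot `i`, or `None` if the slot is null.
    fn get(&self, i: usize) -> Option<i16> {
        if self.is_valid(i) { Some(self.values[i]) } else { None }
    }
}
```

Because the values stay contiguous regardless of nulls, the position of a value never depends on how many nulls precede it.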
Similarly, the type [`StringArray`] represents an array of UTF-8 strings and consists of:
- An optional `NullBuffer` identifying any null values
- An offsets `OffsetBuffer<i32>` identifying valid UTF-8 sequences within the values buffer
- A values `Buffer` of UTF-8 encoded string data
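The relationship between the offsets and the values buffer can be sketched with a small standard-library model. `ToyStringArray` is a hypothetical name, nulls are left out for brevity, and the real types add reference counting and validation that this sketch does not attempt. The key invariant shown is that `n` strings need `n + 1` offsets, and string `i` occupies the byte range `offsets[i]..offsets[i + 1]`.

```rust
/// Hypothetical toy model of a UTF-8 string array without nulls.
struct ToyStringArray {
    /// One more entry than the number of strings; monotonically non-decreasing.
    offsets: Vec<i32>,
    /// All string bytes laid end to end.
    values: Vec<u8>,
}

impl ToyStringArray {
    /// Concatenates the inputs into one values buffer and records the boundaries.
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { offsets, values }
    }

    /// Number of strings in the array.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns string `i` as a view into the shared values buffer.
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        // The bytes were copied from `&str` inputs, so every range is valid UTF-8
        std::str::from_utf8(&self.values[start..end]).expect("valid UTF-8")
    }
}
```

An empty string simply repeats the previous offset, so it costs no bytes in the values buffer.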
Array constructors such as [`PrimitiveArray::try_new`] provide the ability to cheaply
construct an array from these parts, with functions such as [`PrimitiveArray::into_parts`]
providing the reverse operation.
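# use arrow_array::{Array, Int32Array, StringArray};
# use arrow_buffer::OffsetBuffer;
#
// Create an Int32Array from a Vec without copying
let array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(array.values(), &[1, 2, 3]);
assert_eq!(array.null_count(), 0);
// Create a StringArray from parts
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());
let array = StringArray::new(offsets, b"helloworld".into(), None);
let values: Vec<_> = array.iter().map(|x| x.unwrap()).collect();
assert_eq!(values, &["hello", "world"]);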
As `Buffer`, and its derivatives, can be created from [`Vec`] without copying, this provides
an efficient way not only to interoperate with other Rust code, but also to implement kernels
optimised for the Arrow data layout, for example by operating on whole buffers instead of
individual values.
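As a rough illustration of what "handling buffers instead of values" can mean, the sketch below contrasts two ways of adding a scalar to every element. Both functions are hypothetical and use only the standard library; they are not kernels from this crate. The choice of `wrapping_add` is an assumption made so the example has defined overflow behaviour.

```rust
/// Per-value approach: every slot is inspected for nullness before computing.
fn add_scalar_per_value(items: &[Option<i32>], rhs: i32) -> Vec<Option<i32>> {
    items.iter().map(|v| v.map(|x| x.wrapping_add(rhs))).collect()
}

/// Buffer approach: transform the contiguous values in one branch-free loop.
/// Null slots hold arbitrary values, so computing on them is harmless, and the
/// validity bitmap of the input can be reused unchanged for the output.
fn add_scalar_buffer(values: &[i32], rhs: i32) -> Vec<i32> {
    values.iter().map(|x| x.wrapping_add(rhs)).collect()
}
```

The buffer version never looks at validity, which tends to make the loop easier for the compiler to vectorise; the caller pairs the returned values with the original null information.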
Zero-Copy Slicing
Given an [`Array`] of arbitrary length, it is possible to create an owned slice of this
data. Internally this just increments some ref-counts, and so is incredibly cheap:
# use arrow_array::Int32Array;
let array = Int32Array::from_iter([1, 2, 3]);
// Slice with offset 1 and length 2
let sliced = array.slice(1, 2);
assert_eq!(sliced.values(), &[2, 3]);
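Why slicing is cheap can be sketched with a standard-library model in which an array is a reference-counted buffer plus an offset and a length. `ToySlice` is a hypothetical type for illustration; the crate's own buffers are more elaborate, but the idea of sharing one allocation between slices is the same.

```rust
use std::sync::Arc;

/// Hypothetical toy model of a sliceable array: a shared buffer plus a window.
struct ToySlice {
    buffer: Arc<Vec<i32>>,
    offset: usize,
    len: usize,
}

impl ToySlice {
    /// Takes ownership of the Vec; no element is copied.
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { buffer: Arc::new(values), offset: 0, len }
    }

    /// Returns an owned slice that shares the same allocation.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            // Only the reference count is incremented here
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// The elements visible through this window.
    fn values(&self) -> &[i32] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}
```

Slicing a slice composes the offsets, so the cost stays constant no matter how large the underlying buffer is.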
Downcasting an Array
Arrays are often passed around as a dynamically typed `&dyn Array` or [`ArrayRef`].
For example, `RecordBatch` stores columns as [`ArrayRef`].
Whilst these arrays can be passed directly to the `compute`, `csv`, `json`, etc. APIs,
it is often the case that you wish to interact with the concrete arrays directly.
This requires downcasting to the concrete type of the array:
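# use arrow_array::{Array, Float32Array, Int32Array};
// Safely downcast an `Array` to an `Int32Array` and compute the sum
// using native i32 values
fn sum_int32(array: &dyn Array) -> i32 {
    let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
    integers.iter().map(|val| val.unwrap_or_default()).sum()
}
// Safely downcasts the array to a `Float32Array` and returns a &[f32] view of the data
// Note: the values for positions corresponding to nulls will be arbitrary (but still valid f32)
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_any().downcast_ref::<Float32Array>().unwrap().values()
}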
The [`cast::AsArray`] extension trait can make this more ergonomic:
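# use arrow_array::Array;
# use arrow_array::cast::{AsArray, as_primitive_array};
# use arrow_array::types::Float32Type;
fn as_f32_slice(array: &dyn Array) -> &[f32] {
    array.as_primitive::<Float32Type>().values()
}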
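Downcasting a type-erased array builds on the standard library's `std::any::Any`. The sketch below models that mechanism with hypothetical `ToyArray`, `ToyInt32Array` and `ToyFloat32Array` types, none of which exist in this crate. It also shows a fallible variant: a downcast to the wrong concrete type yields `None`, so propagating it with `?` avoids the panic that an unconditional `unwrap` would cause.

```rust
use std::any::Any;

/// Hypothetical stand-in for a type-erased array trait.
trait ToyArray {
    fn len(&self) -> usize;
    /// Exposes the concrete value as `&dyn Any` so callers can downcast it.
    fn as_any(&self) -> &dyn Any;
}

struct ToyInt32Array(Vec<i32>);
struct ToyFloat32Array(Vec<f32>);

impl ToyArray for ToyInt32Array {
    fn len(&self) -> usize { self.0.len() }
    fn as_any(&self) -> &dyn Any { self }
}

impl ToyArray for ToyFloat32Array {
    fn len(&self) -> usize { self.0.len() }
    fn as_any(&self) -> &dyn Any { self }
}

/// Sums the array if it holds i32 values, or returns `None` for any other type.
fn try_sum_i32(array: &dyn ToyArray) -> Option<i32> {
    let ints = array.as_any().downcast_ref::<ToyInt32Array>()?;
    Some(ints.0.iter().sum())
}
```

Returning `Option` leaves the decision about a type mismatch to the caller, which is often preferable when the array's type comes from external data.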