Crate arrow

A complete, safe, native Rust implementation of Apache Arrow, a cross-language development platform for in-memory data.

Please see the arrow crates.io page for feature flags and tips to improve performance.

§Columnar Format

The array module provides statically typed implementations of all the array types as defined by the Arrow Columnar Format.

For example, an Int32Array represents a nullable array of i32:

let array = Int32Array::from(vec![Some(1), None, Some(3)]);
assert_eq!(array.len(), 3);
assert_eq!(array.value(0), 1);
assert_eq!(array.is_null(1), true);

let collected: Vec<_> = array.iter().collect();
assert_eq!(collected, vec![Some(1), None, Some(3)]);
assert_eq!(array.values(), &[1, 0, 3])
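
The last assertion above shows that a null slot still occupies a position in the values buffer. The following std-only sketch models why: a nullable array can be stored as a dense values buffer plus a validity bitmap with one bit per slot. The NullableI32 type and its methods are hypothetical names for illustration, not the crate's implementation, and storing 0 in null slots is an assumption of this sketch.

```rust
/// Hypothetical, std-only model of a nullable i32 array: a dense values
/// buffer plus a validity bitmap (bit set = slot holds a valid value).
struct NullableI32 {
    values: Vec<i32>,
    validity: Vec<u8>, // one bit per slot, least-significant bit first
}

impl NullableI32 {
    fn from_options(items: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1u8 << (i % 8);
                }
                // A null slot still takes up space; this model stores 0 there.
                None => values.push(0),
            }
        }
        Self { values, validity }
    }

    fn len(&self) -> usize {
        self.values.len()
    }

    fn is_null(&self, i: usize) -> bool {
        assert!(i < self.len(), "index out of bounds");
        (self.validity[i / 8] & (1u8 << (i % 8))) == 0
    }

    /// Returns None for null slots, regardless of what the buffer holds.
    fn get(&self, i: usize) -> Option<i32> {
        if self.is_null(i) {
            None
        } else {
            Some(self.values[i])
        }
    }
}
```

Keeping validity separate from the values means the values buffer stays a plain contiguous slice, at the cost of a placeholder in every null slot.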

It is also possible to write generic code. For example, the following is generic over all primitively typed arrays:

fn sum<T>(array: &PrimitiveArray<T>) -> T::Native
where
    T: ArrowPrimitiveType,
    T::Native: Sum
{
    array.iter().map(|v| v.unwrap_or_default()).sum()
}

assert_eq!(sum(&Float32Array::from(vec![1.1, 2.9, 3.])), 7.);
assert_eq!(sum(&TimestampNanosecondArray::from(vec![1, 2, 3])), 6);

And the following is generic over all arrays with comparable values:

fn min<T: ArrayAccessor>(array: T) -> Option<T::Item>
where
    T::Item: Ord
{
    ArrayIter::new(array).flatten().min()
}

assert_eq!(min(&Int32Array::from(vec![4, 2, 1, 6])), Some(1));
assert_eq!(min(&StringArray::from(vec!["b", "a", "c"])), Some("a"));

For more examples and details, consult the arrow_array docs.

§Type Erasure / Trait Objects

Code often needs to handle any type of array without knowing its concrete type. This use case is served by the combination of Array and DataType: the former provides a type-erased container for the array, and the latter identifies its concrete type.

fn impl_string(array: &StringArray) {}
fn impl_f32(array: &Float32Array) {}

fn impl_dyn(array: &dyn Array) {
    match array.data_type() {
        DataType::Utf8 => impl_string(array.as_any().downcast_ref().unwrap()),
        DataType::Float32 => impl_f32(array.as_any().downcast_ref().unwrap()),
        _ => unimplemented!()
    }
}
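
The pattern above, dispatching on a type tag and then downcasting, rests on std::any::Any. The following std-only sketch shows the mechanism in isolation. Column, Kind, Strings and Floats are hypothetical stand-ins for Array, DataType and the concrete array types; none of them exist in the crate.

```rust
use std::any::Any;

/// Hypothetical stand-ins for concrete array types.
struct Strings(Vec<String>);
struct Floats(Vec<f32>);

/// Tag identifying the concrete type behind a trait object.
#[derive(Debug, PartialEq)]
enum Kind {
    Utf8,
    Float32,
}

/// Minimal type-erased interface: a tag plus access to `dyn Any`.
trait Column {
    fn kind(&self) -> Kind;
    fn as_any(&self) -> &dyn Any;
}

impl Column for Strings {
    fn kind(&self) -> Kind {
        Kind::Utf8
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Column for Floats {
    fn kind(&self) -> Kind {
        Kind::Float32
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Dispatch on the tag, then downcast to recover the concrete type.
fn describe(column: &dyn Column) -> String {
    match column.kind() {
        Kind::Utf8 => {
            let s = column
                .as_any()
                .downcast_ref::<Strings>()
                .expect("tag says Utf8");
            format!("{} strings", s.0.len())
        }
        Kind::Float32 => {
            let f = column
                .as_any()
                .downcast_ref::<Floats>()
                .expect("tag says Float32");
            format!("sum = {}", f.0.iter().sum::<f32>())
        }
    }
}
```

downcast_ref returns None when the requested type does not match, so the unwrap or expect is only safe when the tag and the concrete type are guaranteed to agree.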

To facilitate downcasting, the AsArray extension trait can be used:

fn impl_string(array: &StringArray) {}
fn impl_f32(array: &Float32Array) {}

fn impl_dyn(array: &dyn Array) {
    match array.data_type() {
        DataType::Utf8 => impl_string(array.as_string()),
        DataType::Float32 => impl_f32(array.as_primitive()),
        _ => unimplemented!()
    }
}

It is also common to write a function that returns one of several possible array implementations. ArrayRef is a type alias for Arc<dyn Array>, which is frequently used for this purpose:

fn parse_to_primitive<'a, T, I>(iter: I) -> PrimitiveArray<T>
where
    T: ArrowPrimitiveType,
    T::Native: FromStr,
    I: IntoIterator<Item=&'a str>,
{
    PrimitiveArray::from_iter(iter.into_iter().map(|val| T::Native::from_str(val).ok()))
}

fn parse_strings<'a, I>(iter: I, to_data_type: DataType) -> ArrayRef
where
    I: IntoIterator<Item=&'a str>,
{
   match to_data_type {
       DataType::Int32 => Arc::new(parse_to_primitive::<Int32Type, _>(iter)) as _,
       DataType::UInt32 => Arc::new(parse_to_primitive::<UInt32Type, _>(iter)) as _,
       _ => unimplemented!()
   }
}

let array = parse_strings(["1", "2", "3"], DataType::Int32);
let integers = array.as_any().downcast_ref::<Int32Array>().unwrap();
assert_eq!(integers.values(), &[1, 2, 3])

§Compute Kernels

The compute module provides optimised implementations of many common operations. For example, the parse_strings operation above could also be implemented as follows:

fn parse_strings<'a, I>(iter: I, to_data_type: &DataType) -> Result<ArrayRef>
where
    I: IntoIterator<Item=&'a str>,
{
    let array = StringArray::from_iter(iter.into_iter().map(Some));
    arrow::compute::cast(&array, to_data_type)
}

let array = parse_strings(["1", "2", "3"], &DataType::UInt32).unwrap();
let integers = array.as_any().downcast_ref::<UInt32Array>().unwrap();
assert_eq!(integers.values(), &[1, 2, 3])

This module also implements many common vertical operations:

let array = Int32Array::from_iter(0..100);
let predicate = gt_scalar(&array, 60).unwrap();
let filtered = filter(&array, &predicate).unwrap();

let expected = Int32Array::from_iter(61..100);
assert_eq!(&expected, filtered.as_primitive::<Int32Type>());

As well as some horizontal operations, such as aggregations that reduce an array to a single value.
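
To make the distinction concrete, the following std-only sketch models both kinds of operation on plain slices. A vertical operation works row by row and yields one result per input row (or a subset of rows), while a horizontal operation collapses a whole column into one value. The function names gt_mask, filter_rows and sum_non_null are hypothetical and are not the crate's kernels; returning None for an all-null input is an assumption of this sketch.

```rust
/// Vertical: compare each element with a scalar, yielding one flag per row.
fn gt_mask(values: &[i32], threshold: i32) -> Vec<bool> {
    values.iter().map(|v| *v > threshold).collect()
}

/// Vertical: keep only the rows whose mask entry is true.
fn filter_rows(values: &[i32], mask: &[bool]) -> Vec<i32> {
    assert_eq!(values.len(), mask.len(), "mask must match input length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

/// Horizontal: reduce a column to a single value, skipping nulls.
/// Returns None when there are no non-null values to add up.
fn sum_non_null(values: &[Option<i32>]) -> Option<i32> {
    values.iter().flatten().copied().reduce(|a, b| a + b)
}
```

Chaining gt_mask and filter_rows over 0..100 with a threshold of 60 keeps 61 through 99, mirroring the gt_scalar and filter example above.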

§Tabular Representation

It is common to want to group one or more columns together into a tabular representation. This is provided by RecordBatch which combines a Schema and a corresponding list of ArrayRef.

let col_1 = Arc::new(Int32Array::from_iter([1, 2, 3])) as _;
let col_2 = Arc::new(Float32Array::from_iter([1., 6.3, 4.])) as _;

let batch = RecordBatch::try_from_iter([("col_1", col_1), ("col_2", col_2)]).unwrap();
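
A tabular batch only makes sense if every column has the same number of rows. The following std-only sketch models that invariant with a fallible constructor. Batch, Column and try_from_columns are hypothetical names chosen for this illustration and do not reflect how RecordBatch is implemented.

```rust
/// Hypothetical column type for this sketch: a typed vector of values.
#[derive(Debug)]
enum Column {
    Int32(Vec<i32>),
    Float32(Vec<f32>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int32(v) => v.len(),
            Column::Float32(v) => v.len(),
        }
    }

    fn type_name(&self) -> &'static str {
        match self {
            Column::Int32(_) => "Int32",
            Column::Float32(_) => "Float32",
        }
    }
}

/// A schema (field names and type names) plus equally long columns.
#[derive(Debug)]
struct Batch {
    schema: Vec<(String, &'static str)>,
    columns: Vec<Column>,
    num_rows: usize,
}

impl Batch {
    /// Builds a batch, rejecting columns whose lengths disagree.
    fn try_from_columns(named: Vec<(&str, Column)>) -> Result<Batch, String> {
        // The first column fixes the row count; an empty batch has 0 rows.
        let num_rows = named.first().map_or(0, |(_, c)| c.len());
        let mut schema = Vec::new();
        let mut columns = Vec::new();
        for (name, column) in named {
            if column.len() != num_rows {
                return Err(format!(
                    "column '{}' has {} rows, expected {}",
                    name,
                    column.len(),
                    num_rows
                ));
            }
            schema.push((name.to_string(), column.type_name()));
            columns.push(column);
        }
        Ok(Batch { schema, columns, num_rows })
    }
}
```

Deriving the schema from the columns, as this sketch does, keeps the two in step by construction. A constructor that accepts a separate schema would also need to check each column's type against its field.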

§IO

This crate provides readers and writers that convert various formats to and from RecordBatch, including CSV, IPC and JSON via the arrow-csv, arrow-ipc and arrow-json sub-crates.

Parquet is published as a separate crate.

§Serde Compatibility

arrow_json::reader::Decoder provides a mechanism to convert arbitrary, serde-compatible structures into RecordBatch.

Whilst likely less performant than implementing a custom builder, as described in arrow_array::builder, this provides a simple mechanism to get up and running quickly:

#[derive(Serialize)]
struct MyStruct {
    int32: i32,
    string: String,
}

let schema = Schema::new(vec![
    Field::new("int32", DataType::Int32, false),
    Field::new("string", DataType::Utf8, false),
]);

let rows = vec![
    MyStruct{ int32: 5, string: "bar".to_string() },
    MyStruct{ int32: 8, string: "foo".to_string() },
];

let mut decoder = ReaderBuilder::new(Arc::new(schema)).build_decoder().unwrap();
decoder.serialize(&rows).unwrap();

let batch = decoder.flush().unwrap().unwrap();

// Expect batch containing two columns
let int32 = batch.column(0).as_primitive::<Int32Type>();
assert_eq!(int32.values(), &[5, 8]);

let string = batch.column(1).as_string::<i32>();
assert_eq!(string.value(0), "bar");
assert_eq!(string.value(1), "foo");
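
Conceptually, the conversion above pivots row-oriented structs into one column per field, which is also the core of what a hand-written builder would do. The following std-only sketch shows that pivot for the same two fields. Row, Columns and pivot are hypothetical names, and the sketch says nothing about how arrow_json performs the conversion internally.

```rust
/// Hypothetical row type mirroring the struct in the example above.
struct Row {
    int32: i32,
    string: String,
}

/// Column-oriented result: one vector per field.
#[derive(Debug, Default, PartialEq)]
struct Columns {
    int32: Vec<i32>,
    string: Vec<String>,
}

/// Pivots a slice of rows into per-field columns, preserving row order.
fn pivot(rows: &[Row]) -> Columns {
    let mut columns = Columns::default();
    for row in rows {
        columns.int32.push(row.int32);
        columns.string.push(row.string.clone());
    }
    columns
}
```

A custom builder avoids the intermediate serialization step by appending each field directly to its column, which is why it is likely to be faster than going through serde.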

§Crate Topology

The arrow project is implemented as multiple sub-crates, which are then re-exported by this top-level crate.

Crate authors can choose to depend on this top-level crate, or just the sub-crates they need.

The current list of sub-crates is:

  • arrow-arith - arithmetic kernels
  • arrow-array - type-safe arrow array abstractions
  • arrow-buffer - buffer abstractions for arrow arrays
  • arrow-cast - cast kernels for arrow arrays
  • arrow-csv - read/write CSV to arrow format
  • arrow-data - the underlying data of arrow arrays
  • arrow-ipc - read/write IPC to arrow format
  • arrow-json - read/write JSON to arrow format
  • arrow-ord - ordering kernels for arrow arrays
  • arrow-row - comparable row format
  • arrow-schema - the logical types for arrow arrays
  • arrow-select - selection kernels for arrow arrays
  • arrow-string - string kernels for arrow arrays

Some functionality is also distributed independently of this crate, such as Parquet support, which is published as a separate crate.

§Safety and Security

Like many crates, this crate makes use of unsafe where prudent. However, it endeavours to be sound. Specifically, it should not be possible to trigger undefined behaviour using safe APIs.

If you think you have found an instance where this is possible, please file a ticket in our issue tracker and it will be triaged and fixed. For more information on arrow’s use of unsafe, see here.

§Higher-level Processing

This crate aims to provide reusable, low-level primitives for operating on columnar data. For more sophisticated query processing workloads, consider DataFusion, which orchestrates the primitives exported by this crate into an embeddable query engine with SQL and DataFrame frontends, and which heavily influences this crate’s roadmap.
