This crate contains the official Native Rust implementation of Apache Parquet, part of the Apache Arrow project. The crate provides a number of APIs to read and write Parquet files, covering a range of use cases.
Please see the parquet crates.io page for feature flags and tips to improve performance.
§Format Overview
Parquet is a columnar format, which means that unlike row formats like CSV, values are iterated along columns instead of rows. Parquet is similar in spirit to Arrow, with Parquet focusing on storage efficiency whereas Arrow prioritizes compute efficiency.
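The difference between the two layouts can be sketched with plain Rust collections. This is only an illustration of the idea, not how this crate stores data: the `Row`, `Columns`, and `to_columns` names are hypothetical and use nothing from the parquet API.

```rust
/// A row-oriented record, as a CSV reader might produce it (hypothetical type).
#[derive(Debug, Clone)]
struct Row {
    id: i32,
    name: String,
}

/// The same data laid out column by column (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct Columns {
    ids: Vec<i32>,
    names: Vec<String>,
}

/// Transpose a slice of rows into one vector per column.
fn to_columns(rows: &[Row]) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        cols.ids.push(row.id);
        cols.names.push(row.name.clone());
    }
    cols
}

fn main() {
    let rows = vec![
        Row { id: 1, name: "a".to_string() },
        Row { id: 2, name: "b".to_string() },
        Row { id: 3, name: "c".to_string() },
    ];
    let cols = to_columns(&rows);

    // Aggregating one column reads only that column's contiguous values;
    // the `names` column is never touched.
    let total: i32 = cols.ids.iter().sum();
    println!("sum of ids = {total}, names = {:?}", cols.names);
}
```

Keeping values of the same type next to each other is also what tends to make columnar data compress and encode well.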
Parquet files are partitioned for scalability. Each file contains metadata, along with zero or more “row groups”, each row group containing one or more columns. The APIs in this crate reflect this structure.
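As a rough mental model of that nesting, the sketch below uses hypothetical stand-in types (`FileLayout`, `RowGroup`, `ColumnChunk`). They are assumptions made for illustration and are not the metadata types this crate exposes.

```rust
/// One column's data within a single row group (hypothetical stand-in).
struct ColumnChunk {
    name: String,
    num_values: usize,
}

/// A horizontal partition of the file holding one chunk per column.
struct RowGroup {
    num_rows: usize,
    columns: Vec<ColumnChunk>,
}

/// File-level metadata plus zero or more row groups.
struct FileLayout {
    schema: Vec<String>,
    row_groups: Vec<RowGroup>,
}

impl FileLayout {
    /// Total row count, derived from row-group metadata alone.
    fn total_rows(&self) -> usize {
        self.row_groups.iter().map(|g| g.num_rows).sum()
    }

    /// All chunks belonging to one column, across every row group.
    fn chunks_for(&self, column: &str) -> Vec<&ColumnChunk> {
        self.row_groups
            .iter()
            .flat_map(|g| g.columns.iter())
            .filter(|c| c.name == column)
            .collect()
    }
}

/// Build a small two-row-group layout for demonstration.
fn sample_layout() -> FileLayout {
    let group = |rows: usize| RowGroup {
        num_rows: rows,
        columns: vec![
            ColumnChunk { name: "id".to_string(), num_values: rows },
            ColumnChunk { name: "name".to_string(), num_values: rows },
        ],
    };
    FileLayout {
        schema: vec!["id".to_string(), "name".to_string()],
        row_groups: vec![group(100), group(50)],
    }
}

fn main() {
    let file = sample_layout();
    println!("columns: {:?}, rows: {}", file.schema, file.total_rows());
    for chunk in file.chunks_for("id") {
        println!("chunk of {} with {} values", chunk.name, chunk.num_values);
    }
}
```

Because each row group is self-describing, a reader can answer questions such as the total row count from metadata, or fetch a single column's chunks, without scanning the rest of the file.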
Parquet distinguishes between “logical” and “physical” data types. For instance, strings (logical type) are stored as byte arrays (physical type). Likewise, temporal types like dates, times, timestamps, etc. (logical type) are stored as integers (physical type). This crate exposes both kinds of types.
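The following sketch illustrates the mapping for two logical types: a string stored as its UTF-8 bytes, and a date stored as a 32-bit count of days since the Unix epoch, which is how the Parquet format defines its date type. The `Physical` enum and the helper functions are hypothetical names for illustration only and are not part of this crate's API.

```rust
/// A small subset of physical representations (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Physical {
    Int32(i32),
    ByteArray(Vec<u8>),
}

/// Days between 1970-01-01 and the given proleptic Gregorian date.
/// Uses the well-known "days from civil" arithmetic with March-based years.
fn days_since_epoch(year: i64, month: i64, day: i64) -> i32 {
    // Treat January and February as the last months of the previous year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let month_index = if month > 2 { month - 3 } else { month + 9 };
    let day_of_year = (153 * month_index + 2) / 5 + day - 1;
    let day_of_era =
        year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    (era * 146_097 + day_of_era - 719_468) as i32
}

/// Logical string -> physical byte array (UTF-8 bytes).
fn encode_string(s: &str) -> Physical {
    Physical::ByteArray(s.as_bytes().to_vec())
}

/// Logical date -> physical 32-bit integer (days since the Unix epoch).
fn encode_date(year: i64, month: i64, day: i64) -> Physical {
    Physical::Int32(days_since_epoch(year, month, day))
}

fn main() {
    println!("{:?}", encode_string("hi"));
    println!("{:?}", encode_date(2000, 1, 1));
}
```

The physical value alone is ambiguous (an integer could be a date, a count, or an ID), which is why the schema carries the logical type alongside it.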
For more details about the Parquet format, see the Parquet spec.
§APIs
This crate exposes a number of APIs for different use-cases.
§Read/Write Arrow
The arrow module allows reading and writing Parquet data to/from Arrow RecordBatch.
This makes for a simple and performant interface to parquet data, whilst allowing workloads
to leverage the wide range of data transforms provided by the arrow crate, and by the
ecosystem of libraries and services using Arrow as an interop format.
§Read/Write Arrow Async
When the async feature is enabled, arrow::async_reader and arrow::async_writer
provide the ability to read and write arrow data asynchronously. Additionally, when the
object_store feature is enabled, ParquetObjectReader provides efficient integration with
object storage services such as S3 via the object_store crate, automatically optimizing
IO based on any predicates or projections provided.
§Read/Write Parquet
Workloads that need finer-grained control, or that want to avoid a dependency on arrow,
can use the lower-level APIs in file. These APIs expose the underlying parquet
data model, and therefore require knowledge of the underlying parquet format,
including the details of Dremel record shredding and Logical Types. Most workloads
should prefer the arrow interfaces.
§Modules
- High-level API for reading/writing Arrow RecordBatches and Arrays to/from Parquet Files.
- Contains Rust mappings for Thrift definition. Refer to the parquet.thrift file to see raw definitions.
- Bloom filter implementation specific to Parquet, as described in the spec.
- Low level column reader and writer APIs.
- Data types that connect Parquet physical types with their Rust-specific representations.
- Common Parquet errors and macros.
- Low level APIs for reading raw parquet data.
- Automatically generated code for reading parquet thrift definition.
- Contains record-based API for reading Parquet files.
- Parquet schema definitions and methods to print and parse schema.
- Custom thrift definitions