This crate contains the official Native Rust implementation of Apache Parquet, part of the Apache Arrow project. The crate provides a number of APIs to read and write Parquet files, covering a range of use cases.
Please see the parquet crates.io page for feature flags and tips to improve performance.
§Format Overview
Parquet is a columnar format, which means that unlike row formats like CSV, values are iterated along columns instead of rows. Parquet is similar in spirit to Arrow, but focuses on storage efficiency whereas Arrow prioritizes compute efficiency.
Parquet files are partitioned for scalability. Each file contains metadata, along with zero or more “row groups”, each row group containing one or more columns. The APIs in this crate reflect this structure.
Data in Parquet files is strongly typed and differentiates between logical
and physical types (see schema). In addition, Parquet files may contain
other metadata, such as statistics, which can be used to optimize reading
(see file::metadata).
For more details about the Parquet format itself, see the Parquet spec.
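For illustration, the file / row group / column structure described above is visible when walking a file's metadata. A minimal sketch, assuming a file named `data.parquet` exists in the working directory:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a file and walk the metadata -> row group -> column hierarchy.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    for (i, row_group) in metadata.row_groups().iter().enumerate() {
        println!("row group {}: {} rows", i, row_group.num_rows());
        for column in row_group.columns() {
            println!(
                "  column {}: {} compressed bytes",
                column.column_path(),
                column.compressed_size()
            );
        }
    }
    Ok(())
}
```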
§APIs
This crate exposes a number of APIs for different use-cases.
§Metadata and Schema
The schema module provides APIs to work with Parquet schemas. The
file::metadata module provides APIs to work with Parquet metadata.
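For example, a schema can be parsed from its textual "message type" representation and printed back out. A minimal sketch (the schema shown here is illustrative):

```rust
use parquet::schema::parser::parse_message_type;
use parquet::schema::printer::print_schema;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse a Parquet schema from its textual "message type" representation.
    let message = "
        message example {
            REQUIRED INT32 id;
            OPTIONAL BYTE_ARRAY name (UTF8);
        }
    ";
    let schema = parse_message_type(message)?;

    // Print the parsed schema back out to verify the round trip.
    print_schema(&mut std::io::stdout(), &schema);
    Ok(())
}
```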
§Reading and Writing Arrow (arrow feature)
The arrow module supports reading and writing Parquet data to/from
Arrow RecordBatches. Using Arrow is simple and performant, and allows workloads
to leverage the wide range of data transforms provided by the arrow crate, and by the
ecosystem of Arrow compatible systems.
Most users will use ArrowWriter for writing and ParquetRecordBatchReaderBuilder for
reading.
Lower level APIs include ArrowColumnWriter for writing using multiple
threads, and RowFilter to apply filters during decode.
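A minimal end-to-end sketch of writing and then reading a RecordBatch, assuming the arrow feature is enabled, the arrow-array crate is available as a dependency, and the file name data.parquet is illustrative:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a RecordBatch with a single Int32 column to a Parquet file.
    let col = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef;
    let batch = RecordBatch::try_from_iter([("id", col)])?;
    let file = File::create("data.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read the file back as an iterator of RecordBatches.
    let file = File::open("data.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        println!("Read {} rows", batch?.num_rows());
    }
    Ok(())
}
```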
§async Reading and Writing Arrow (async feature)
The async_reader and async_writer modules provide async APIs to
read and write RecordBatches asynchronously.
Most users will use AsyncArrowWriter for writing and ParquetRecordBatchStreamBuilder
for reading. When the object_store feature is enabled, ParquetObjectReader
provides efficient integration with object storage services such as S3 via the object_store
crate, automatically optimizing IO based on any predicates or projections provided.
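A minimal sketch of reading a file asynchronously, assuming the async feature is enabled and the tokio and futures crates are available as dependencies (the file name is illustrative):

```rust
use futures::TryStreamExt;
use tokio::fs::File;

use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file asynchronously and build a stream of RecordBatches.
    let file = File::open("data.parquet").await?;
    let stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_batch_size(1024)
        .build()?;

    // Collect all batches from the stream.
    let batches = stream.try_collect::<Vec<_>>().await?;
    println!("Read {} batches", batches.len());
    Ok(())
}
```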
§Read/Write Parquet Directly
Workloads needing finer-grained control, or to avoid a dependence on arrow,
can use the APIs in file directly. These APIs are harder to use
as they directly use the underlying Parquet data model, and require knowledge
of the Parquet format, including the details of Dremel record shredding
and Logical Types.
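A minimal sketch of reading a file with the low-level file and record APIs, without going through Arrow (the file name is illustrative):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file with the low-level reader.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Iterate over the file row by row using the record API.
    for row in reader.get_row_iter(None)? {
        println!("{}", row?);
    }
    Ok(())
}
```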
Modules§
- arrow - API for reading/writing Arrow RecordBatches and Arrays to/from Parquet files.
- basic - Contains Rust mappings for the Thrift definition. Refer to the parquet.thrift file to see raw definitions.
- bloom_filter - Bloom filter implementation specific to Parquet, as described in the spec.
- column - Low level column reader and writer APIs.
- data_type - Data types that connect Parquet physical types with their Rust-specific representations.
- errors - Common Parquet errors and macros.
- file - APIs for reading parquet data.
- format - Automatically generated code from the Parquet thrift definition.
- record - Contains record-based API for reading Parquet files.
- schema - Parquet schema definitions and methods to print and parse schema.
- thrift - Custom thrift definitions.
- utf8 - check_valid_utf8 validation function.