API for reading/writing Arrow RecordBatches and Arrays to/from
Parquet Files.
See the crate-level documentation for more details on other APIs.
§Schema Conversion
These APIs ensure that data in Arrow RecordBatches written to Parquet are
read back as RecordBatches with the exact same types and values.
Parquet and Arrow have different type systems, and there is not
always a one to one mapping between the systems. For example, data
stored as a Parquet BYTE_ARRAY can be read as either an Arrow
BinaryViewArray or BinaryArray.
To recover the original Arrow types, the writers in this module add a “hint” to
the metadata in the ARROW_SCHEMA_META_KEY key which records the original Arrow
schema. The metadata hint follows the same convention as arrow-cpp based
implementations such as pyarrow. The reader looks for the schema hint in the
metadata to determine Arrow types, and if it is not present, infers the Arrow schema
from the Parquet schema.
In situations where the embedded Arrow schema is not compatible with the Parquet schema, the Parquet schema takes precedence and no error is raised. See #1663.
You can also control the type conversion process in more detail using:
- ArrowSchemaConverter to control the conversion of Arrow types to Parquet types.
- ArrowReaderOptions::with_schema to explicitly specify your own Arrow schema hint to use when reading Parquet, overriding any metadata that may be present.
§Example: Writing Arrow RecordBatch to Parquet file
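 use std::sync::Arc;
 use arrow_array::{ArrayRef, Int32Array, RecordBatch};
 use parquet::arrow::ArrowWriter;
 use parquet::basic::Compression;
 use parquet::file::properties::WriterProperties;
 use tempfile::tempfile;

 let ids = Int32Array::from(vec![1, 2, 3, 4]);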
 let vals = Int32Array::from(vec![5, 6, 7, 8]);
 let batch = RecordBatch::try_from_iter(vec![
   ("id", Arc::new(ids) as ArrayRef),
   ("val", Arc::new(vals) as ArrayRef),
 ]).unwrap();
 let file = tempfile().unwrap();
 // WriterProperties can be used to set Parquet file options
 let props = WriterProperties::builder()
     .set_compression(Compression::SNAPPY)
     .build();
 let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();
 writer.write(&batch).expect("Writing batch");
 // writer must be closed to write footer
 writer.close().unwrap();
§Example: Reading Parquet file into Arrow RecordBatch
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("Converted arrow schema is: {}", builder.schema());
let mut reader = builder.build().unwrap();
let record_batch = reader.next().unwrap().unwrap();
println!("Read {} records.", record_batch.num_rows());
§Example: Reading non-uniformly encrypted Parquet file into Arrow RecordBatch
Note: This requires the experimental encryption feature to be enabled at compile time.
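 use std::fs::File;
 use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
 use parquet::encryption::decrypt::FileDecryptionProperties;

 let file = File::open(path).unwrap();
 // Define the AES encryption keys required for decrypting the footer metadata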
 // and column-specific data. If only a footer key is used then it is assumed that the
 // file uses uniform encryption and all columns are encrypted with the footer key.
 // If any column keys are specified, other columns without a key provided are assumed
 // to be unencrypted
 let footer_key = "0123456789012345".as_bytes(); // Keys are 128 bits (16 bytes)
 let column_1_key = "1234567890123450".as_bytes();
 let column_2_key = "1234567890123451".as_bytes();
 let decryption_properties = FileDecryptionProperties::builder(footer_key.to_vec())
     .with_column_key("double_field", column_1_key.to_vec())
     .with_column_key("float_field", column_2_key.to_vec())
     .build()
     .unwrap();
 let options = ArrowReaderOptions::default()
  .with_file_decryption_properties(decryption_properties);
 let reader_metadata = ArrowReaderMetadata::load(&file, options.clone()).unwrap();
 let file_metadata = reader_metadata.metadata().file_metadata();
 assert_eq!(50, file_metadata.num_rows());
 let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
   .unwrap()
   .build()
   .unwrap();
 let record_batch = reader.next().unwrap().unwrap();
 assert_eq!(50, record_batch.num_rows());
Re-exports§
- pub use self::arrow_writer::ArrowWriter;
- pub use self::async_reader::ParquetRecordBatchStreamBuilder;
- pub use self::async_writer::AsyncArrowWriter;
Modules§
- arrow_reader: Contains reader which reads Parquet data into Arrow RecordBatch
- arrow_writer: Contains writer which writes Arrow data into Parquet data.
- async_reader: async API for reading Parquet files as RecordBatches
- async_writer: async API for writing RecordBatches to Parquet files
Structs§
- ArrowSchemaConverter: Converter for Arrow schema to Parquet schema
- FieldLevels: Schema information necessary to decode a Parquet file as Arrow Fields
- ProjectionMask: A ProjectionMask identifies a set of columns within a potentially nested schema to project
Constants§
- ARROW_SCHEMA_META_KEY: Schema metadata key used to store serialized Arrow schema
- PARQUET_FIELD_ID_META_KEY: The value of this metadata key, if present on Field::metadata, will be used to populate BasicTypeInfo::id
Functions§
- add_encoded_arrow_schema_to_metadata: Mutates writer metadata by storing the encoded Arrow schema hint in ARROW_SCHEMA_META_KEY.
- arrow_to_parquet_schema (Deprecated): Convert Arrow schema to Parquet schema
- encode_arrow_schema: Encodes the Arrow schema into the IPC format, and base64 encodes it
- parquet_column: Looks up the Parquet column by name
- parquet_to_arrow_field_levels: Convert a Parquet SchemaDescriptor to FieldLevels
- parquet_to_arrow_schema: Convert Parquet schema to Arrow schema including optional metadata
- parquet_to_arrow_schema_by_columns: Convert Parquet schema to Arrow schema including optional metadata, only preserving some leaf columns.