pub struct StatisticsConverter<'a> { /* private fields */ }

Extracts Parquet statistics as Arrow arrays

This is used to convert Parquet statistics to Arrow ArrayRef, with proper type conversions. This information can be used for pruning Parquet files, row groups, and data pages based on the statistics embedded in Parquet metadata.

§Schemas

The converter uses the schema of the Parquet file and the Arrow schema to convert the underlying statistics value (stored as a parquet value) into the corresponding Arrow value. For example, Decimals are stored as binary in parquet files and this structure handles mapping them to the i128 representation used in Arrow.
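As a rough illustration of the decimal mapping mentioned above, the sketch below sign-extends a big-endian two's-complement byte slice (the layout Parquet uses for byte-array decimals) into an i128. The helper name decimal_bytes_to_i128 is hypothetical and uses only the standard library; it is not the converter's own code.

```rust
/// Hypothetical helper (not part of the StatisticsConverter API).
/// Sign-extends a big-endian two's-complement byte slice of 1 to 16 bytes
/// into an i128. Returns None for empty or over-long input.
fn decimal_bytes_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // A set top bit in the first byte means the value is negative,
    // so the padding bytes must be 0xFF rather than 0x00.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    // Right-align the input so its last byte is the least significant byte.
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

The result is the unscaled integer value; the decimal's scale comes from the schema, not from the bytes.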

Note: The Parquet schema and Arrow schema do not have to be identical (for example, the columns may be in different orders, and either schema may have additional columns). The function parquet_column is used to match the column in the Parquet schema to the column in the Arrow schema.

Implementations§


impl<'a> StatisticsConverter<'a>


pub fn parquet_column_index(&self) -> Option<usize>

Return the index of the column in the Parquet schema, if any

Returns None if the column is present in the Arrow schema but not in the Parquet file


pub fn arrow_field(&self) -> &'a Field

Return the Arrow schema's Field for the column


pub fn with_missing_null_counts_as_zero( self, missing_null_counts_as_zero: bool, ) -> Self

Set whether the statistics converter treats missing null counts as zero (true) or as missing (false)

By default, the converter will treat missing null counts as though the null count is known to be 0.

Note that parquet files written by parquet-rs currently do not store null counts even when it is known there are zero nulls, and the reader will return 0 for the null counts in that instance. This behavior may change in a future release.

Both parquet-java and parquet-cpp store null counts as 0 when there are no nulls, and don’t write unknown values to the null count field.
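The effect of the flag can be pictured with a small standard-library sketch. The function effective_null_count is a hypothetical stand-in for the converter's internal handling, assuming a stored null count is modelled as Option<u64>.

```rust
/// Hypothetical model of the flag's effect (not the converter's actual code).
/// `stored` is the null count read from the file, or None if it is absent.
fn effective_null_count(stored: Option<u64>, missing_null_counts_as_zero: bool) -> Option<u64> {
    match stored {
        // A count that is present is always reported as-is.
        Some(n) => Some(n),
        // An absent count is assumed to be zero when the flag is set.
        None if missing_null_counts_as_zero => Some(0),
        // Otherwise the count stays unknown (a null entry in the output).
        None => None,
    }
}
```

Passing false is the conservative choice when files may come from writers that omit the field for unknown counts.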


pub fn row_group_row_counts<I>( &self, metadatas: I, ) -> Result<Option<UInt64Array>>
where I: IntoIterator<Item = &'a RowGroupMetaData>,

Returns a UInt64Array with row counts for each row group

§Return Value

The returned array has no nulls, and has one value for each row group. Each value is the number of rows in the row group.

§Example
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
// create a converter
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
  .unwrap();
// get the row counts for each row group
let row_counts = converter.row_group_row_counts(metadata
  .row_groups()
  .iter()
).unwrap();
// file had 2 row groups, with 1024 and 23 rows respectively
assert_eq!(row_counts, Some(UInt64Array::from(vec![1024, 23])));

pub fn try_new<'b>( column_name: &'b str, arrow_schema: &'a Schema, parquet_schema: &'a SchemaDescriptor, ) -> Result<Self>

Create a new StatisticsConverter to extract statistics for a column

Note: if there is no corresponding column in the Parquet file, the returned arrays will contain only null values. This can happen if the column is in the Arrow schema but not in the Parquet schema due to schema evolution.

See example on Self::row_group_mins for usage

§Errors
  • If the column is not found in the arrow schema
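The three possible outcomes (error, column absent from the file, column found) can be modelled with plain string slices. This is a simplified sketch that assumes columns are matched purely by name; resolve_column is a hypothetical helper, and the real lookup performed by parquet_column may differ.

```rust
/// Hypothetical, name-based model of column resolution.
/// - Err: the column is not in the Arrow schema.
/// - Ok(None): the column is in the Arrow schema but not in the Parquet file.
/// - Ok(Some(i)): the column is at position `i` in the Parquet file.
fn resolve_column(
    column_name: &str,
    arrow_fields: &[&str],
    parquet_columns: &[&str],
) -> Result<Option<usize>, String> {
    if !arrow_fields.contains(&column_name) {
        return Err(format!("column {column_name} not found in Arrow schema"));
    }
    // The column orders of the two schemas may differ, so search by name.
    Ok(parquet_columns.iter().position(|c| *c == column_name))
}
```

The Ok(None) case corresponds to parquet_column_index returning None, where the extracted statistics are all null.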

pub fn row_group_mins<I>(&self, metadatas: I) -> Result<ArrayRef>
where I: IntoIterator<Item = &'a RowGroupMetaData>,

Extract the minimum values from row group statistics in RowGroupMetaData

§Return Value

The returned array contains 1 value for each row group, in the same order as metadatas

Each value is either

  • the minimum value for the column
  • a null value, if the statistics can not be extracted

Note that a null value does NOT mean the min value was actually null; it means the requested statistic is unknown

§Errors

Reasons for not being able to extract the statistics include:

  • the column is not present in the parquet file
  • statistics for the column are not present in the row group
  • the stored statistic value can not be converted to the requested type
§Example
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
// create a converter
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
  .unwrap();
// get the minimum value for the column "foo" in the parquet file
let min_values: ArrayRef = converter
  .row_group_mins(metadata.row_groups().iter())
  .unwrap();
// if "foo" is a Float64 value, the returned array will contain Float64 values
assert_eq!(min_values, Arc::new(Float64Array::from(vec![Some(1.0), Some(2.0)])) as _);

pub fn row_group_maxes<I>(&self, metadatas: I) -> Result<ArrayRef>
where I: IntoIterator<Item = &'a RowGroupMetaData>,

Extract the maximum values from row group statistics in RowGroupMetaData

See docs on Self::row_group_mins for details
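Minimum and maximum values are typically used together for pruning. The sketch below is one possible way to do that for an equality predicate, assuming the statistics have already been read into slices of Option<f64>; may_contain and row_groups_to_scan are hypothetical helpers. The key point is that an unknown (null) statistic must never cause a row group to be skipped.

```rust
/// Hypothetical helper: can a row group with these bounds contain `target`?
/// An unknown bound (None) never rules the row group out.
fn may_contain(min: Option<f64>, max: Option<f64>, target: f64) -> bool {
    let below_min = min.map_or(false, |lo| target < lo);
    let above_max = max.map_or(false, |hi| target > hi);
    !(below_min || above_max)
}

/// Returns the indices of the row groups that still need to be scanned
/// for rows equal to `target`. `mins` and `maxes` hold one entry per row group.
fn row_groups_to_scan(mins: &[Option<f64>], maxes: &[Option<f64>], target: f64) -> Vec<usize> {
    mins.iter()
        .zip(maxes)
        .enumerate()
        .filter(|(_, (lo, hi))| may_contain(**lo, **hi, target))
        .map(|(i, _)| i)
        .collect()
}
```

With mins [1.0, unknown, 10.0] and maxes [5.0, unknown, 20.0], a search for 3.0 keeps the first row group (in range) and the second (unknown), and skips the third.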


pub fn row_group_null_counts<I>(&self, metadatas: I) -> Result<UInt64Array>
where I: IntoIterator<Item = &'a RowGroupMetaData>,

Extract the null counts from row group statistics in RowGroupMetaData

See docs on Self::row_group_mins for details


pub fn data_page_mins<I>( &self, column_page_index: &ParquetColumnIndex, column_offset_index: &ParquetOffsetIndex, row_group_indices: I, ) -> Result<ArrayRef>
where I: IntoIterator<Item = &'a usize>,

Extract the minimum values from Data Page statistics.

In Parquet files, in addition to the Column Chunk level statistics (stored for each column for each row group) there are also optional statistics stored for each data page, as part of the ParquetColumnIndex.

Since a single Column Chunk is stored as one or more pages, page level statistics can prune at a finer granularity.

However, since they are stored in a separate metadata structure (Index), the code to extract them differs from the code that extracts row group statistics.

§Parameters:
  • column_page_index: The parquet column page indices, read from ParquetMetaData column_index

  • column_offset_index: The parquet column offset indices, read from ParquetMetaData offset_index

  • row_group_indices: The indices of the row groups that are used to extract the column page index and offset index on a per row group, per column basis.

§Return Value

The returned array contains 1 value for each NativeIndex in the underlying Indexes, in the order given by row_group_indices.

For example, if row_group_indices selects two Indexes:

  1. the first having 3 PageIndex entries
  2. the second having 2 PageIndex entries

The returned array would have 5 rows.
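The flattening described here can be sketched with nested vectors. This assumes the per-page minimums are already available as one Vec per row group; flatten_page_mins is a hypothetical helper, not part of the API.

```rust
/// Hypothetical model of the output layout: concatenates the per-page values
/// of the selected row groups, in the order given by `row_group_indices`.
/// Panics if an index is out of bounds for `per_row_group`.
fn flatten_page_mins(
    per_row_group: &[Vec<Option<i64>>],
    row_group_indices: &[usize],
) -> Vec<Option<i64>> {
    row_group_indices
        .iter()
        .flat_map(|&rg| per_row_group[rg].iter().copied())
        .collect()
}
```

Two row groups with 3 and 2 pages produce 5 output values, matching the example above.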

Each value is either:

  • the minimum value for the page
  • a null value, if the statistics can not be extracted

Note that a null value does NOT mean the min value was actually null; it means the requested statistic is unknown

§Errors

Reasons for not being able to extract the statistics include:

  • the column is not present in the parquet file
  • statistics for the pages are not present in the row group
  • the stored statistic value can not be converted to the requested type

pub fn data_page_maxes<I>( &self, column_page_index: &ParquetColumnIndex, column_offset_index: &ParquetOffsetIndex, row_group_indices: I, ) -> Result<ArrayRef>
where I: IntoIterator<Item = &'a usize>,

Extract the maximum values from Data Page statistics.

See docs on Self::data_page_mins for details.


pub fn data_page_null_counts<I>( &self, column_page_index: &ParquetColumnIndex, column_offset_index: &ParquetOffsetIndex, row_group_indices: I, ) -> Result<UInt64Array>
where I: IntoIterator<Item = &'a usize>,

Returns a UInt64Array with null counts for each data page.

See docs on Self::data_page_mins for details.


pub fn data_page_row_counts<I>( &self, column_offset_index: &ParquetOffsetIndex, row_group_metadatas: &'a [RowGroupMetaData], row_group_indices: I, ) -> Result<Option<UInt64Array>>
where I: IntoIterator<Item = &'a usize>,

Returns a UInt64Array with row counts for each data page.

This function iterates over the given row group indexes and computes the row count for each page in the specified column.

§Parameters:
  • column_offset_index: The parquet column offset indices, read from ParquetMetaData offset_index

  • row_group_metadatas: The metadata slice of the row groups, read from ParquetMetaData row_groups

  • row_group_indices: The indices of the row groups that are used to extract the column offset index on a per row group, per column basis.

See docs on Self::data_page_mins for details.
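One plausible way to derive per-page row counts is from the first row index that the offset index records for each page, together with the row group's total row count. The sketch below assumes those values are available as plain u64 numbers; page_row_counts is a hypothetical helper and may not match the actual implementation.

```rust
/// Hypothetical helper: computes the number of rows in each page of one
/// column chunk. `first_row_indexes` holds the index of the first row of each
/// page (ascending), and `row_group_num_rows` is the row group's total rows.
fn page_row_counts(first_row_indexes: &[u64], row_group_num_rows: u64) -> Vec<u64> {
    // Every page except the last ends where the next page begins.
    let mut counts: Vec<u64> = first_row_indexes
        .windows(2)
        .map(|w| w[1] - w[0])
        .collect();
    // The last page runs to the end of the row group.
    if let Some(&last_start) = first_row_indexes.last() {
        counts.push(row_group_num_rows - last_start);
    }
    counts
}
```

For pages starting at rows 0, 100, and 250 in a 300-row row group, the counts are 100, 150, and 50.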

Trait Implementations§


impl<'a> Debug for StatisticsConverter<'a>


fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

Blanket Implementations§


impl<T> Any for T
where T: 'static + ?Sized,


fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,


fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,


fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T


fn from(t: T) -> T

Returns the argument unchanged.


impl<T, U> Into<U> for T
where U: From<T>,


fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.


impl<T, U> TryFrom<U> for T
where U: Into<T>,


type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,


type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,


fn vzip(self) -> V


impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,