Struct parquet::arrow::arrow_reader::ArrowReaderBuilder
pub struct ArrowReaderBuilder<T> { /* private fields */ }
A generic builder for constructing sync or async arrow parquet readers. This is not intended to be used directly; instead, use the specialization for the type of reader you wish to use:
- For a synchronous API: ParquetRecordBatchReaderBuilder
- For an asynchronous API: ParquetRecordBatchStreamBuilder
Implementations
impl<T> ArrowReaderBuilder<T>
pub fn metadata(&self) -> &Arc<ParquetMetaData>
Returns a reference to the ParquetMetaData for this parquet file.
pub fn parquet_schema(&self) -> &SchemaDescriptor
Returns the parquet SchemaDescriptor for this parquet file.
pub fn with_batch_size(self, batch_size: usize) -> Self
Set the maximum number of rows in each RecordBatch produced. Defaults to 1024. If batch_size exceeds the file's row count, the file row count is used instead.
pub fn with_row_groups(self, row_groups: Vec<usize>) -> Self
Only read data from the provided row group indexes.
pub fn with_projection(self, mask: ProjectionMask) -> Self
Only read data from the provided column indexes.
pub fn with_row_selection(self, selection: RowSelection) -> Self
Provide a RowSelection to filter out rows and avoid fetching their data into memory.
Row group filtering is applied before this, so rows from skipped row groups should not be included in the RowSelection.
An example use case is applying a selection determined by evaluating predicates against the page Index.
It is recommended to enable reading the page index when using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
pub fn with_row_filter(self, filter: RowFilter) -> Self
Provide a RowFilter to skip decoding rows.
Row filters are applied after row group selection and row selection.
It is recommended to enable reading the page index when using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
pub fn with_limit(self, limit: usize) -> Self
Provide a limit on the number of rows to be read.
The limit is applied after any Self::with_row_selection and Self::with_row_filter, allowing it to limit the final set of rows decoded after any pushed-down predicates.
It is recommended to enable reading the page index when using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
pub fn with_offset(self, offset: usize) -> Self
Provide an offset to skip over the given number of rows.
The offset is applied after any Self::with_row_selection and Self::with_row_filter, allowing it to skip rows after any pushed-down predicates.
It is recommended to enable reading the page index when using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>>
pub fn try_new(reader: T) -> Result<Self>
Create a new ParquetRecordBatchReaderBuilder:

// `file` is any ChunkReader, e.g. a std::fs::File or a bytes::Bytes buffer
let mut builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
// Inspect metadata
assert_eq!(builder.metadata().num_row_groups(), 1);
// Construct reader
let mut reader: ParquetRecordBatchReader = builder.with_row_groups(vec![0]).build().unwrap();
// Read data
let _batch = reader.next().unwrap().unwrap();
pub fn try_new_with_options(reader: T, options: ArrowReaderOptions) -> Result<Self>
Create a new ParquetRecordBatchReaderBuilder with the provided ArrowReaderOptions.
pub fn new_with_metadata(input: T, metadata: ArrowReaderMetadata) -> Self
Create a ParquetRecordBatchReaderBuilder from the provided ArrowReaderMetadata.
This allows loading metadata once and using it to create multiple builders with potentially different settings:

// `file` is a cheaply cloneable ChunkReader, e.g. a bytes::Bytes buffer
let metadata = ArrowReaderMetadata::load(&file, Default::default()).unwrap();
let mut a = ParquetRecordBatchReaderBuilder::new_with_metadata(file.clone(), metadata.clone()).build().unwrap();
let mut b = ParquetRecordBatchReaderBuilder::new_with_metadata(file, metadata).build().unwrap();
// Should be able to read from both in parallel
assert_eq!(a.next().unwrap().unwrap(), b.next().unwrap().unwrap());
pub fn build(self) -> Result<ParquetRecordBatchReader>
Build a ParquetRecordBatchReader.
Note: this will eagerly evaluate any RowFilter before returning.