Struct parquet::arrow::arrow_reader::ArrowReaderBuilder
pub struct ArrowReaderBuilder<T> { /* private fields */ }
Builder for constructing parquet readers into arrow.
Most users should use one of the following specializations:
- synchronous API: ParquetRecordBatchReaderBuilder::try_new
- async API: ParquetRecordBatchStreamBuilder::new
Implementations§
impl<T> ArrowReaderBuilder<T>
pub fn metadata(&self) -> &Arc<ParquetMetaData>
Returns a reference to the ParquetMetaData for this parquet file
pub fn parquet_schema(&self) -> &SchemaDescriptor
Returns the parquet SchemaDescriptor for this parquet file
pub fn with_batch_size(self, batch_size: usize) -> Self
Set the size of RecordBatch to produce. Defaults to 1024.
If the batch_size is more than the file row count, the file row count is used instead.
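The clamping rule above can be sketched with plain arithmetic. This is a minimal std-only illustration; `effective_batch_size` is a hypothetical helper introduced here, not a function of the parquet crate.

```rust
/// Hypothetical helper (not part of the parquet crate): the batch size that
/// would effectively be used under the clamping rule described above.
fn effective_batch_size(requested: usize, file_rows: usize) -> usize {
    // A request larger than the file's row count is reduced to the row count.
    requested.min(file_rows)
}
```

Under this model, requesting 1024 rows per batch on a 100-row file gives an effective batch size of 100, while on a 5000-row file it stays at 1024.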
pub fn with_row_groups(self, row_groups: Vec<usize>) -> Self
Only read data from the provided row group indexes
This is also called row group filtering
pub fn with_projection(self, mask: ProjectionMask) -> Self
Only read data from the provided column indexes
pub fn with_row_selection(self, selection: RowSelection) -> Self
Provide a RowSelection to filter out rows, and avoid fetching their data into memory.
This feature is used to restrict which rows are decoded within row groups, skipping ranges of rows that are not needed. Such selections could be determined by evaluating predicates against the parquet page Index or some other external information available to a query engine.
§Notes
Row group filtering (see Self::with_row_groups) is applied prior to applying the row selection, and therefore rows from skipped row groups should not be included in the RowSelection (see example below).
It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
§Example
Given a parquet file with 4 row groups, and a row group filter of [0, 2, 3], in order to scan rows 50-100 in row group 2 and rows 200-300 in row group 3:
Row Group 0, 1000 rows (selected)
Row Group 1, 1000 rows (skipped)
Row Group 2, 1000 rows (selected, but want to only scan rows 50-100)
Row Group 3, 1000 rows (selected, but want to only scan rows 200-300)
You could pass the following RowSelection:
Select 1000 (scan all rows in row group 0)
Skip 50 (skip the first 50 rows in row group 2)
Select 50 (scan rows 50-100 in row group 2)
Skip 900 (skip the remaining rows in row group 2)
Skip 200 (skip the first 200 rows in row group 3)
Select 100 (scan rows 200-300 in row group 3)
Skip 700 (skip the remaining rows in row group 3)
Note there is no entry for the (entirely) skipped row group 1.
Note you can represent the same selection with fewer entries. Instead of
Skip 900 (skip the remaining rows in row group 2)
Skip 200 (skip the first 200 rows in row group 3)
you could use
Skip 1100 (skip the remaining 900 rows in row group 2 and the first 200 rows in row group 3)
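To check the arithmetic of the example, the skip/select runs can be expanded into ranges of row offsets. The sketch below is a std-only model: the `Run` enum and the functions are hypothetical stand-ins written for this illustration, not the parquet crate's RowSelection type. Offsets count rows across the row groups that remain after row group filtering, so row group 2 starts at offset 1000 and row group 3 at offset 2000.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one run of a row selection: select or skip
/// `n` consecutive rows. This is NOT the parquet crate's own type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Expand runs into half-open ranges of selected row offsets, counted
/// across the row groups that survived row group filtering.
fn selected_ranges(runs: &[Run]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut pos = 0;
    for run in runs {
        match *run {
            // Skipped rows only advance the position.
            Run::Skip(n) => pos += n,
            // Selected rows produce a range, then advance the position.
            Run::Select(n) => {
                if n > 0 {
                    ranges.push(pos..pos + n);
                }
                pos += n;
            }
        }
    }
    ranges
}

/// The seven-entry selection from the example above.
fn example_selection() -> Vec<Run> {
    use Run::*;
    vec![
        Select(1000), // all of row group 0
        Skip(50),     // first 50 rows of row group 2
        Select(50),   // rows 50-100 of row group 2
        Skip(900),    // rest of row group 2
        Skip(200),    // first 200 rows of row group 3
        Select(100),  // rows 200-300 of row group 3
        Skip(700),    // rest of row group 3
    ]
}

/// The same selection with the two adjacent skips merged into one.
fn merged_selection() -> Vec<Run> {
    use Run::*;
    vec![Select(1000), Skip(50), Select(50), Skip(1100), Select(100), Skip(700)]
}
```

In this model both `example_selection()` and `merged_selection()` expand to the ranges 0..1000, 1050..1100 and 2200..2300, which is why the shorter form is equivalent.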
pub fn with_row_filter(self, filter: RowFilter) -> Self
Provide a RowFilter to skip decoding rows
Row filters are applied after row group selection and row selection
It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
pub fn with_limit(self, limit: usize) -> Self
Provide a limit to the number of rows to be read
The limit will be applied after any Self::with_row_selection and Self::with_row_filter, allowing it to limit the final set of rows decoded after any pushed down predicates
It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
pub fn with_offset(self, offset: usize) -> Self
Provide an offset to skip over the given number of rows
The offset will be applied after any Self::with_row_selection and Self::with_row_filter, allowing it to skip rows after any pushed down predicates
It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
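The ordering of the row-level options described above (row selection, then row filter, then offset and limit) can be modelled with an iterator chain. This is a std-only sketch, not the parquet crate's implementation: `rows_read` is a hypothetical function, the closure stands in for a RowFilter predicate, and applying the offset before the limit is an assumption that the text above does not state.

```rust
/// Hypothetical model (not the parquet crate's implementation) of the order
/// in which the row-level options take effect.
///
/// `candidates` holds the row indexes left after row group filtering and
/// row selection; `filter` stands in for a RowFilter predicate.
fn rows_read(
    candidates: &[usize],
    filter: impl Fn(usize) -> bool,
    offset: usize,
    limit: Option<usize>,
) -> Vec<usize> {
    candidates
        .iter()
        .copied()
        // The row filter runs on rows that survived the row selection.
        .filter(|&row| filter(row))
        // Assumption: the offset is applied before the limit.
        .skip(offset)
        // No limit means every remaining row is read.
        .take(limit.unwrap_or(usize::MAX))
        .collect()
}
```

For instance, with candidate rows 0 through 9, a predicate keeping even rows, an offset of 1 and a limit of 2, this model reads rows 2 and 4: the filter leaves 0, 2, 4, 6, 8, the offset drops row 0, and the limit keeps the next two.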
impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>>
pub fn try_new(reader: T) -> Result<Self>
Create a new ParquetRecordBatchReaderBuilder
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
// Inspect metadata
assert_eq!(builder.metadata().num_row_groups(), 1);
// Construct reader
let mut reader: ParquetRecordBatchReader = builder.with_row_groups(vec![0]).build().unwrap();
// Read data
let _batch = reader.next().unwrap().unwrap();
pub fn try_new_with_options(reader: T, options: ArrowReaderOptions) -> Result<Self>
Create a new ParquetRecordBatchReaderBuilder with ArrowReaderOptions
pub fn new_with_metadata(input: T, metadata: ArrowReaderMetadata) -> Self
Create a ParquetRecordBatchReaderBuilder from the provided ArrowReaderMetadata
This interface allows:
- Loading metadata once and using it to create multiple builders with potentially different settings or run on different threads
- Using a cached copy of the metadata rather than re-reading it from the file each time a reader is constructed.
See the docs on ArrowReaderMetadata for more details
§Example
let metadata = ArrowReaderMetadata::load(&file, Default::default()).unwrap();
let mut a = ParquetRecordBatchReaderBuilder::new_with_metadata(file.clone(), metadata.clone()).build().unwrap();
let mut b = ParquetRecordBatchReaderBuilder::new_with_metadata(file, metadata).build().unwrap();
// Should be able to read from both in parallel
assert_eq!(a.next().unwrap().unwrap(), b.next().unwrap().unwrap());
pub fn build(self) -> Result<ParquetRecordBatchReader>
Build a ParquetRecordBatchReader
Note: this will eagerly evaluate any RowFilter before returning