Struct parquet::file::metadata::ParquetMetaDataReader

pub struct ParquetMetaDataReader { /* private fields */ }

Reads the ParquetMetaData from a byte stream.

See crate::file::metadata::ParquetMetaDataWriter for a description of the Parquet metadata.

Parquet metadata is not necessarily contiguous in a file: part is stored in the footer (the last bytes of the file), but other portions (such as the PageIndex) can be stored elsewhere.

This reader handles the footer as well as the non-contiguous parts of the metadata, such as the page indexes. It does not read Bloom filters.

§Example

// read parquet metadata including page indexes from a file
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_page_indexes(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
assert!(metadata.column_index().is_some());
assert!(metadata.offset_index().is_some());

Implementations§

impl ParquetMetaDataReader

pub fn new() -> Self

Create a new ParquetMetaDataReader

pub fn new_with_metadata(metadata: ParquetMetaData) -> Self

Create a new ParquetMetaDataReader populated with a ParquetMetaData struct obtained via other means.

pub fn with_page_indexes(self, val: bool) -> Self

Enable or disable reading the page index structures described in “Parquet page index: Layout to Support Page Skipping”. Equivalent to: self.with_column_indexes(val).with_offset_indexes(val)

pub fn with_column_indexes(self, val: bool) -> Self

Enable or disable reading the Parquet ColumnIndex structure.

pub fn with_offset_indexes(self, val: bool) -> Self

Enable or disable reading the Parquet OffsetIndex structure.

pub fn with_prefetch_hint(self, prefetch: Option<usize>) -> Self

Provide a hint as to the number of bytes needed to fully parse the ParquetMetaData. Only used for the asynchronous Self::try_load() method.

By default, the reader will first fetch the last 8 bytes of the input file to obtain the size of the footer metadata. A second fetch will be performed to obtain the needed bytes. After parsing the footer metadata, a third fetch will be performed to obtain the bytes needed to decode the page index structures, if they have been requested. To avoid unnecessary fetches, prefetch can be set to an estimate of the number of bytes needed to fully decode the ParquetMetaData, which can reduce the number of fetch requests and reduce latency. Setting prefetch too small will not trigger an error, but will result in extra fetches being performed.
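The fetch sequence described above can be approximated with a small model. This is only a sketch using the standard library, not the crate's implementation: the function `count_fetches` and its parameters `footer_metadata_len` and `page_index_len` are hypothetical names, and the model assumes the page index bytes sit directly before the footer metadata at the end of the file, which is a simplification.

```rust
/// Size of the fixed Parquet footer (metadata length + magic bytes).
const FOOTER_SIZE: usize = 8;

/// Counts the fetches issued under a simplified model of the default behavior.
///
/// Assumption (not taken from the crate): the tail of the file is laid out as
/// `[page indexes][footer metadata][8-byte footer]`.
fn count_fetches(
    footer_metadata_len: usize,
    page_index_len: Option<usize>, // `None` when page indexes are not requested
    prefetch: Option<usize>,
) -> usize {
    // First fetch: the tail of the file, at least the 8-byte footer.
    let fetched = prefetch.unwrap_or(0).max(FOOTER_SIZE);
    let mut fetches = 1;

    // Second fetch: only needed if the tail did not cover the footer metadata.
    let metadata_end = FOOTER_SIZE + footer_metadata_len;
    if fetched < metadata_end {
        fetches += 1;
    }

    // Third fetch: only needed if page indexes were requested and not covered.
    if let Some(index_len) = page_index_len {
        if fetched < metadata_end + index_len {
            fetches += 1;
        }
    }
    fetches
}
```

In this model, a file with 1000 bytes of footer metadata and 500 bytes of page indexes needs three fetches with no hint, two with a hint of 1200 bytes, and one with a hint of 2000 bytes. A hint that is too small (say 100 bytes) still works but saves nothing.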

pub fn has_metadata(&self) -> bool

Indicates whether this reader has a ParquetMetaData internally.

pub fn finish(&mut self) -> Result<ParquetMetaData>

Return the parsed ParquetMetaData struct, leaving None in its place.
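The "leaving None in its place" wording matches the semantics of `Option::take`. The sketch below illustrates that pattern with stand-in types; `Metadata` and `Reader` are hypothetical and are not the crate's types. Returning an error when `finish` is called with nothing stored is an assumption of this sketch.

```rust
/// Stand-in for the parsed metadata (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
struct Metadata {
    num_rows: i64,
}

/// Stand-in for a reader that holds its result in an `Option`.
#[derive(Default)]
struct Reader {
    metadata: Option<Metadata>,
}

impl Reader {
    /// True while the reader still holds a parsed result.
    fn has_metadata(&self) -> bool {
        self.metadata.is_some()
    }

    /// Moves the stored value out, leaving `None` behind.
    fn finish(&mut self) -> Result<Metadata, String> {
        self.metadata
            .take()
            .ok_or_else(|| "no metadata has been parsed".to_string())
    }
}
```

With this shape, `has_metadata()` reports `true` before `finish()` and `false` afterwards, so a caller that needs the metadata more than once should keep the returned value rather than call `finish()` again.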

pub fn parse_and_finish<R: ChunkReader>( self, reader: &R, ) -> Result<ParquetMetaData>

Given a ChunkReader, parse and return the ParquetMetaData in a single pass.

If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.

This call will consume self.

§Example
// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let metadata = ParquetMetaDataReader::new()
    .with_page_indexes(true)
    .parse_and_finish(&file).unwrap();

pub fn try_parse<R: ChunkReader>(&mut self, reader: &R) -> Result<()>

Attempts to parse the footer metadata (and optionally page indexes) given a ChunkReader.

If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.

pub fn try_parse_sized<R: ChunkReader>( &mut self, reader: &R, file_size: usize, ) -> Result<()>

Same as Self::try_parse(), but provide the original file size in the case that reader is a Bytes struct that does not contain the entire file. This information is necessary when the page indexes are desired. reader must have access to the Parquet footer.

Using this function also allows for retrying with a larger buffer.
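To see why the original file size matters, note that offsets stored in Parquet metadata are absolute positions in the file, while a tail buffer starts at index 0. The following standard-library sketch shows the translation; `locate_in_tail` is a hypothetical helper, not part of the crate, and it assumes the buffer holds exactly the last `buf_len` bytes of the file.

```rust
use std::ops::Range;

/// Maps an absolute byte range in the file onto a buffer that holds only the
/// last `buf_len` bytes of that file.
///
/// Returns the range inside the buffer, or, if the buffer starts too late,
/// the number of trailing bytes a larger buffer would need to hold.
fn locate_in_tail(
    file_size: usize,
    buf_len: usize,
    wanted: Range<usize>,
) -> Result<Range<usize>, usize> {
    // Absolute file offset at which the tail buffer begins.
    let buf_start = file_size.saturating_sub(buf_len);
    if wanted.start >= buf_start {
        Ok(wanted.start - buf_start..wanted.end - buf_start)
    } else {
        // Not enough data: report how many tail bytes are required.
        Err(file_size - wanted.start)
    }
}
```

For a 10,000 byte file, a 1,000 byte tail buffer can serve the absolute range 9,200..9,500 as 200..500. A request for 8,000..8,400 fails and reports that 2,000 tail bytes are needed, at which point the caller can retry with a larger buffer.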

§Errors

This function will return ParquetError::IndexOutOfBound in the event reader does not provide enough data to fully parse the metadata (see example below).

Other errors returned include ParquetError::General and ParquetError::EOF.

§Example
let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
let bytes = get_bytes(&file, 1000..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
match reader.try_parse_sized(&bytes, len) {
    Ok(_) => (),
    Err(ParquetError::IndexOutOfBound(needed, _)) => {
        let bytes = get_bytes(&file, len - needed..len);
        reader.try_parse_sized(&bytes, len).unwrap();
    }
    _ => panic!("unexpected error")
}
let metadata = reader.finish().unwrap();

pub fn read_page_indexes<R: ChunkReader>(&mut self, reader: &R) -> Result<()>

Read the page index structures when a ParquetMetaData has already been obtained. See Self::new_with_metadata() and Self::has_metadata().

pub fn read_page_indexes_sized<R: ChunkReader>( &mut self, reader: &R, file_size: usize, ) -> Result<()>

Read the page index structures when a ParquetMetaData has already been obtained. This variant is used when reader cannot access the entire Parquet file (e.g. it is a Bytes struct containing the tail of the file). See Self::new_with_metadata() and Self::has_metadata().

Decodes the Parquet footer, returning the metadata length in bytes.

A Parquet footer is 8 bytes long and has the following layout:

  • 4 bytes for the metadata length
  • 4 bytes for the magic bytes 'PAR1'
+-----+--------+
| len | 'PAR1' |
+-----+--------+
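The footer layout above can be decoded with the standard library alone. This is a minimal sketch rather than the crate's code: `decode_footer_len` is a hypothetical name, errors are plain strings, and the length is read as little-endian, as the Parquet format specifies.

```rust
/// Decodes an 8-byte Parquet footer and returns the metadata length in bytes.
fn decode_footer_len(footer: &[u8; 8]) -> Result<usize, String> {
    // The last 4 bytes must be the magic bytes 'PAR1'.
    if footer[4..] != b"PAR1"[..] {
        return Err("invalid Parquet footer: missing PAR1 magic bytes".to_string());
    }
    // The first 4 bytes hold the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}
```

For example, the bytes `[0x10, 0x27, 0x00, 0x00, b'P', b'A', b'R', b'1']` decode to a metadata length of 10000, so the footer metadata would occupy the 10000 bytes immediately preceding the footer.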

pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData>

Decodes ParquetMetaData from the provided bytes.

Typically this is used to decode the metadata from the end of a Parquet file. The format of buf is the Thrift compact binary protocol, as specified by the Parquet Spec.

Trait Implementations§

impl Default for ParquetMetaDataReader

fn default() -> ParquetMetaDataReader

Returns the “default value” for a type.
