Struct parquet::file::metadata::ParquetMetaDataReader
pub struct ParquetMetaDataReader { /* private fields */ }
Reads the ParquetMetaData from a byte stream.
See crate::file::metadata::ParquetMetaDataWriter for a description of the Parquet metadata.
Parquet metadata is not necessarily contiguous in the file: part is stored in the footer (the last bytes of the file), but other portions (such as the PageIndex) can be stored elsewhere.
This reader handles reading the footer as well as the non-contiguous parts of the metadata, such as the page indexes. It does not read Bloom filters.
§Example
// read parquet metadata including page indexes from a file
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_page_indexes(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
assert!(metadata.column_index().is_some());
assert!(metadata.offset_index().is_some());
Implementations§
impl ParquetMetaDataReader
pub fn new() -> Self
Create a new ParquetMetaDataReader
pub fn new_with_metadata(metadata: ParquetMetaData) -> Self
Create a new ParquetMetaDataReader populated with a ParquetMetaData struct obtained via other means.
pub fn with_page_indexes(self, val: bool) -> Self
Enable or disable reading the page index structures described in
“Parquet page index: Layout to Support Page Skipping”. Equivalent to:
self.with_column_indexes(val).with_offset_indexes(val)
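The equivalence above can be illustrated with a minimal stand-alone sketch. `SketchOptions` and its fields are hypothetical stand-ins for the reader's private state, not the crate's actual types; only the three method names are taken from this page.

```rust
/// Hypothetical stand-in for the reader's two page-index flags.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SketchOptions {
    column_index: bool,
    offset_index: bool,
}

impl SketchOptions {
    /// Toggle reading of the ColumnIndex structure.
    fn with_column_indexes(mut self, val: bool) -> Self {
        self.column_index = val;
        self
    }

    /// Toggle reading of the OffsetIndex structure.
    fn with_offset_indexes(mut self, val: bool) -> Self {
        self.offset_index = val;
        self
    }

    /// Toggle both structures at once by chaining the two setters.
    fn with_page_indexes(self, val: bool) -> Self {
        self.with_column_indexes(val).with_offset_indexes(val)
    }
}
```

Because each setter consumes and returns the builder, a later call overrides an earlier one. In this sketch, `with_page_indexes(true).with_offset_indexes(false)` leaves only the column index enabled.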
pub fn with_column_indexes(self, val: bool) -> Self
Enable or disable reading the Parquet ColumnIndex structure.
pub fn with_offset_indexes(self, val: bool) -> Self
Enable or disable reading the Parquet OffsetIndex structure.
pub fn with_prefetch_hint(self, prefetch: Option<usize>) -> Self
Provide a hint as to the number of bytes needed to fully parse the ParquetMetaData.
Only used for the asynchronous Self::try_load() method.
By default, the reader first fetches the last 8 bytes of the input file to obtain the size of the footer metadata. A second fetch then obtains the metadata bytes. After parsing the footer metadata, a third fetch obtains the bytes needed to decode the page index structures, if they have been requested. To avoid unnecessary fetches, prefetch can be set to an estimate of the number of bytes needed to fully decode the ParquetMetaData, which can reduce the number of fetch requests and reduce latency. Setting prefetch too small will not trigger an error, but will result in extra fetches being performed.
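The effect of the hint on the number of fetches can be modelled with a small stand-alone function. This is a rough sketch, not the crate's implementation: `sketch_fetch_count` and its parameters are hypothetical, and it assumes the page index bytes sit directly in front of the footer metadata, which the format does not guarantee.

```rust
/// Hypothetical model of the fetch sequence described above.
/// All sizes are in bytes, counted back from the end of the file.
fn sketch_fetch_count(
    metadata_len: usize,           // size of the footer metadata
    page_index_len: Option<usize>, // size of the page indexes, if requested
    prefetch: Option<usize>,       // the hint; `None` means no hint was given
) -> usize {
    const FOOTER_SIZE: usize = 8;

    // The first fetch always covers at least the 8-byte footer.
    let fetched = prefetch.unwrap_or(FOOTER_SIZE).max(FOOTER_SIZE);
    let mut fetches = 1;

    // A second fetch is needed when the first one did not reach the metadata start.
    let metadata_end = FOOTER_SIZE + metadata_len;
    if fetched < metadata_end {
        fetches += 1;
    }

    // A further fetch is needed when page indexes were requested but not covered.
    // Assumption of this sketch: they lie immediately before the metadata.
    if let Some(index_len) = page_index_len {
        if fetched < metadata_end + index_len {
            fetches += 1;
        }
    }
    fetches
}
```

In this model, a hint that is too small never causes a failure. It only leaves the fetch count where it would have been with no hint at all.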
pub fn has_metadata(&self) -> bool
Indicates whether this reader has a ParquetMetaData internally.
pub fn finish(&mut self) -> Result<ParquetMetaData>
Return the parsed ParquetMetaData struct, leaving None in its place.
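The "leaving None in its place" behaviour matches the standard `Option::take` idiom. The sketch below uses a hypothetical `SketchReader` with a `String` standing in for ParquetMetaData. That a second call to `finish` fails is an assumption of this sketch, not something this page states.

```rust
/// Hypothetical stand-in for the reader; `String` plays the role of ParquetMetaData.
struct SketchReader {
    metadata: Option<String>,
}

impl SketchReader {
    /// True while the reader still holds metadata.
    fn has_metadata(&self) -> bool {
        self.metadata.is_some()
    }

    /// Move the metadata out of the reader, leaving `None` behind.
    fn finish(&mut self) -> Result<String, String> {
        // `Option::take` returns the current value and resets the field to `None`.
        self.metadata
            .take()
            .ok_or_else(|| "no metadata available".to_string())
    }
}
```

Since `finish` takes `&mut self` rather than `self`, the reader stays usable afterwards, but it no longer reports having metadata.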
pub fn parse_and_finish<R: ChunkReader>(
    self,
    reader: &R,
) -> Result<ParquetMetaData>
Given a ChunkReader, parse and return the ParquetMetaData in a single pass.
If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.
This call will consume self.
§Example
// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let metadata = ParquetMetaDataReader::new()
    .with_page_indexes(true)
    .parse_and_finish(&file).unwrap();
pub fn try_parse<R: ChunkReader>(&mut self, reader: &R) -> Result<()>
Attempts to parse the footer metadata (and optionally page indexes) given a ChunkReader.
If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.
pub fn try_parse_sized<R: ChunkReader>(
    &mut self,
    reader: &R,
    file_size: usize,
) -> Result<()>
Same as Self::try_parse(), but provide the original file size in the case that reader is a Bytes struct that does not contain the entire file. This information is necessary when the page indexes are desired. In all cases, reader must have access to the Parquet footer.
Using this function also allows for retrying with a larger buffer.
§Errors
This function will return ParquetError::IndexOutOfBound in the event reader does not provide enough data to fully parse the metadata (see example below). Other errors returned include ParquetError::General and ParquetError::EOF.
§Example
let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
let bytes = get_bytes(&file, 1000..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
match reader.try_parse_sized(&bytes, len) {
    Ok(_) => (),
    Err(ParquetError::IndexOutOfBound(needed, _)) => {
        // retry with a buffer holding the last `needed` bytes of the file
        let bytes = get_bytes(&file, len - needed..len);
        reader.try_parse_sized(&bytes, len).unwrap();
    }
    _ => panic!("unexpected error"),
}
let metadata = reader.finish().unwrap();
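Why the original file size is needed can be seen from a stand-alone sketch of the offset arithmetic. Offsets stored in the metadata are absolute file positions, so a buffer holding only the tail of the file has to translate them. `sketch_tail_slice` is a hypothetical helper, not part of the crate, and its `Err(needed)` value only mimics the way the example above uses `needed` (a byte count measured from the end of the file).

```rust
use std::ops::Range;

/// Hypothetical helper: `tail` holds the last `tail.len()` bytes of a file
/// that is `file_size` bytes long. Returns the bytes at the absolute file
/// range `range`, or `Err(needed)` where `needed` is the number of trailing
/// bytes a buffer must hold to satisfy the request.
fn sketch_tail_slice(
    tail: &[u8],
    file_size: usize,
    range: Range<usize>,
) -> Result<&[u8], usize> {
    // Preconditions of this sketch: a well-formed range inside the file.
    assert!(tail.len() <= file_size);
    assert!(range.start <= range.end && range.end <= file_size);

    // Absolute file offset of the first byte held in `tail`.
    let tail_start = file_size - tail.len();
    if range.start < tail_start {
        // The buffer starts too late: report how much of the tail is required.
        return Err(file_size - range.start);
    }
    // Translate absolute offsets into offsets relative to the buffer.
    Ok(&tail[range.start - tail_start..range.end - tail_start])
}
```

A caller can react to `Err(needed)` the same way the example does: fetch the last `needed` bytes of the file and try again.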
pub fn read_page_indexes<R: ChunkReader>(&mut self, reader: &R) -> Result<()>
Read the page index structures when a ParquetMetaData has already been obtained.
See Self::new_with_metadata() and Self::has_metadata().
pub fn read_page_indexes_sized<R: ChunkReader>(
    &mut self,
    reader: &R,
    file_size: usize,
) -> Result<()>
Read the page index structures when a ParquetMetaData has already been obtained.
This variant is used when reader cannot access the entire Parquet file (e.g. it is a Bytes struct containing the tail of the file).
See Self::new_with_metadata() and Self::has_metadata().
Decodes the Parquet footer, returning the metadata length in bytes.
A Parquet footer is 8 bytes long and has the following layout:
- 4 bytes for the metadata length
- 4 bytes for the magic bytes ‘PAR1’
+-----+--------+
| len | 'PAR1' |
+-----+--------+
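The layout above can be decoded with nothing but the standard library. This is a minimal sketch, not the crate's own function: `sketch_decode_footer` and its `String` error type are hypothetical, and the length is read as a little-endian 32-bit integer, which is how the Parquet format stores it.

```rust
/// Size of the Parquet footer: a 4-byte length followed by the 4-byte magic.
const FOOTER_SIZE: usize = 8;
/// Magic bytes that end a Parquet file.
const MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper (not the crate's API): returns the metadata length
/// recorded in the footer, or an error message if the magic does not match.
fn sketch_decode_footer(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    // The last four bytes must be the magic.
    if footer[4..] != MAGIC[..] {
        return Err("invalid Parquet footer: bad magic".to_string());
    }
    // The first four bytes hold the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}
```

The returned length tells a reader how many bytes immediately before the footer hold the encoded metadata.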
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData>
Decodes ParquetMetaData from the provided bytes.
Typically this is used to decode the metadata from the end of a Parquet file. The format of buf is the Thrift compact binary protocol, as specified by the Parquet Spec.
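As background on the encoding named above: the Thrift compact protocol stores integers as variable-length values of 7 bits per byte, with signed integers zigzag-encoded first. The sketch below shows only that primitive. Both function names are hypothetical, and this is not the decoder the crate uses.

```rust
/// Reads an unsigned variable-length integer from the start of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends early or runs past 10 bytes (the maximum for 64 bits).
fn sketch_read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        // The low 7 bits carry data, least significant group first.
        value |= u64::from(byte & 0x7f) << (7 * i);
        // A clear high bit marks the final byte.
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Undoes zigzag encoding: 0, 1, 2, 3, ... map to 0, -1, 1, -2, ...
fn sketch_zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}
```

Small magnitudes, positive or negative, therefore take a single byte, which helps keep the encoded metadata compact.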