pub struct ArrowWriter<W: Write> { /* private fields */ }
Expand description
Encodes RecordBatch
to parquet
Writes Arrow RecordBatch
es to a Parquet writer. Multiple RecordBatch
will be encoded
to the same row group, up to max_row_group_size
rows. Any remaining rows will be
flushed on close, leading the final row group in the output file to potentially
contain fewer than max_row_group_size
rows
let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None).unwrap();
writer.write(&to_write).unwrap();
writer.close().unwrap();
let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
let read = reader.next().unwrap().unwrap();
assert_eq!(to_write, read);
§Memory Limiting
The nature of parquet forces buffering of an entire row group before it can be flushed to the underlying writer. Data is mostly buffered in its encoded form, reducing memory usage. However, some data such as dictionary keys or large strings or very nested data may still result in non-trivial memory usage.
See Also:
ArrowWriter::memory_size
: the current memory usage of the writer.ArrowWriter::in_progress_size
: Estimated size of the buffered row group,
Call Self::flush
to trigger an early flush of a row group based on a
memory threshold and/or global memory pressure. However, smaller row groups
result in higher metadata overheads, and thus may worsen compression ratios
and query performance.
writer.write(&batch).unwrap();
// Trigger an early flush if anticipated size exceeds 1_000_000
if writer.in_progress_size() > 1_000_000 {
writer.flush().unwrap();
}
§Type Support
The writer supports writing all Arrow DataType
s that have a direct mapping to
Parquet types including StructArray
and ListArray
.
The following are not supported:
IntervalMonthDayNanoArray
: Parquet does not support nanosecond intervals.
Implementations§
Source§impl<W: Write + Send> ArrowWriter<W>
impl<W: Write + Send> ArrowWriter<W>
Sourcepub fn try_new(
writer: W,
arrow_schema: SchemaRef,
props: Option<WriterProperties>,
) -> Result<Self>
pub fn try_new( writer: W, arrow_schema: SchemaRef, props: Option<WriterProperties>, ) -> Result<Self>
Try to create a new Arrow writer
The writer will fail if:
- a
SerializedFileWriter
cannot be created from the ParquetWriter - the Arrow schema contains unsupported datatypes such as Unions
Sourcepub fn try_new_with_options(
writer: W,
arrow_schema: SchemaRef,
options: ArrowWriterOptions,
) -> Result<Self>
pub fn try_new_with_options( writer: W, arrow_schema: SchemaRef, options: ArrowWriterOptions, ) -> Result<Self>
Try to create a new Arrow writer with ArrowWriterOptions
.
The writer will fail if:
- a
SerializedFileWriter
cannot be created from the ParquetWriter - the Arrow schema contains unsupported datatypes such as Unions
Sourcepub fn flushed_row_groups(&self) -> &[RowGroupMetaData]
pub fn flushed_row_groups(&self) -> &[RowGroupMetaData]
Returns metadata for any flushed row groups
Sourcepub fn memory_size(&self) -> usize
pub fn memory_size(&self) -> usize
Estimated memory usage, in bytes, of this ArrowWriter
This estimate is formed bu summing the values of
ArrowColumnWriter::memory_size
all in progress columns.
Sourcepub fn in_progress_size(&self) -> usize
pub fn in_progress_size(&self) -> usize
Anticipated encoded size of the in progress row group.
This estimate the row group size after being completely encoded is,
formed by summing the values of
ArrowColumnWriter::get_estimated_total_bytes
for all in progress
columns.
Sourcepub fn in_progress_rows(&self) -> usize
pub fn in_progress_rows(&self) -> usize
Returns the number of rows buffered in the in progress row group
Sourcepub fn bytes_written(&self) -> usize
pub fn bytes_written(&self) -> usize
Returns the number of bytes written by this instance
Sourcepub fn write(&mut self, batch: &RecordBatch) -> Result<()>
pub fn write(&mut self, batch: &RecordBatch) -> Result<()>
Encodes the provided RecordBatch
If this would cause the current row group to exceed WriterProperties::max_row_group_size
rows, the contents of batch
will be written to one or more row groups such that all but
the final row group in the file contain WriterProperties::max_row_group_size
rows.
This will fail if the batch
’s schema does not match the writer’s schema.
Sourcepub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
pub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
Additional KeyValue
metadata to be written in addition to those from WriterProperties
This method provide a way to append kv_metadata after write RecordBatch
Sourcepub fn inner_mut(&mut self) -> &mut W
pub fn inner_mut(&mut self) -> &mut W
Returns a mutable reference to the underlying writer.
It is inadvisable to directly write to the underlying writer, doing so will likely result in a corrupt parquet file
Sourcepub fn into_inner(self) -> Result<W>
pub fn into_inner(self) -> Result<W>
Flushes any outstanding data and returns the underlying writer.
Sourcepub fn finish(&mut self) -> Result<FileMetaData>
pub fn finish(&mut self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer
Unlike Self::close
this does not consume self
Attempting to write after calling finish will result in an error
Sourcepub fn close(self) -> Result<FileMetaData>
pub fn close(self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer