Struct parquet::arrow::arrow_writer::ArrowWriter
pub struct ArrowWriter<W: Write> { /* private fields */ }
Encodes RecordBatch to parquet

Writes Arrow RecordBatches to a Parquet writer. Multiple RecordBatches will be encoded into the same row group, up to max_row_group_size rows. Any remaining rows will be flushed on close, so the final row group in the output file may contain fewer than max_row_group_size rows.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None).unwrap();
writer.write(&to_write).unwrap();
writer.close().unwrap();

let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
let read = reader.next().unwrap().unwrap();

assert_eq!(to_write, read);
§Memory Limiting
The nature of parquet forces buffering of an entire row group before it can be flushed to the underlying writer. Data is buffered in its encoded form, to reduce memory usage, but if writing rows containing large strings or very nested data, this may still result in non-trivial memory usage.
ArrowWriter::in_progress_size can be used to track the size of the buffered row group, and potentially trigger an early flush of a row group based on a memory threshold and/or global memory pressure. However, users should be aware that smaller row groups will result in higher metadata overheads, and may worsen compression ratios and query performance.
// Assumes `writer` is an open ArrowWriter and `batch` is a RecordBatch
writer.write(&batch).unwrap();
// Trigger an early flush if the buffered, encoded size exceeds 1_000_000 bytes
if writer.in_progress_size() > 1_000_000 {
    writer.flush().unwrap();
}
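A fuller, self-contained sketch of this pattern, assuming batches arrive from some iterator; the 1_000_000-byte threshold is illustrative, not a recommendation:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

let col = Arc::new(Int64Array::from_iter_values(0..1024)) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
for batch in std::iter::repeat(batch).take(10) {
    writer.write(&batch).unwrap();
    // Flush the in-progress row group early once its encoded size crosses the threshold
    if writer.in_progress_size() > 1_000_000 {
        writer.flush().unwrap();
    }
}
writer.close().unwrap();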
Implementations§
impl<W: Write + Send> ArrowWriter<W>
pub fn try_new(
    writer: W,
    arrow_schema: SchemaRef,
    props: Option<WriterProperties>,
) -> Result<Self>
Try to create a new Arrow writer

The writer will fail if:
- a SerializedFileWriter cannot be created from the ParquetWriter
- the Arrow schema contains unsupported datatypes such as Unions
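For example, explicit WriterProperties can be supplied via the props argument; the tiny max_row_group_size below is purely illustrative:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

// Cap row groups at 2 rows so the 3-row batch spans two row groups
let props = WriterProperties::builder()
    .set_max_row_group_size(2)
    .build();
let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), Some(props)).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();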
pub fn try_new_with_options(
    writer: W,
    arrow_schema: SchemaRef,
    options: ArrowWriterOptions,
) -> Result<Self>
Try to create a new Arrow writer with ArrowWriterOptions.

The writer will fail if:
- a SerializedFileWriter cannot be created from the ParquetWriter
- the Arrow schema contains unsupported datatypes such as Unions
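A minimal sketch, assuming ArrowWriterOptions exposes new, with_properties and with_skip_arrow_metadata builder methods (consult the options type in your parquet version for the full set):

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::arrow_writer::{ArrowWriter, ArrowWriterOptions};
use parquet::file::properties::WriterProperties;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let options = ArrowWriterOptions::new()
    .with_properties(WriterProperties::builder().build())
    // Don't embed the serialized Arrow schema in the Parquet metadata
    .with_skip_arrow_metadata(true);
let mut writer =
    ArrowWriter::try_new_with_options(Vec::new(), batch.schema(), options).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();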
pub fn flushed_row_groups(&self) -> &[RowGroupMetaDataPtr]
Returns metadata for any flushed row groups
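For instance, the metadata of completed row groups can be inspected after an explicit flush; num_rows and total_byte_size are assumed here to be the usual RowGroupMetaData accessors:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.flush().unwrap(); // complete the in-progress row group
for rg in writer.flushed_row_groups() {
    println!("rows: {}, uncompressed bytes: {}", rg.num_rows(), rg.total_byte_size());
}
writer.close().unwrap();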
pub fn in_progress_size(&self) -> usize
Returns the estimated length in bytes of the current in-progress row group
pub fn in_progress_rows(&self) -> usize
Returns the number of rows buffered in the in-progress row group
pub fn bytes_written(&self) -> usize
Returns the number of bytes written by this instance
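Together with in_progress_size and in_progress_rows, this allows a writer to be monitored as data arrives; a small sketch:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
// All 3 rows are still buffered: the default row group size limit is far larger
assert_eq!(writer.in_progress_rows(), 3);
println!(
    "buffered: {} bytes, written to output: {} bytes",
    writer.in_progress_size(),
    writer.bytes_written()
);
writer.close().unwrap();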
pub fn write(&mut self, batch: &RecordBatch) -> Result<()>
Encodes the provided RecordBatch

If this would cause the current row group to exceed WriterProperties::max_row_group_size rows, the contents of batch will be written to one or more row groups such that all but the final row group in the file contain WriterProperties::max_row_group_size rows.

This will fail if the batch’s schema does not match the writer’s schema.
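A sketch of this splitting behaviour with an artificially small limit of 2 rows per row group; the assertions reflect the documented behaviour above:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3, 4, 5])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let props = WriterProperties::builder().set_max_row_group_size(2).build();
let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), Some(props)).unwrap();
writer.write(&batch).unwrap();
// Two full row groups of 2 rows each were flushed; the 5th row remains buffered
assert_eq!(writer.flushed_row_groups().len(), 2);
assert_eq!(writer.in_progress_rows(), 1);
writer.close().unwrap();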
pub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
Additional KeyValue metadata to be written in addition to those from WriterProperties

This method provides a way to append kv_metadata after writing RecordBatches
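For example, assuming KeyValue is the thrift-generated type re-exported from parquet::file::metadata, whose constructor takes a key and an optional value:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::file::metadata::KeyValue;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
// Attach application-specific metadata after the data has been written
writer.append_key_value_metadata(KeyValue::new("my_key".to_string(), "my_value".to_string()));
writer.close().unwrap();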
pub fn inner_mut(&mut self) -> &mut W
Returns a mutable reference to the underlying writer.
It is inadvisable to write directly to the underlying writer; doing so will likely result in a corrupt Parquet file
pub fn into_inner(self) -> Result<W>
Flushes any outstanding data and returns the underlying writer.
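For example, to finalize writing and recover an owned output buffer:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
// Flushes outstanding data and hands back the underlying Vec<u8>
let buffer: Vec<u8> = writer.into_inner().unwrap();
assert!(!buffer.is_empty());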
pub fn finish(&mut self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer
Unlike Self::close this does not consume self
Attempting to write after calling finish will result in an error
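finish is useful when the writer cannot be consumed, e.g. when it is owned by a struct; a minimal sketch:

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter([("col", col)]).unwrap();

let mut writer = ArrowWriter::try_new(Vec::new(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
let _metadata = writer.finish().unwrap(); // writes the footer; writer is not consumed
// The writer can still be inspected afterwards...
assert!(writer.bytes_written() > 0);
// ...but further writes return an error, as documented above
assert!(writer.write(&batch).is_err());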
pub fn close(self) -> Result<FileMetaData>
Close and finalize the underlying Parquet writer