Enum arrow_schema::DataType
pub enum DataType {
Null,
Boolean,
Int8,
Int16,
Int32,
Int64,
UInt8,
UInt16,
UInt32,
UInt64,
Float16,
Float32,
Float64,
Timestamp(TimeUnit, Option<Arc<str>>),
Date32,
Date64,
Time32(TimeUnit),
Time64(TimeUnit),
Duration(TimeUnit),
Interval(IntervalUnit),
Binary,
FixedSizeBinary(i32),
LargeBinary,
BinaryView,
Utf8,
LargeUtf8,
Utf8View,
List(FieldRef),
ListView(FieldRef),
FixedSizeList(FieldRef, i32),
LargeList(FieldRef),
LargeListView(FieldRef),
Struct(Fields),
Union(UnionFields, UnionMode),
Dictionary(Box<DataType>, Box<DataType>),
Decimal128(u8, i8),
Decimal256(u8, i8),
Map(FieldRef, bool),
RunEndEncoded(FieldRef, FieldRef),
}
Datatypes supported by this implementation of Apache Arrow.
The variants of this enum include primitive fixed-size types as well as parametric or nested types. See Schema.fbs for Arrow’s specification.
§Examples
Primitive types
// the 32-bit signed integer type
let data_type = DataType::Int32;
Nested Types
use std::sync::Arc;
use arrow_schema::{DataType, Field};
// create a list type of 32-bit signed integers directly
let list_data_type = DataType::List(Arc::new(Field::new("item", DataType::Int32, true)));
// create the same list type with the constructor
let list_data_type2 = DataType::new_list(DataType::Int32, true);
assert_eq!(list_data_type, list_data_type2);
Dictionary Types
// String Dictionary (key type Int32 and value type Utf8)
let data_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
Timestamp Types
use arrow_schema::{DataType, TimeUnit};
// timestamp with millisecond precision without timezone specified
let data_type = DataType::Timestamp(TimeUnit::Millisecond, None);
// timestamp with nanosecond precision in UTC timezone
let data_type = DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into()));
§Display and FromStr
The Display and FromStr implementations for DataType are human-readable, parseable, and reversible.
let data_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
let data_type_string = data_type.to_string();
assert_eq!(data_type_string, "Dictionary(Int32, Utf8)");
// display can be parsed back into the original type
let parsed_data_type: DataType = data_type.to_string().parse().unwrap();
assert_eq!(data_type, parsed_data_type);
§Nested Support
Currently, the Rust implementation supports the following nested types:
List<T>
LargeList<T>
FixedSizeList<T>
Struct<T, U, V, ...>
Union<T, U, V, ...>
Map<K, V>
Nested types can themselves be nested within other arrays. For more information on these types, please see the physical memory layout of Apache Arrow.
Variants§
Null
Null type
Boolean
A boolean datatype representing the values true and false.
Int8
A signed 8-bit integer.
Int16
A signed 16-bit integer.
Int32
A signed 32-bit integer.
Int64
A signed 64-bit integer.
UInt8
An unsigned 8-bit integer.
UInt16
An unsigned 16-bit integer.
UInt32
An unsigned 32-bit integer.
UInt64
An unsigned 64-bit integer.
Float16
A 16-bit floating point number.
Float32
A 32-bit floating point number.
Float64
A 64-bit floating point number.
Timestamp(TimeUnit, Option<Arc<str>>)
A timestamp with an optional timezone.
Time is measured relative to the Unix epoch: values count the elapsed time, in the unit given by TimeUnit, since 00:00:00.000 on 1 January 1970, excluding leap seconds, as a signed 64-bit integer.
The time zone is a string indicating the name of a time zone, one of:
- As used in the Olson time zone database (the “tz database” or “tzdata”), such as “America/New_York”
- An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
§Timestamps with a non-empty timezone
If a Timestamp column has a non-empty timezone value, its epoch is 1970-01-01 00:00:00 (January 1st 1970, midnight) in the UTC timezone (the Unix epoch), regardless of the Timestamp’s own timezone.
Therefore, timestamp values with a non-empty timezone correspond to physical points in time together with some additional information about how the data was obtained and/or how to display it (the timezone).
For example, the timestamp value 0 with the timezone string “Europe/Paris” corresponds to “January 1st 1970, 00h00” in the UTC timezone, but the application may prefer to display it as “January 1st 1970, 01h00” in the Europe/Paris timezone (which is the same physical point in time).
One consequence is that timestamp values with a non-empty timezone can be compared and ordered directly, since they all share the same well-known point of reference (the Unix epoch).
§Timestamps with an unset / empty timezone
If a Timestamp column has no timezone value, its epoch is 1970-01-01 00:00:00 (January 1st 1970, midnight) in an unknown timezone.
Therefore, timestamp values without a timezone cannot be meaningfully interpreted as physical points in time, but only as calendar / clock indications (“wall clock time”) in an unspecified timezone.
For example, the timestamp value 0 with an empty timezone string corresponds to “January 1st 1970, 00h00” in an unknown timezone: there is not enough information to interpret it as a well-defined physical point in time.
One consequence is that timestamp values without a timezone cannot be reliably compared or ordered, since they may have different points of reference. In particular, it is not possible to interpret an unset or empty timezone as the same as “UTC”.
§Conversion between timezones
If a Timestamp column has a non-empty timezone, changing the timezone to a different non-empty value is a metadata-only operation: the timestamp values need not change as their point of reference remains the same (the Unix epoch).
However, if a Timestamp column has no timezone value, changing it to a non-empty value requires thinking about the desired semantics. One possibility is to assume that the original timestamp values are relative to the epoch of the timezone being set; timestamp values should then be adjusted to the Unix epoch (for example, changing the timezone from empty to “Europe/Paris” would require converting the timestamp values from “Europe/Paris” to “UTC”, which seems counter-intuitive but is nevertheless correct).
DataType::Timestamp(TimeUnit::Second, None);
DataType::Timestamp(TimeUnit::Second, Some("literal".into()));
DataType::Timestamp(TimeUnit::Second, Some("string".to_string().into()));
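The conversion rule described above can be sketched in plain Rust for the simple case of an absolute +XX:XX or -XX:XX offset. This is only an illustration under stated assumptions: `parse_fixed_offset`, `wall_clock_to_utc` and `utc_to_wall_clock` are hypothetical helpers and not part of arrow_schema, values are assumed to be in seconds, and named zones such as “Europe/Paris” would need a tz database lookup that is not shown here.

```rust
/// Reads two ASCII digits as a number, rejecting anything else (hypothetical helper).
fn two_digits(bytes: &[u8]) -> Option<i64> {
    match bytes {
        [hi, lo] if hi.is_ascii_digit() && lo.is_ascii_digit() => {
            Some(i64::from(hi - b'0') * 10 + i64::from(lo - b'0'))
        }
        _ => None,
    }
}

/// Parses an offset of the form +XX:XX or -XX:XX into seconds east of UTC.
/// Returns None for any other input, including named time zones.
fn parse_fixed_offset(tz: &str) -> Option<i64> {
    let b = tz.as_bytes();
    if b.len() != 6 || b[3] != b':' {
        return None;
    }
    let sign = match b[0] {
        b'+' => 1,
        b'-' => -1,
        _ => return None,
    };
    let hours = two_digits(&b[1..3])?;
    let minutes = two_digits(&b[4..6])?;
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(sign * (hours * 3600 + minutes * 60))
}

/// Assigning a timezone to timezone-less values: the wall-clock reading is
/// reinterpreted as local time in that zone and shifted onto the Unix epoch.
fn wall_clock_to_utc(wall_clock_secs: i64, offset_secs: i64) -> i64 {
    wall_clock_secs - offset_secs
}

/// Displaying a timezone-aware value: the stored value stays relative to the
/// Unix epoch, and the offset is applied only for presentation.
fn utc_to_wall_clock(utc_secs: i64, offset_secs: i64) -> i64 {
    utc_secs + offset_secs
}
```

With an offset of +01:00 (3600 seconds), the timezone-less value 0 becomes the epoch-relative value -3600, while the timezone-aware value 0 is displayed as 3600 seconds past local midnight, that is 01h00.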
Date32
A signed 32-bit date representing the elapsed time since UNIX epoch (1970-01-01) in days.
Date64
A signed 64-bit date representing the elapsed time since UNIX epoch (1970-01-01) in milliseconds.
§Valid Ranges
According to the Arrow specification (Schema.fbs), values of Date64 are treated as the number of days, in milliseconds, since the UNIX epoch. Therefore, values of this type must be evenly divisible by 86_400_000, the number of milliseconds in a standard day.
It is not valid to store milliseconds that do not represent an exact day. The reason for this restriction is compatibility with other languages’ native libraries (specifically Java), which historically lacked a dedicated date type and only supported timestamps.
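Because the library itself does not enforce the divisibility rule, a caller who wants to check it could do so by hand. The sketch below uses plain integers; `is_whole_day`, `date64_to_date32` and `date32_to_date64` are hypothetical helper names and not functions provided by arrow_schema.

```rust
use std::convert::TryFrom;

/// Number of milliseconds in a standard day, as stated in the text above.
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Returns true if a Date64 value represents an exact day.
fn is_whole_day(date64: i64) -> bool {
    date64 % MILLIS_PER_DAY == 0
}

/// Converts a Date64 value to a Date32 day count.
/// Returns None if the value is not an exact day or does not fit in an i32.
fn date64_to_date32(date64: i64) -> Option<i32> {
    if !is_whole_day(date64) {
        return None;
    }
    i32::try_from(date64 / MILLIS_PER_DAY).ok()
}

/// Converts a Date32 day count to the equivalent Date64 value.
fn date32_to_date64(date32: i32) -> i64 {
    i64::from(date32) * MILLIS_PER_DAY
}
```

Under these definitions 172_800_000 maps to day 2, while 1 is rejected because it carries a time-of-day component.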
§Validation
This library does not validate or enforce that Date64 values are evenly divisible by 86_400_000 for performance and usability reasons. Date64 values are treated similarly to Timestamp(TimeUnit::Millisecond, None): values will be displayed with a time of day if the value does not represent an exact day, and arithmetic will be done at the millisecond granularity.
§Recommendation
Users should prefer DataType::Date32 to cleanly represent the number of days, or one of the Timestamp variants to include time as part of the representation, depending on their use case.
§Further Reading
For more details, see #5288.
Time32(TimeUnit)
A signed 32-bit time representing the elapsed time since midnight in the unit of TimeUnit. Must be either seconds or milliseconds.
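As a small illustration of "elapsed time since midnight", a Time32 value in seconds can be split into a clock reading. `time32_seconds_to_hms` is a hypothetical helper, not part of arrow_schema, and the accepted range of 0 to 86_399 is an assumption of this sketch.

```rust
/// Splits a count of seconds since midnight into (hours, minutes, seconds).
/// Returns None for values outside 0..86_400 (an assumption of this sketch).
fn time32_seconds_to_hms(value: i32) -> Option<(u32, u32, u32)> {
    if !(0..86_400).contains(&value) {
        return None;
    }
    let v = value as u32;
    Some((v / 3600, (v % 3600) / 60, v % 60))
}
```

For example, 3661 corresponds to 01:01:01 and 86_399 to 23:59:59.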
Time64(TimeUnit)
A signed 64-bit time representing the elapsed time since midnight in the unit of TimeUnit. Must be either microseconds or nanoseconds.
Duration(TimeUnit)
Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds.
Interval(IntervalUnit)
A “calendar” interval which models types that don’t necessarily have a precise duration without the context of a base timestamp (e.g. days can differ in length during daylight saving time transitions).
Binary
Opaque binary data of variable length.
A single Binary array can store up to i32::MAX bytes of binary data in total.
FixedSizeBinary(i32)
Opaque binary data of fixed size. Enum parameter specifies the number of bytes per value.
LargeBinary
Opaque binary data of variable length and 64-bit offsets.
A single LargeBinary array can store up to i64::MAX bytes of binary data in total.
BinaryView
Opaque binary data of variable length.
Logically the same as Self::Binary, but the internal representation uses a view struct that contains the string length and either the string’s entire data inline (for small strings) or an inlined prefix, an index of another buffer, and an offset pointing to a slice in that buffer (for non-small strings).
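The view layout described above can be modelled roughly as follows. This is a sketch and not the arrow-rs representation: the `View` enum, `make_view` and `resolve` are hypothetical names, and the 12-byte inline limit and 4-byte prefix are assumptions taken from the Arrow columnar format's 16-byte view layout.

```rust
/// A simplified model of one view entry (hypothetical type).
#[derive(Debug, PartialEq)]
enum View {
    /// Small value: the bytes live directly inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Larger value: a prefix plus a pointer into a separate data buffer.
    Reference { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Values up to this many bytes are stored inline (assumed limit).
const INLINE_LIMIT: usize = 12;

/// Builds a view for `value`, appending to `buffer` only when it does not fit inline.
fn make_view(value: &[u8], buffer_index: u32, buffer: &mut Vec<u8>) -> View {
    let len = value.len() as u32;
    if value.len() <= INLINE_LIMIT {
        let mut data = [0u8; 12];
        data[..value.len()].copy_from_slice(value);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(value);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&value[..4]);
        View::Reference { len, prefix, buffer_index, offset }
    }
}

/// Returns the bytes a view refers to, reading from `buffers` when needed.
fn resolve<'a>(view: &'a View, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    match view {
        View::Inline { len, data } => &data[..*len as usize],
        View::Reference { len, buffer_index, offset, .. } => {
            let start = *offset as usize;
            &buffers[*buffer_index as usize][start..start + *len as usize]
        }
    }
}
```

One consequence of this design is that short values never touch the data buffers, and the stored prefix lets some comparisons be decided without following the reference.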
Utf8
A variable-length string in Unicode with UTF-8 encoding.
A single Utf8 array can store up to i32::MAX bytes of string data in total.
LargeUtf8
A variable-length string in Unicode with UTF-8 encoding and 64-bit offsets.
A single LargeUtf8 array can store up to i64::MAX bytes of string data in total.
Utf8View
A variable-length string in Unicode with UTF-8 encoding.
Logically the same as Self::Utf8, but the internal representation uses a view struct that contains the string length and either the string’s entire data inline (for small strings) or an inlined prefix, an index of another buffer, and an offset pointing to a slice in that buffer (for non-small strings).
List(FieldRef)
A list of some logical data type with variable length.
A single List array can store up to i32::MAX elements in total.
ListView(FieldRef)
(NOT YET FULLY SUPPORTED) A list of some logical data type with variable length.
Note this data type is not yet fully supported. Using it with arrow APIs may result in panics.
The ListView layout is defined by three buffers: a validity bitmap, an offsets buffer, and an additional sizes buffer. Sizes and offsets are both 32 bits for this type.
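To make the role of the extra sizes buffer concrete, the sketch below contrasts how element `i` could be located in a List-style layout (offsets only) and in a ListView-style layout (offsets plus sizes). The validity bitmap is ignored, and `list_element` and `list_view_element` are hypothetical helpers operating on plain slices, not arrow APIs.

```rust
/// List-style lookup: element i spans offsets[i]..offsets[i + 1],
/// so the offsets buffer has one more entry than there are elements.
fn list_element<'a, T>(values: &'a [T], offsets: &[i32], i: usize) -> &'a [T] {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    &values[start..end]
}

/// ListView-style lookup: element i starts at offsets[i] and has sizes[i] items,
/// so each element is described independently of its neighbours.
fn list_view_element<'a, T>(
    values: &'a [T],
    offsets: &[i32],
    sizes: &[i32],
    i: usize,
) -> &'a [T] {
    let start = offsets[i] as usize;
    let end = start + sizes[i] as usize;
    &values[start..end]
}
```

With child values `[1, 2, 3, 4, 5]`, list offsets `[0, 2, 5]` describe the elements `[1, 2]` and `[3, 4, 5]`, while view offsets `[3, 0]` with sizes `[2, 3]` describe `[4, 5]` and `[1, 2, 3]` over the same child values.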
FixedSizeList(FieldRef, i32)
A list of some logical data type with fixed length.
LargeList(FieldRef)
A list of some logical data type with variable length and 64-bit offsets.
A single LargeList array can store up to i64::MAX elements in total.
LargeListView(FieldRef)
(NOT YET FULLY SUPPORTED) A list of some logical data type with variable length and 64-bit offsets.
Note this data type is not yet fully supported. Using it with arrow APIs may result in panics.
The LargeListView layout is defined by three buffers: a validity bitmap, an offsets buffer, and an additional sizes buffer. Sizes and offsets are both 64 bits for this type.
Struct(Fields)
A nested datatype that contains a number of sub-fields.
Union(UnionFields, UnionMode)
A nested datatype that can represent slots of differing types. Components:
- UnionFields
- UnionMode: the type of union (Sparse or Dense)
Dictionary(Box<DataType>, Box<DataType>)
A dictionary encoded array (key_type, value_type), where each array element is an index of key_type into an associated dictionary of value_type.
Dictionary arrays are used to store columns of value_type that contain many repeated values using less memory, but with a higher CPU overhead for some operations.
This type is mostly used to represent low cardinality string arrays or a limited set of primitive types as integers.
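The idea behind dictionary encoding can be illustrated with plain vectors: each distinct value is stored once, and the column itself becomes a sequence of integer keys. `dictionary_encode` and `dictionary_decode` are hypothetical helpers written for this sketch (with Int32 keys and string values assumed), not the arrow implementation.

```rust
use std::collections::HashMap;

/// Splits a string column into integer keys and a dictionary of distinct values.
/// Dictionary entries are kept in order of first appearance.
fn dictionary_encode(column: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut index_of: HashMap<&str, i32> = HashMap::new();
    let mut keys = Vec::with_capacity(column.len());
    let mut values = Vec::new();
    for &item in column {
        // Reuse the existing key, or append a new dictionary entry.
        let key = *index_of.entry(item).or_insert_with(|| {
            values.push(item.to_string());
            (values.len() - 1) as i32
        });
        keys.push(key);
    }
    (keys, values)
}

/// Rebuilds the original column by looking each key up in the dictionary.
fn dictionary_decode(keys: &[i32], values: &[String]) -> Vec<String> {
    keys.iter().map(|&k| values[k as usize].clone()).collect()
}
```

The memory saving comes from storing each repeated string once; the extra CPU cost shows up as the lookup in `dictionary_decode`.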
Decimal128(u8, i8)
Exact 128-bit width decimal value with precision and scale:
- precision is the total number of digits
- scale is the number of digits past the decimal point
For example, the number 123.45 has precision 5 and scale 2.
In certain situations, the scale can be negative. A negative scale gives the number of zeros padded to the right of the digits.
For example, the number 12300 can be treated as a decimal with precision 3 and scale -2.
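The relationship between the stored integer, the scale, and the represented number can be sketched as a formatting routine. `decimal_to_string` is a hypothetical helper for illustration only; it assumes the decimal is held as an unscaled i128 and is not the formatting code used by arrow.

```rust
/// Renders an unscaled integer with the given scale as a decimal string.
/// A positive scale places that many digits after the decimal point;
/// a zero or negative scale appends |scale| zeros to the digits.
fn decimal_to_string(unscaled: i128, scale: i8) -> String {
    if scale <= 0 {
        if unscaled == 0 {
            return "0".to_string();
        }
        let zeros = "0".repeat(scale.unsigned_abs() as usize);
        return format!("{}{}", unscaled, zeros);
    }
    let scale = scale as usize;
    let sign = if unscaled < 0 { "-" } else { "" };
    let digits = unscaled.unsigned_abs().to_string();
    // Left-pad with zeros so at least one digit precedes the decimal point.
    let padded = format!("{:0>width$}", digits, width = scale + 1);
    let (int_part, frac_part) = padded.split_at(padded.len() - scale);
    format!("{}{}.{}", sign, int_part, frac_part)
}
```

Under this sketch the unscaled value 12345 with scale 2 renders as 123.45, and the unscaled value 123 with scale -2 renders as 12300, matching the two examples in the text.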
Decimal256(u8, i8)
Exact 256-bit width decimal value with precision and scale:
- precision is the total number of digits
- scale is the number of digits past the decimal point
For example, the number 123.45 has precision 5 and scale 2.
In certain situations, the scale can be negative. A negative scale gives the number of zeros padded to the right of the digits.
For example, the number 12300 can be treated as a decimal with precision 3 and scale -2.
Map(FieldRef, bool)
A Map is a logical nested type that is represented as List<entries: Struct<key: K, value: V>>.
The keys and values are each respectively contiguous. The key and value types are not constrained, but keys should be hashable and unique. Whether the keys are sorted can be set in the bool after the Field.
In a field with Map type, the field has a child Struct field, which in turn has two children: the first is the key type and the second is the value type. The child fields are conventionally named “entries”, “key”, and “value”, respectively, but this is not enforced.
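The list-of-struct representation means that all keys sit in one contiguous child and all values in another, with list offsets marking where each map begins and ends. The sketch below models this with plain slices; `map_entries` is a hypothetical helper, and the string keys and i64 values are assumptions chosen only for the example.

```rust
/// Returns the (key, value) pairs of the i-th map, given list-style offsets
/// into the contiguous `keys` and `values` children.
fn map_entries<'a>(
    offsets: &[i32],
    keys: &'a [&'a str],
    values: &'a [i64],
    i: usize,
) -> Vec<(&'a str, i64)> {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    keys[start..end]
        .iter()
        .copied()
        .zip(values[start..end].iter().copied())
        .collect()
}
```

With offsets `[0, 2, 3]`, keys `["a", "b", "c"]` and values `[1, 2, 3]`, the first map holds the entries ("a", 1) and ("b", 2), and the second holds ("c", 3).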
RunEndEncoded(FieldRef, FieldRef)
A run-end encoding (REE) is a variation of run-length encoding (RLE). These encodings are well-suited for representing data containing sequences of the same value, called runs. Each run is represented as a value and an integer giving the index in the array where the run ends.
A run-end encoded array has no buffers by itself, but has two child arrays. The first child array, called the run ends array, holds either 16, 32, or 64-bit signed integers. The actual values of each run are held in the second child array.
These child arrays are prescribed the standard names of “run_ends” and “values” respectively.
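A run-end encoded column can be expanded, or indexed without expanding, using only the two children. The sketch below works on plain slices and assumes each run end is the exclusive logical index at which its run stops (so the last run end equals the logical length); `ree_decode` and `ree_get` are hypothetical helpers, not arrow APIs.

```rust
/// Expands run ends and run values into the full logical sequence.
fn ree_decode<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (&end, value) in run_ends.iter().zip(values) {
        let end = end as usize;
        // Each run covers the logical indices start..end.
        out.extend(std::iter::repeat(value.clone()).take(end - start));
        start = end;
    }
    out
}

/// Looks up one logical index by binary-searching the run ends.
/// Returns None if the index is past the end of the logical array.
fn ree_get<'a, T>(run_ends: &[i32], values: &'a [T], index: usize) -> Option<&'a T> {
    // Find the first run whose end lies strictly after `index`.
    let run = run_ends.partition_point(|&end| (end as usize) <= index);
    values.get(run)
}
```

For run ends `[3, 5, 6]` and values `['a', 'b', 'c']`, the logical sequence is a, a, a, b, b, c, and logical index 3 falls in the second run.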
Implementations§
impl DataType
pub fn is_primitive(&self) -> bool
Returns true if the type is primitive: (numeric, temporal).
pub fn is_numeric(&self) -> bool
Returns true if this type is numeric: (UInt*, Int*, Float*, Decimal*).
pub fn is_temporal(&self) -> bool
Returns true if this type is temporal: (Date*, Time*, Duration, or Interval).
pub fn is_floating(&self) -> bool
Returns true if this type is floating: (Float*).
pub fn is_integer(&self) -> bool
Returns true if this type is integer: (Int*, UInt*).
pub fn is_signed_integer(&self) -> bool
Returns true if this type is signed integer: (Int*).
pub fn is_unsigned_integer(&self) -> bool
Returns true if this type is unsigned integer: (UInt*).
pub fn is_dictionary_key_type(&self) -> bool
Returns true if this type is valid as a dictionary key
pub fn is_run_ends_type(&self) -> bool
Returns true if this type is valid for run-ends array in RunArray
pub fn is_nested(&self) -> bool
Returns true if this type is nested (List, FixedSizeList, LargeList, Struct, Union, or Map), or a dictionary of a nested type
pub fn equals_datatype(&self, other: &DataType) -> bool
Compares the datatype with another, ignoring nested field names and metadata.
pub fn primitive_width(&self) -> Option<usize>
Returns the byte width of this type if it is a primitive type. Returns None if not a primitive type.
pub fn contains(&self, other: &DataType) -> bool
Check to see if self is a superset of other.
If DataType is a nested type, then it will check to see if the nested type is a superset of the other nested type; otherwise it will check to see if the DataType is equal to the other DataType.
pub fn new_list(data_type: DataType, nullable: bool) -> Self
Create a DataType::List with elements of the specified type and nullability, and conventionally named inner Field ("item").
To specify field level metadata, construct the inner Field directly via Field::new or Field::new_list_field.
pub fn new_large_list(data_type: DataType, nullable: bool) -> Self
Create a DataType::LargeList with elements of the specified type and nullability, and conventionally named inner Field ("item").
To specify field level metadata, construct the inner Field directly via Field::new or Field::new_list_field.
pub fn new_fixed_size_list(data_type: DataType, size: i32, nullable: bool) -> Self
Create a DataType::FixedSizeList with elements of the specified type, size and nullability, and conventionally named inner Field ("item").
To specify field level metadata, construct the inner Field directly via Field::new or Field::new_list_field.
Trait Implementations§
impl FromStr for DataType
Parses str into a DataType.
This is the reverse of DataType’s Display impl, and maintains the invariant that
DataType::try_from(&data_type.to_string()).unwrap() == data_type
§Example
use arrow_schema::DataType;
let data_type: DataType = "Int32".parse().unwrap();
assert_eq!(data_type, DataType::Int32);
impl Ord for DataType
impl PartialOrd for DataType
impl Eq for DataType
impl StructuralPartialEq for DataType
Auto Trait Implementations§
impl Freeze for DataType
impl RefUnwindSafe for DataType
impl Send for DataType
impl Sync for DataType
impl Unpin for DataType
impl UnwindSafe for DataType
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone,
default unsafe fn clone_to_uninit(&self, dst: *mut T)