# Data Types and Schema

Comprehensive type system supporting primitive types, temporal data, nested structures, and schema validation with automatic type inference, casting capabilities, and interoperability with Arrow and other data formats.

## Capabilities

### Primitive Data Types

Fundamental numeric and boolean types with support for various precision levels and null value handling.

```python { .api }
# Integer Types
Int8: DataType # 8-bit signed integer
Int16: DataType # 16-bit signed integer
Int32: DataType # 32-bit signed integer
Int64: DataType # 64-bit signed integer
Int128: DataType # 128-bit signed integer

# Unsigned Integer Types
UInt8: DataType # 8-bit unsigned integer
UInt16: DataType # 16-bit unsigned integer
UInt32: DataType # 32-bit unsigned integer
UInt64: DataType # 64-bit unsigned integer

# Floating Point Types
Float32: DataType # 32-bit floating point
Float64: DataType # 64-bit floating point

# Decimal Type
Decimal: DataType # High-precision decimal type

# Boolean Type
Boolean: DataType # Boolean true/false
```

### String and Binary Types

Text and binary data types with categorical optimization and encoding support.

```python { .api }
# String Types
String: DataType # UTF-8 string type
Utf8: DataType # Alias for String (deprecated)

# Binary Type
Binary: DataType # Binary data type

# Categorical Types
Categorical: DataType # Categorical string type for efficiency
Enum: DataType # Enumerated string type with fixed categories

# Special Types
Null: DataType # Null type
Unknown: DataType # Unknown type placeholder
Object: DataType # Python object type
```

### Temporal Data Types

Date, time, and duration types with timezone support and various precision levels.

```python { .api }
# Date and Time Types
Date: DataType # Date type (days since epoch)
Time: DataType # Time of day type
Duration: DataType # Time duration type

# DateTime Type with timezone support
Datetime: DataType # DateTime with optional timezone

# DateTime constructor
def Datetime(time_unit="us", time_zone=None) -> DataType:
    """
    Create datetime type with specified precision and timezone.

    Parameters:
    - time_unit: Precision ("ns", "us", "ms")
    - time_zone: Timezone string (e.g., "UTC", "America/New_York")

    Returns:
    Datetime data type
    """

# Duration constructor
def Duration(time_unit="us") -> DataType:
    """
    Create duration type with specified precision.

    Parameters:
    - time_unit: Precision ("ns", "us", "ms")

    Returns:
    Duration data type
    """
```

### Nested Data Types

Complex nested structures supporting lists, arrays, and structured data.

```python { .api }
# List Type (variable length)
List: DataType

def List(inner=None) -> DataType:
    """
    Create list type with specified inner type.

    Parameters:
    - inner: Inner data type for list elements

    Returns:
    List data type
    """

# Array Type (fixed length)
Array: DataType

def Array(inner=None, width=None) -> DataType:
    """
    Create array type with specified inner type and width.

    Parameters:
    - inner: Inner data type for array elements
    - width: Fixed width of array

    Returns:
    Array data type
    """

# Struct Type
Struct: DataType

def Struct(fields=None) -> DataType:
    """
    Create struct type with specified fields.

    Parameters:
    - fields: List of Field objects or dict mapping names to types

    Returns:
    Struct data type
    """
```

### Schema Management

Schema definition and validation with field specifications and type checking.

```python { .api }
class Schema:
    def __init__(self, schema=None):
        """
        Create schema from various inputs.

        Parameters:
        - schema: Dict mapping column names to types, iterable of (name, dtype) pairs, or existing Schema
        """

    def names(self) -> list[str]:
        """Get column names in schema order."""

    def dtypes(self) -> list[DataType]:
        """Get column data types in schema order."""

    def len(self) -> int:
        """Get number of columns in schema."""

    def __contains__(self, item) -> bool:
        """Check if column name exists in schema."""

    def __getitem__(self, item) -> DataType:
        """Get data type for column name."""

    def __iter__(self):
        """Iterate over column names in schema order (use items() for (name, dtype) pairs)."""

class Field:
    def __init__(self, name: str, dtype: DataType):
        """
        Create field definition.

        Parameters:
        - name: Field name
        - dtype: Field data type
        """

    @property
    def name(self) -> str:
        """Field name."""

    @property
    def dtype(self) -> DataType:
        """Field data type."""
```

### Type Utilities and Checking

Functions for type inspection, validation, and conversion operations.

```python { .api }
def dtype_of(value) -> DataType:
    """
    Get the data type of a value or expression.

    Parameters:
    - value: Value or expression to inspect

    Returns:
    Data type of the value
    """

class DataType:
    def __eq__(self, other) -> bool:
        """Check type equality."""

    def __ne__(self, other) -> bool:
        """Check type inequality."""

    def __hash__(self) -> int:
        """Hash for use in sets/dicts."""

    def __repr__(self) -> str:
        """String representation."""

    def is_numeric(self) -> bool:
        """Check if type is numeric."""

    def is_integer(self) -> bool:
        """Check if type is integer."""

    def is_float(self) -> bool:
        """Check if type is floating point."""

    def is_temporal(self) -> bool:
        """Check if type is temporal."""

    def is_nested(self) -> bool:
        """Check if type is nested (List, Array, Struct)."""
```

### Categorical Types

Categorical and enumerated types for memory-efficient string handling with optional ordering.

```python { .api }
def Categorical(ordering=None) -> DataType:
    """
    Create categorical type.

    Parameters:
    - ordering: Ordering type ("physical" or "lexical")

    Returns:
    Categorical data type
    """

def Enum(categories=None) -> DataType:
    """
    Create enum type with fixed categories.

    Parameters:
    - categories: List of valid category strings

    Returns:
    Enum data type
    """

class Categories:
    def __init__(self, categories=None):
        """
        Create categories definition.

        Parameters:
        - categories: List of category strings
        """
```

### Decimal Type

High-precision decimal type for financial and scientific calculations requiring exact decimal representation.

```python { .api }
def Decimal(precision=None, scale=0) -> DataType:
    """
    Create decimal type with specified precision and scale.

    Parameters:
    - precision: Total number of digits (default: inferred)
    - scale: Number of digits after decimal point

    Returns:
    Decimal data type
    """
```

## Usage Examples

### Basic Type Creation and Usage

```python
import polars as pl

# Create DataFrame with explicit types
df = pl.DataFrame({
    "id": [1, 2, 3],
    "price": [10.5, 20.0, 15.75],
    "category": ["A", "B", "A"],
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"]
}, schema={
    "id": pl.Int32,
    "price": pl.Float64,
    "category": pl.Categorical,
    "date": pl.String  # Kept as text here; parsed into Date below
})

# Parse the ISO date strings into a Date column explicitly
df = df.with_columns(pl.col("date").str.to_date())

# Check schema
print(df.schema)
print(df.dtypes)
```

### Working with Temporal Types

```python
from datetime import timedelta

import polars as pl

# Create datetime type with millisecond precision and a UTC timezone
dt_type = pl.Datetime("ms", "UTC")

# Build the frame from ISO strings and timedelta objects; free-form text
# such as "1h 30m" is not parsed into a Duration
df = pl.DataFrame({
    "timestamp": ["2023-01-01T10:30:00", "2023-01-01T11:45:00"],
    "date": ["2023-01-01", "2023-01-02"],
    "duration": [timedelta(hours=1, minutes=30), timedelta(hours=2, minutes=15)]
}, schema_overrides={
    "duration": pl.Duration("ms")
})

# Parse the string columns into temporal types
df = df.with_columns([
    pl.col("timestamp").str.strptime(dt_type),
    pl.col("date").str.to_date()
])

# Convert and work with temporal data
result = df.with_columns([
    pl.col("timestamp").dt.hour().alias("hour"),
    pl.col("date").dt.day().alias("day"),
    pl.col("duration").dt.total_seconds().alias("duration_seconds")
])
```

### Nested Types: Lists and Structs

```python
# Working with List types
df = pl.DataFrame({
    "id": [1, 2, 3],
    "scores": [[85, 90, 88], [92, 87, 95], [78, 82, 85]]
}, schema={
    "id": pl.Int32,
    "scores": pl.List(pl.Int32)
})

# Operations on lists
result = df.with_columns([
    pl.col("scores").list.mean().alias("avg_score"),
    pl.col("scores").list.max().alias("max_score"),
    pl.col("scores").list.len().alias("num_scores")
])

# Working with Struct types
df = pl.DataFrame({
    "person": [
        {"name": "Alice", "age": 25, "city": "NYC"},
        {"name": "Bob", "age": 30, "city": "LA"},
    ]
}, schema={
    "person": pl.Struct([
        pl.Field("name", pl.String),
        pl.Field("age", pl.Int32),
        pl.Field("city", pl.String)
    ])
})

# Access struct fields
result = df.with_columns([
    pl.col("person").struct.field("name").alias("name"),
    pl.col("person").struct.field("age").alias("age")
])
```

### Type Casting and Conversion

```python
# Type casting
df = pl.DataFrame({
    "int_col": [1, 2, 3],
    "str_col": ["10", "20", "30"],
    "float_col": [1.1, 2.2, 3.3]
})

# Cast between types
result = df.with_columns([
    pl.col("int_col").cast(pl.Float64).alias("int_as_float"),
    pl.col("str_col").cast(pl.Int32).alias("str_as_int"),
    pl.col("float_col").cast(pl.String).alias("float_as_str")
])

# Safe casting with strict=False
result = df.with_columns([
    pl.col("str_col").cast(pl.Int32, strict=False).alias("safe_cast")
])
```

### Schema Validation and Overrides

```python
# Define the expected schema up front
schema = pl.Schema({
    "id": pl.Int64,
    "name": pl.String,
    "score": pl.Float64,
    "category": pl.Categorical
})

# Create DataFrame that must conform to the schema
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85.5, 92.0, 78.5],
    "category": ["A", "B", "A"]
}, schema=schema)

# Schema overrides for specific columns
df = pl.DataFrame({
    "values": [1, 2, 3]
}, schema_overrides={
    "values": pl.Int32 # Override inferred Int64 type
})
```

### Working with Categorical Data

```python
# Create categorical for memory efficiency
# (schema_overrides types one column and leaves "id" inferred)
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "category": ["Small", "Large", "Medium", "Small", "Large"]
}, schema_overrides={
    "category": pl.Categorical
})

# Enum with fixed categories (kept in a separate frame)
sizes = pl.DataFrame({
    "size": ["S", "M", "L", "S", "M"]
}, schema={
    "size": pl.Enum(["S", "M", "L", "XL"])
})

# Operations on categorical data
result = df.group_by("category").agg([
    pl.col("id").count().alias("count")
])
```

### High-Precision Decimal Arithmetic

```python
from decimal import Decimal

# Financial calculations with exact precision
# (values are Python Decimal objects rather than strings)
df = pl.DataFrame({
    "amount": [Decimal("123.456789"), Decimal("987.654321"), Decimal("555.111222")]
}, schema={
    "amount": pl.Decimal(precision=10, scale=6)
})

# Precise calculations; the multiplier is a Decimal literal, not a string
result = df.with_columns([
    (pl.col("amount") * pl.lit(Decimal("1.05"))).alias("with_tax"),
    pl.col("amount").round(2).alias("rounded")
])
```