# Polars u64-idx

Polars is a blazingly fast DataFrame library optimized for performance and memory efficiency. This variant provides 64-bit index support, enabling analysis of datasets with more than about 4.2 billion rows. Built in Rust on the Apache Arrow columnar format, it features lazy and eager execution, multi-threading, SIMD vectorization, query optimization, and hybrid streaming for larger-than-RAM datasets.

## Package Information

- **Package Name**: polars-u64-idx
- **Language**: Python
- **Installation**: `pip install polars-u64-idx`

## Core Imports

The package is installed as `polars-u64-idx` but imported as plain `polars`, the same module name as the standard distribution, so only one of the two should be installed in an environment:

```python
import polars as pl
```

For specific functionality:

```python
# Core data structures
from polars import DataFrame, Series, LazyFrame

# Data types
from polars import Int64, Float64, String, Date, Datetime

# Functions and expressions
from polars import col, lit, when, concat
```

## Basic Usage

```python
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "London", "Tokyo"],
})

# Basic operations
result = (
    df.filter(pl.col("age") > 28)
    .select([
        pl.col("name"),
        pl.col("age"),
        pl.col("city").alias("location"),
    ])
    .sort("age")
)

print(result)

# Lazy evaluation for larger datasets
lazy_df = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("amount") > 1000)
    .group_by("category")
    .agg([
        pl.col("amount").sum().alias("total_amount"),
        pl.col("id").count().alias("count"),
    ])
)

# Execute the lazy computation
result = lazy_df.collect()
```

## Architecture

Polars uses a columnar data model built on Apache Arrow with several key components:

- **DataFrame/Series**: Eager data structures for immediate computation
- **LazyFrame**: Deferred evaluation with query optimization for better performance
- **Expressions**: Composable operations on columns (the `Expr` class)
- **Data Types**: Comprehensive type system with 20+ types, including nested types
- **I/O Engine**: Native support for 10+ file formats, with lazy scanning capabilities
- **Query Engine**: Rust-based OLAP engine with predicate pushdown, projection pushdown, and streaming

The 64-bit index variant removes the ~4.2 billion row limit of standard Polars, making it suitable for very large datasets while keeping the same API and performance characteristics.

## Capabilities

### Core Data Structures

Primary data structures for working with tabular data: eager DataFrame/Series for immediate operations and LazyFrame for optimized query execution.

```python { .api }
class DataFrame:
    def __init__(self, data=None, schema=None, schema_overrides=None, orient=None, infer_schema_length=N_INFER_DEFAULT, nan_to_null=False): ...
    def select(self, *exprs, **named_exprs) -> DataFrame: ...
    def filter(self, *predicates, **constraints) -> DataFrame: ...
    def with_columns(self, *exprs, **named_exprs) -> DataFrame: ...
    def group_by(self, *by, maintain_order=False, **named_by) -> GroupBy: ...
    def sort(self, by, *, descending=False, nulls_last=False, multithreaded=True) -> DataFrame: ...
    def join(self, other, on=None, how="inner", *, left_on=None, right_on=None, suffix="_right", validate="m:m", join_nulls=False, coalesce=None) -> DataFrame: ...

class Series:
    def __init__(self, name=None, values=None, dtype=None, strict=True, nan_to_null=False, dtype_if_empty=Null): ...

class LazyFrame:
    def select(self, *exprs, **named_exprs) -> LazyFrame: ...
    def filter(self, *predicates, **constraints) -> LazyFrame: ...
    def collect(self, *, type_coercion=True, predicate_pushdown=True, projection_pushdown=True, simplify_expression=True, slice_pushdown=True, comm_subplan_elim=True, comm_subexpr_elim=True, cluster_with_columns=True, no_optimization=False, streaming=False, background=False, _eager=False) -> DataFrame: ...
```

[Core Data Structures](./core-data-structures.md)

### Expressions and Column Operations

A composable expression system for column transformations, aggregations, and complex operations that works across DataFrame and LazyFrame.

```python { .api }
class Expr:
    def alias(self, name: str) -> Expr: ...
    def cast(self, dtype: DataType | type[Any], *, strict: bool = True) -> Expr: ...
    def filter(self, predicate: Expr) -> Expr: ...
    def sort(self, *, descending: bool = False, nulls_last: bool = False) -> Expr: ...
    def sum(self) -> Expr: ...
    def mean(self) -> Expr: ...
    def max(self) -> Expr: ...
    def min(self) -> Expr: ...
    def count(self) -> Expr: ...

def col(name: str | DataType) -> Expr: ...
def lit(value: Any, dtype: DataType | None = None) -> Expr: ...
def when(predicate: Expr) -> When: ...
```

[Expressions and Column Operations](./expressions.md)

### Data Types and Schema

Comprehensive type system with numeric, text, temporal, and nested types, plus schema definition and validation capabilities.

```python { .api }
# Numeric types
class Int8: ...
class Int16: ...
class Int32: ...
class Int64: ...
class Int128: ...
class UInt8: ...
class UInt16: ...
class UInt32: ...
class UInt64: ...
class Float32: ...
class Float64: ...
class Decimal: ...

# Text types
class String: ...
class Binary: ...

# Temporal types
class Date: ...
class Datetime: ...
class Time: ...
class Duration: ...

# Special types
class Boolean: ...
class Categorical: ...
class Enum: ...
class List: ...
class Array: ...
class Struct: ...

class Schema:
    def __init__(self, schema: Mapping[str, DataType] | Iterable[tuple[str, DataType]] | None = None): ...
```

[Data Types and Schema](./data-types.md)

### I/O Operations

I/O support for 10+ file formats, with both eager reading and lazy scanning for performance.

```python { .api }
# CSV
def read_csv(source: str | Path | IO[str] | IO[bytes] | bytes, **kwargs) -> DataFrame: ...
def scan_csv(source: str | Path | list[str] | list[Path], **kwargs) -> LazyFrame: ...

# Parquet
def read_parquet(source: str | Path | IO[bytes] | bytes, **kwargs) -> DataFrame: ...
def scan_parquet(source: str | Path | list[str] | list[Path], **kwargs) -> LazyFrame: ...

# JSON
def read_json(source: str | Path | IO[str] | IO[bytes] | bytes, **kwargs) -> DataFrame: ...
def read_ndjson(source: str | Path | IO[str] | IO[bytes] | bytes, **kwargs) -> DataFrame: ...

# Database
def read_database(query: str, connection: str | ConnectionOrCursor, **kwargs) -> DataFrame: ...

# Excel
def read_excel(source: str | Path | IO[bytes] | bytes, **kwargs) -> DataFrame: ...
```

[I/O Operations](./io-operations.md)

### Functions and Utilities

Built-in functions for aggregation, transformation, date/time handling, and string manipulation, plus general utilities.

```python { .api }
# Aggregation functions
def sum(*exprs) -> Expr: ...
def mean(*exprs) -> Expr: ...
def max(*exprs) -> Expr: ...
def min(*exprs) -> Expr: ...
def count(*exprs) -> Expr: ...
def all(*exprs) -> Expr: ...
def any(*exprs) -> Expr: ...

# Date/time functions
def date(year: int | Expr, month: int | Expr, day: int | Expr) -> Expr: ...
def datetime(year: int | Expr, month: int | Expr, day: int | Expr, hour: int | Expr = 0, minute: int | Expr = 0, second: int | Expr = 0, microsecond: int | Expr = 0, *, time_unit: TimeUnit = "us", time_zone: str | None = None) -> Expr: ...
def date_range(start: date | datetime | IntoExpr, end: date | datetime | IntoExpr, interval: str | timedelta = "1d", *, closed: ClosedInterval = "both", time_unit: TimeUnit | None = None, time_zone: str | None = None, eager: bool = False) -> Expr | Series: ...

# String functions
def concat_str(exprs: IntoExpr, *, separator: str = "", ignore_nulls: bool = False) -> Expr: ...
```

[Functions and Utilities](./functions.md)

### SQL Interface

SQL query interface allowing standard SQL operations on DataFrames and integration with existing SQL workflows.

```python { .api }
class SQLContext:
    def __init__(self, frames: dict[str, DataFrame | LazyFrame] | None = None, **named_frames: DataFrame | LazyFrame): ...
    def execute(self, query: str, *, eager: bool = True) -> DataFrame | LazyFrame: ...
    def register(self, name: str, frame: DataFrame | LazyFrame) -> SQLContext: ...
    def unregister(self, name: str) -> SQLContext: ...

def sql(query: str, *, eager: bool = True, **named_frames: DataFrame | LazyFrame) -> DataFrame | LazyFrame: ...
```

[SQL Interface](./sql-interface.md)

## Error Handling

Polars provides a comprehensive exception hierarchy for different error scenarios:

```python { .api }
# Core exceptions
class PolarsError(Exception): ...
class ColumnNotFoundError(PolarsError): ...
class ComputeError(PolarsError): ...
class DuplicateError(PolarsError): ...
class InvalidOperationError(PolarsError): ...
class NoDataError(PolarsError): ...
class OutOfBoundsError(PolarsError): ...
class PanicException(PolarsError): ...
class SchemaError(PolarsError): ...
class SchemaFieldNotFoundError(PolarsError): ...
class ShapeError(PolarsError): ...
class SQLInterfaceError(PolarsError): ...
class SQLSyntaxError(PolarsError): ...

# Warnings
class PolarsWarning(Exception): ...
class PerformanceWarning(PolarsWarning): ...
```

Any operation can raise these exceptions when it encounters invalid data, a schema mismatch, or a computational error; production code should handle them explicitly.