# Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data. The Features system enables schema validation, data encoding/decoding, and seamless integration with Apache Arrow for efficient data storage.

## Capabilities

### Features Container

The main schema container that defines the internal structure of a dataset as a dictionary mapping column names to feature types.

```python { .api }
class Features(dict):
    """A special dictionary that defines the internal structure of a dataset."""

    def __init__(self, *args, **kwargs): ...

    @classmethod
    def from_arrow_schema(cls, pa_schema) -> "Features": ...

    @classmethod
    def from_dict(cls, dic) -> "Features": ...

    def to_dict(self) -> dict: ...
    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...
    def encode_batch(self, batch: dict) -> dict: ...
    def decode_batch(self, batch: dict) -> dict: ...
    def flatten(self, max_depth: int = 16) -> "Features": ...
    def copy(self) -> "Features": ...
    def reorder_fields_as(self, other: "Features") -> "Features": ...

    # Properties
    @property
    def type(self): ...  # PyArrow DataType representation

    @property
    def arrow_schema(self): ...  # PyArrow Schema with metadata
```

**Usage Examples:**

```python
from datasets import Features, Value, ClassLabel, List

# Define dataset schema
features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32')),
    'metadata': {
        'source': Value('string'),
        'confidence': Value('float64')
    }
})

# Encode data for Arrow storage
example = {'text': 'Hello world', 'label': 'positive', 'embeddings': [0.1, 0.2]}
encoded = features.encode_example(example)

# Decode data with feature-specific logic
decoded = features.decode_example(encoded)
```
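
The `copy()` and `reorder_fields_as()` methods listed in the API block are not demonstrated above. The following is a minimal sketch continuing the example (the column ordering shown in the comment is an assumption based on the method's documented purpose, not verified output):

```python
# Hypothetical sketch: aligning two schemas that hold the same columns in
# different orders, e.g. before concatenating two Arrow tables.
other = Features({
    'label': ClassLabel(names=['negative', 'positive']),
    'text': Value('string'),
    'embeddings': List(Value('float32')),
    'metadata': {
        'source': Value('string'),
        'confidence': Value('float64')
    }
})

# copy() returns an independent Features object that can be modified safely
features_copy = features.copy()

# reorder_fields_as() returns a new Features whose fields follow the order
# of the schema passed in
aligned = other.reorder_fields_as(features)
print(list(aligned))  # expected: same column order as `features`
```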

### Primitive Value Types

Feature type for scalar values with support for all Arrow data types including numeric, temporal, string, and binary types.

```python { .api }
class Value:
    """Scalar feature value of a particular data type."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...
    def __call__(self): ...  # Returns PyArrow type
    def encode_example(self, value): ...
```

**Supported Data Types:**

- **Numeric:** `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`
- **Floating:** `float16`, `float32`, `float64`
- **Temporal:** `time32[s|ms]`, `time64[us|ns]`, `timestamp[unit]`, `date32`, `date64`, `duration[unit]`
- **Decimal:** `decimal128(precision, scale)`, `decimal256(precision, scale)`
- **Binary:** `binary`, `large_binary`
- **String:** `string`, `large_string`
- **Other:** `null`, `bool`

**Usage Examples:**

```python
# Basic types
text_feature = Value('string')
integer_feature = Value('int64')
float_feature = Value('float32')
boolean_feature = Value('bool')

# Temporal types
timestamp_feature = Value('timestamp[ms]')
date_feature = Value('date32')

# High precision numbers
decimal_feature = Value('decimal128(10, 2)')
```
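
Since the API block above documents `__call__` as returning the PyArrow type and `encode_example` as the encoding hook, here is a small sketch of both (the coercion of string input to the declared dtype is an assumption; verify against your installed `datasets` version):

```python
import pyarrow as pa
from datasets import Value

int_feature = Value('int64')

# Calling the feature returns its PyArrow type (per the API block above)
assert int_feature() == pa.int64()

# encode_example coerces compatible Python values to the declared dtype
# (assumption: string digits are cast to int here)
encoded = int_feature.encode_example('42')
print(encoded)  # expected: 42
```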

### Categorical Labels

Feature type for integer class labels with automatic string-to-integer conversion and label name management.

```python { .api }
class ClassLabel:
    """Feature type for integer class labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

    def str2int(self, values: Union[str, Iterable]) -> Union[int, Iterable]: ...
    def int2str(self, values: Union[int, Iterable]) -> Union[str, Iterable]: ...
    def encode_example(self, example_data): ...
    def cast_storage(self, storage) -> pa.Int64Array: ...
```

**Usage Examples:**

```python
# Define with explicit names
sentiment = ClassLabel(names=['negative', 'neutral', 'positive'])

# Define with number of classes (creates 0, 1, 2, ...)
digits = ClassLabel(num_classes=10)

# Define from file
categories = ClassLabel(names_file='categories.txt')

# Convert between strings and integers
label_int = sentiment.str2int('positive')  # Returns 2
label_str = sentiment.int2str(2)           # Returns 'positive'

# Batch conversion
labels = sentiment.str2int(['positive', 'negative', 'positive'])  # [2, 0, 2]
```
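
The "automatic string-to-integer conversion" mentioned above happens in `encode_example` when a ClassLabel column is encoded. A brief sketch, assuming it accepts either a label name or an already-encoded integer index:

```python
from datasets import ClassLabel

sentiment = ClassLabel(names=['negative', 'neutral', 'positive'])

# A label name is converted to its integer index
print(sentiment.encode_example('neutral'))  # expected: 1

# An in-range integer is passed through unchanged (assumption)
print(sentiment.encode_example(1))          # expected: 1

# The label vocabulary is available on the feature itself
print(sentiment.names, sentiment.num_classes)
```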

### Arrays and Sequences

Feature types for list data with support for both variable-length and fixed-length sequences, including multi-dimensional arrays.

```python { .api }
class List:
    """Feature type for list data with 32-bit offsets."""

    def __init__(
        self,
        feature: Any,       # Child feature type
        length: int = -1,   # Fixed length (-1 = variable)
        id: Optional[str] = None,
    ): ...

class LargeList:
    """Feature type for large list data with 64-bit offsets."""

    def __init__(
        self,
        feature: Any,       # Child feature type
        id: Optional[str] = None,
    ): ...

class Sequence:
    """Utility for TensorFlow Datasets compatibility."""

    def __new__(cls, feature=None, length=-1, **kwargs): ...

class Array2D:
    """Create a two-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array3D:
    """Create a three-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array4D:
    """Create a four-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array5D:
    """Create a five-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...
```

**Usage Examples:**

```python
# Variable-length list of floats
embeddings = List(Value('float32'))

# Fixed-length list of 100 integers
fixed_sequence = List(Value('int32'), length=100)

# List of categorical labels
label_sequence = List(ClassLabel(names=['A', 'B', 'C']))

# Multi-dimensional arrays
image_array = Array3D(shape=(224, 224, 3), dtype='uint8')
feature_matrix = Array2D(shape=(50, 768), dtype='float32')

# Large lists for big data
large_embeddings = LargeList(Value('float64'))
```
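
The `Sequence` wrapper from the API block is not shown above. A minimal sketch follows, assuming the compatibility behavior in which a `Sequence` of a dict of features is stored as a dict of lists rather than a list of dicts:

```python
from datasets import Features, Sequence, Value

# Sequence wraps a child feature much like List
token_ids = Sequence(Value('int32'))

# Assumption: with a dict child feature, Sequence flips the nesting so the
# column is stored column-wise, e.g. {'answer_start': [...], 'text': [...]}
qa_answers = Sequence({
    'answer_start': Value('int32'),
    'text': Value('string'),
})

features = Features({'answers': qa_answers})
```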

### Audio Features

Feature type for audio data with automatic format handling and optional decoding control.

```python { .api }
class Audio:
    """Audio Feature to extract audio data from files."""

    def __init__(
        self,
        sampling_rate: Optional[int] = None,
        decode: bool = True,
        stream_index: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self) -> dict: ...
```

**Input Formats:**

- `str`: Absolute path to audio file
- `dict`: `{"path": str, "bytes": bytes}`
- `dict`: `{"array": ndarray, "sampling_rate": int}`

**Usage Examples:**

```python
# Basic audio feature
audio = Audio()

# Audio with specific sampling rate
speech = Audio(sampling_rate=16000)

# Audio without decoding (store as bytes)
raw_audio = Audio(decode=False)

# Use in dataset features
features = Features({
    'audio': Audio(sampling_rate=22050),
    'transcript': Value('string')
})
```
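
The input formats listed above are typically attached to a dataset by casting a column of paths (or arrays) to the `Audio` feature. A minimal sketch, assuming the standard `Dataset.from_dict` / `cast_column` workflow and a hypothetical file path:

```python
from datasets import Dataset, Audio

# Hypothetical path; replace with a real audio file
ds = Dataset.from_dict({'audio': ['clips/sample_0001.wav']})

# Casting the string column to Audio enables decoding on access;
# with decode=False the column would stay as {'path': ..., 'bytes': ...}
ds = ds.cast_column('audio', Audio(sampling_rate=16000))
```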

### Image Features

Feature type for image data with automatic format handling and optional PIL mode conversion.

```python { .api }
class Image:
    """Image Feature to read image data from files."""

    def __init__(
        self,
        mode: Optional[str] = None,  # PIL mode conversion
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Input Formats:**

- `str`: Absolute path to image file
- `dict`: `{"path": str, "bytes": bytes}`
- `np.ndarray`: NumPy array representing image
- `PIL.Image.Image`: PIL image object

**Usage Examples:**

```python
# Basic image feature
image = Image()

# Image with mode conversion
rgb_image = Image(mode='RGB')

# Image without decoding (store as bytes)
raw_image = Image(decode=False)

# Use in computer vision dataset
features = Features({
    'image': Image(mode='RGB'),
    'label': ClassLabel(names=['cat', 'dog']),
    'bbox': List(Value('float32'), length=4)
})
```
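
As with audio, any of the input formats above can be attached to an existing column. A minimal sketch, assuming the `cast_column` workflow and hypothetical file paths; with `decode=True` the column yields `PIL.Image.Image` objects:

```python
from datasets import Dataset, Image

# Hypothetical image paths; PIL images or NumPy arrays would also be accepted
ds = Dataset.from_dict({'image': ['images/cat_001.png', 'images/dog_001.png']})

# Cast the path column to the Image feature; decoding happens on access
ds = ds.cast_column('image', Image(mode='RGB'))

pil_image = ds[0]['image']  # a PIL.Image.Image when decode=True
```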

### Video Features

Feature type for video data with TorchCodec integration and flexible decoding options.

```python { .api }
class Video:
    """Video Feature to read video data from files."""

    def __init__(
        self,
        decode: bool = True,
        stream_index: Optional[int] = None,
        dimension_order: str = "NCHW",  # "NCHW" or "NHWC"
        num_ffmpeg_threads: int = 1,
        device: Optional[Union[str, "torch.device"]] = "cpu",
        seek_mode: str = "exact",  # "exact" or "approximate"
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value): ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic video feature
video = Video()

# Video with specific configuration
optimized_video = Video(
    dimension_order="NHWC",
    num_ffmpeg_threads=4,
    device="cuda",
    seek_mode="approximate"
)

# Video without decoding
raw_video = Video(decode=False)
```

### PDF Features

Feature type for PDF document processing with pdfplumber integration.

```python { .api }
class Pdf:
    """Pdf Feature to read PDF documents from files."""

    def __init__(
        self,
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic PDF feature
pdf = Pdf()

# PDF without decoding (store as bytes)
raw_pdf = Pdf(decode=False)

# Use in document processing dataset
features = Features({
    'document': Pdf(),
    'title': Value('string'),
    'summary': Value('string')
})
```

### Translation Features

Feature types for machine translation tasks with support for both fixed and variable language sets.

```python { .api }
class Translation:
    """Feature for translations with fixed languages per example."""

    def __init__(
        self,
        languages: List[str],
        id: Optional[str] = None,
    ): ...

    def flatten(self) -> dict: ...

class TranslationVariableLanguages:
    """Feature for translations with variable languages per example."""

    def __init__(
        self,
        languages: Optional[List] = None,
        num_languages: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, translation_dict): ...
    def flatten(self) -> dict: ...
```

**Usage Examples:**

```python
# Fixed languages translation
translation = Translation(languages=['en', 'fr', 'de'])

# Data format for fixed languages
example = {
    'en': 'the cat',
    'fr': 'le chat',
    'de': 'die katze'
}

# Variable languages translation
var_translation = TranslationVariableLanguages(languages=['en', 'fr', 'de', 'es'])

# Input format (variable number of translations per language)
variable_example = {
    'en': 'the cat',
    'fr': ['le chat', 'la chatte'],
    'de': 'die katze'
}

# Encoded output format
encoded = {
    'language': ['en', 'de', 'fr', 'fr'],
    'translation': ['the cat', 'die katze', 'la chatte', 'le chat']
}
```
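
In practice these features appear as a single column of a Features schema. A minimal sketch, assuming a column named `translation` and the fixed-language format shown above:

```python
from datasets import Features, Value, Translation

features = Features({
    'id': Value('string'),
    'translation': Translation(languages=['en', 'fr', 'de']),
})

row = {
    'id': 'ex-1',
    'translation': {'en': 'the cat', 'fr': 'le chat', 'de': 'die katze'},
}

# The translation column is stored as a struct mapping language -> string
encoded_row = features.encode_example(row)
```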

## Advanced Type System Usage

### Nested Schema Definition

```python
# Complex nested schema
features = Features({
    'metadata': {
        'id': Value('string'),
        'timestamp': Value('timestamp[ms]'),
        'source': {
            'name': Value('string'),
            'version': Value('string')
        }
    },
    'content': {
        'text': Value('string'),
        'tokens': List(Value('string')),
        'entities': List({
            'start': Value('int32'),
            'end': Value('int32'),
            'label': ClassLabel(names=['PERSON', 'ORG', 'LOC']),
            'confidence': Value('float32')
        })
    },
    'multimedia': {
        'images': List(Image()),
        'audio': Audio(sampling_rate=16000),
        'video': Video(decode=False)
    }
})
```
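
Nested schemas like this one can be flattened into top-level columns with `Features.flatten()` (listed in the Features API block earlier). A brief sketch continuing from the schema above, assuming the usual dotted column naming:

```python
flat = features.flatten()

# Nested dict fields become dotted top-level columns, e.g.
#   'metadata.id', 'metadata.timestamp', 'metadata.source.name', ...
print(sorted(flat))
```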

### Schema Conversion and Serialization

```python
# Convert to Arrow schema
arrow_schema = features.arrow_schema

# Serialize for storage
features_dict = features.to_dict()

# Reconstruct from serialization
reconstructed = Features.from_dict(features_dict)

# Reconstruct from Arrow schema
from_arrow = Features.from_arrow_schema(arrow_schema)
```

### Data Processing Pipeline

```python
# Schema matching the batch below
batch_features = Features({
    'text': Value('string'),
    'labels': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32'))
})

# Batch processing with schema
batch = {
    'text': ['Hello', 'World'],
    'labels': ['positive', 'negative'],
    'embeddings': [[0.1, 0.2], [0.3, 0.4]]
}

# Encode batch for Arrow storage
encoded_batch = batch_features.encode_batch(batch)

# Decode batch for processing
decoded_batch = batch_features.decode_batch(encoded_batch)
```
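
Since every feature maps to an Arrow type, the encoded batch can be written straight into a PyArrow table. A minimal sketch, assuming the `batch_features` schema defined above:

```python
import pyarrow as pa

# The Features object carries the matching Arrow schema (with metadata),
# so the encoded batch columns line up with the table's columns
table = pa.Table.from_pydict(encoded_batch, schema=batch_features.arrow_schema)
print(table.schema)
```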

### Performance Considerations

- **Memory Efficiency**: Prefer fixed-shape array types (e.g. `Array2D`) over nested `List` features for dense numeric data; see the sketch after this list
- **Storage Optimization**: Consider `decode=False` for multimedia when raw bytes are sufficient
- **Type Conversion**: Features handle automatic type conversion and validation
- **Arrow Integration**: All features map to Arrow types for efficient columnar storage
- **Batch Processing**: Use `encode_batch`/`decode_batch` for better performance with large datasets
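
A minimal sketch of the first point, contrasting a fixed-shape `Array2D` column with an equivalent nested `List` definition (the shape and dtype are illustrative assumptions):

```python
from datasets import Features, Array2D, List, Value

# Dense, fixed-shape numeric data: a single Array2D column keeps the
# (50, 768) shape in the schema and stores it efficiently
dense = Features({'hidden_states': Array2D(shape=(50, 768), dtype='float32')})

# The same data expressed as nested variable-length lists loses the shape
# information and is generally less compact for large dense tensors
nested = Features({'hidden_states': List(List(Value('float32')))})
```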