# Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data. The Features system enables schema validation, data encoding/decoding, and seamless integration with Apache Arrow for efficient data storage.

## Capabilities

### Features Container

The main schema container that defines the internal structure of a dataset as a dictionary mapping column names to feature types.

```python { .api }
class Features(dict):
    """A special dictionary that defines the internal structure of a dataset."""

    def __init__(self, *args, **kwargs): ...

    @classmethod
    def from_arrow_schema(cls, pa_schema) -> "Features": ...

    @classmethod
    def from_dict(cls, dic) -> "Features": ...

    def to_dict(self) -> dict: ...
    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...
    def encode_batch(self, batch: dict) -> dict: ...
    def decode_batch(self, batch: dict) -> dict: ...
    def flatten(self, max_depth: int = 16) -> "Features": ...
    def copy(self) -> "Features": ...
    def reorder_fields_as(self, other: "Features") -> "Features": ...

    # Properties
    @property
    def type(self): ...  # PyArrow DataType representation

    @property
    def arrow_schema(self): ...  # PyArrow Schema with metadata
```

**Usage Examples:**

```python
from datasets import Features, Value, ClassLabel, List

# Define dataset schema
features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32')),
    'metadata': {
        'source': Value('string'),
        'confidence': Value('float64')
    }
})

# Encode data for Arrow storage
example = {'text': 'Hello world', 'label': 'positive', 'embeddings': [0.1, 0.2]}
encoded = features.encode_example(example)

# Decode data with feature-specific logic
decoded = features.decode_example(encoded)
```
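
The `copy()` and `reorder_fields_as()` methods listed in the API block are not demonstrated above. The following is a minimal sketch continuing the example (the column ordering shown in the comment is an assumption based on the method's documented purpose, not verified output):

```python
# Hypothetical sketch: aligning two schemas that hold the same columns in
# different orders, e.g. before concatenating two Arrow tables.
other = Features({
    'label': ClassLabel(names=['negative', 'positive']),
    'text': Value('string'),
    'embeddings': List(Value('float32')),
    'metadata': {
        'source': Value('string'),
        'confidence': Value('float64')
    }
})

# copy() returns an independent Features object that can be modified safely
features_copy = features.copy()

# reorder_fields_as() returns a new Features whose fields follow the order
# of the schema passed in
aligned = other.reorder_fields_as(features)
print(list(aligned))  # expected: same column order as `features`
```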

### Primitive Value Types

Feature type for scalar values with support for all Arrow data types including numeric, temporal, string, and binary types.

```python { .api }
class Value:
    """Scalar feature value of a particular data type."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...
    def __call__(self): ...  # Returns PyArrow type
    def encode_example(self, value): ...
```

**Supported Data Types:**

- **Numeric:** `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`
- **Floating:** `float16`, `float32`, `float64`
- **Temporal:** `time32[s|ms]`, `time64[us|ns]`, `timestamp[unit]`, `date32`, `date64`, `duration[unit]`
- **Decimal:** `decimal128(precision, scale)`, `decimal256(precision, scale)`
- **Binary:** `binary`, `large_binary`
- **String:** `string`, `large_string`
- **Other:** `null`, `bool`

**Usage Examples:**

```python
# Basic types
text_feature = Value('string')
integer_feature = Value('int64')
float_feature = Value('float32')
boolean_feature = Value('bool')

# Temporal types
timestamp_feature = Value('timestamp[ms]')
date_feature = Value('date32')

# High precision numbers
decimal_feature = Value('decimal128(10, 2)')
```
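
Since the API block above documents `__call__` as returning the PyArrow type and `encode_example` as the encoding hook, here is a small sketch of both (the coercion of string input to the declared dtype is an assumption; verify against your installed `datasets` version):

```python
import pyarrow as pa
from datasets import Value

int_feature = Value('int64')

# Calling the feature returns its PyArrow type (per the API block above)
assert int_feature() == pa.int64()

# encode_example coerces compatible Python values to the declared dtype
# (assumption: string digits are cast to int here)
encoded = int_feature.encode_example('42')
print(encoded)  # expected: 42
```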

### Categorical Labels

Feature type for integer class labels with automatic string-to-integer conversion and label name management.

```python { .api }
class ClassLabel:
    """Feature type for integer class labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

    def str2int(self, values: Union[str, Iterable]) -> Union[int, Iterable]: ...
    def int2str(self, values: Union[int, Iterable]) -> Union[str, Iterable]: ...
    def encode_example(self, example_data): ...
    def cast_storage(self, storage) -> pa.Int64Array: ...
```

**Usage Examples:**

```python
# Define with explicit names
sentiment = ClassLabel(names=['negative', 'neutral', 'positive'])

# Define with number of classes (creates 0, 1, 2, ...)
digits = ClassLabel(num_classes=10)

# Define from file
categories = ClassLabel(names_file='categories.txt')

# Convert between strings and integers
label_int = sentiment.str2int('positive')  # Returns 2
label_str = sentiment.int2str(2)           # Returns 'positive'

# Batch conversion
labels = sentiment.str2int(['positive', 'negative', 'positive'])  # [2, 0, 2]
```
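
The "automatic string-to-integer conversion" mentioned above happens in `encode_example` when a ClassLabel column is encoded. A brief sketch, assuming it accepts either a label name or an already-encoded integer index:

```python
from datasets import ClassLabel

sentiment = ClassLabel(names=['negative', 'neutral', 'positive'])

# A label name is converted to its integer index
print(sentiment.encode_example('neutral'))  # expected: 1

# An in-range integer is passed through unchanged (assumption)
print(sentiment.encode_example(1))          # expected: 1

# The label vocabulary is available on the feature itself
print(sentiment.names, sentiment.num_classes)
```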

### Arrays and Sequences

Feature types for list data with support for both variable-length and fixed-length sequences, including multi-dimensional arrays.

```python { .api }
class List:
    """Feature type for list data with 32-bit offsets."""

    def __init__(
        self,
        feature: Any,       # Child feature type
        length: int = -1,   # Fixed length (-1 = variable)
        id: Optional[str] = None,
    ): ...

class LargeList:
    """Feature type for large list data with 64-bit offsets."""

    def __init__(
        self,
        feature: Any,       # Child feature type
        id: Optional[str] = None,
    ): ...

class Sequence:
    """Utility for TensorFlow Datasets compatibility."""

    def __new__(cls, feature=None, length=-1, **kwargs): ...

class Array2D:
    """Create a two-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array3D:
    """Create a three-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array4D:
    """Create a four-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array5D:
    """Create a five-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...
```

**Usage Examples:**

```python
# Variable-length list of floats
embeddings = List(Value('float32'))

# Fixed-length list of 100 integers
fixed_sequence = List(Value('int32'), length=100)

# List of categorical labels
label_sequence = List(ClassLabel(names=['A', 'B', 'C']))

# Multi-dimensional arrays
image_array = Array3D(shape=(224, 224, 3), dtype='uint8')
feature_matrix = Array2D(shape=(50, 768), dtype='float32')

# Large lists for big data
large_embeddings = LargeList(Value('float64'))
```
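
The `Sequence` wrapper from the API block is not shown above. A minimal sketch follows, assuming the compatibility behavior in which a `Sequence` of a dict of features is stored as a dict of lists rather than a list of dicts:

```python
from datasets import Features, Sequence, Value

# Sequence wraps a child feature much like List
token_ids = Sequence(Value('int32'))

# Assumption: with a dict child feature, Sequence flips the nesting so the
# column is stored column-wise, e.g. {'answer_start': [...], 'text': [...]}
qa_answers = Sequence({
    'answer_start': Value('int32'),
    'text': Value('string'),
})

features = Features({'answers': qa_answers})
```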

### Audio Features

Feature type for audio data with automatic format handling and optional decoding control.

```python { .api }
class Audio:
    """Audio Feature to extract audio data from files."""

    def __init__(
        self,
        sampling_rate: Optional[int] = None,
        decode: bool = True,
        stream_index: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self) -> dict: ...
```

**Input Formats:**

- `str`: Absolute path to audio file
- `dict`: `{"path": str, "bytes": bytes}`
- `dict`: `{"array": ndarray, "sampling_rate": int}`

**Usage Examples:**

```python
# Basic audio feature
audio = Audio()

# Audio with specific sampling rate
speech = Audio(sampling_rate=16000)

# Audio without decoding (store as bytes)
raw_audio = Audio(decode=False)

# Use in dataset features
features = Features({
    'audio': Audio(sampling_rate=22050),
    'transcript': Value('string')
})
```
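
The input formats listed above are typically attached to a dataset by casting a column of paths (or arrays) to the `Audio` feature. A minimal sketch, assuming the standard `Dataset.from_dict` / `cast_column` workflow and a hypothetical file path:

```python
from datasets import Dataset, Audio

# Hypothetical path; replace with a real audio file
ds = Dataset.from_dict({'audio': ['clips/sample_0001.wav']})

# Casting the string column to Audio enables decoding on access;
# with decode=False the column would stay as {'path': ..., 'bytes': ...}
ds = ds.cast_column('audio', Audio(sampling_rate=16000))
```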

### Image Features

Feature type for image data with automatic format handling and optional PIL mode conversion.

```python { .api }
class Image:
    """Image Feature to read image data from files."""

    def __init__(
        self,
        mode: Optional[str] = None,  # PIL mode conversion
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Input Formats:**

- `str`: Absolute path to image file
- `dict`: `{"path": str, "bytes": bytes}`
- `np.ndarray`: NumPy array representing image
- `PIL.Image.Image`: PIL image object

**Usage Examples:**

```python
# Basic image feature
image = Image()

# Image with mode conversion
rgb_image = Image(mode='RGB')

# Image without decoding (store as bytes)
raw_image = Image(decode=False)

# Use in computer vision dataset
features = Features({
    'image': Image(mode='RGB'),
    'label': ClassLabel(names=['cat', 'dog']),
    'bbox': List(Value('float32'), length=4)
})
```
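
As with audio, any of the input formats above can be attached to an existing column. A minimal sketch, assuming the `cast_column` workflow and hypothetical file paths; with `decode=True` the column yields `PIL.Image.Image` objects:

```python
from datasets import Dataset, Image

# Hypothetical image paths; PIL images or NumPy arrays would also be accepted
ds = Dataset.from_dict({'image': ['images/cat_001.png', 'images/dog_001.png']})

# Cast the path column to the Image feature; decoding happens on access
ds = ds.cast_column('image', Image(mode='RGB'))

pil_image = ds[0]['image']  # a PIL.Image.Image when decode=True
```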

### Video Features

Feature type for video data with TorchCodec integration and flexible decoding options.

```python { .api }
class Video:
    """Video Feature to read video data from files."""

    def __init__(
        self,
        decode: bool = True,
        stream_index: Optional[int] = None,
        dimension_order: str = "NCHW",  # "NCHW" or "NHWC"
        num_ffmpeg_threads: int = 1,
        device: Optional[Union[str, "torch.device"]] = "cpu",
        seek_mode: str = "exact",  # "exact" or "approximate"
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value): ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic video feature
video = Video()

# Video with specific configuration
optimized_video = Video(
    dimension_order="NHWC",
    num_ffmpeg_threads=4,
    device="cuda",
    seek_mode="approximate"
)

# Video without decoding
raw_video = Video(decode=False)
```

### PDF Features

Feature type for PDF document processing with pdfplumber integration.

```python { .api }
class Pdf:
    """Pdf Feature to read PDF documents from files."""

    def __init__(
        self,
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic PDF feature
pdf = Pdf()

# PDF without decoding (store as bytes)
raw_pdf = Pdf(decode=False)

# Use in document processing dataset
features = Features({
    'document': Pdf(),
    'title': Value('string'),
    'summary': Value('string')
})
```

### Translation Features

Feature types for machine translation tasks with support for both fixed and variable language sets.

```python { .api }
class Translation:
    """Feature for translations with fixed languages per example."""

    def __init__(
        self,
        languages: List[str],
        id: Optional[str] = None,
    ): ...

    def flatten(self) -> dict: ...

class TranslationVariableLanguages:
    """Feature for translations with variable languages per example."""

    def __init__(
        self,
        languages: Optional[List] = None,
        num_languages: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, translation_dict): ...
    def flatten(self) -> dict: ...
```

**Usage Examples:**

```python
# Fixed languages translation
translation = Translation(languages=['en', 'fr', 'de'])

# Data format for fixed languages
example = {
    'en': 'the cat',
    'fr': 'le chat',
    'de': 'die katze'
}

# Variable languages translation
var_translation = TranslationVariableLanguages(languages=['en', 'fr', 'de', 'es'])

# Input format (variable number of translations per language)
variable_example = {
    'en': 'the cat',
    'fr': ['le chat', 'la chatte'],
    'de': 'die katze'
}

# Encoded output format
encoded = {
    'language': ['en', 'de', 'fr', 'fr'],
    'translation': ['the cat', 'die katze', 'la chatte', 'le chat']
}
```
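
In practice these features appear as a single column of a Features schema. A minimal sketch, assuming a column named `translation` and the fixed-language format shown above:

```python
from datasets import Features, Value, Translation

features = Features({
    'id': Value('string'),
    'translation': Translation(languages=['en', 'fr', 'de']),
})

row = {
    'id': 'ex-1',
    'translation': {'en': 'the cat', 'fr': 'le chat', 'de': 'die katze'},
}

# The translation column is stored as a struct mapping language -> string
encoded_row = features.encode_example(row)
```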

## Advanced Type System Usage

### Nested Schema Definition

```python
# Complex nested schema
features = Features({
    'metadata': {
        'id': Value('string'),
        'timestamp': Value('timestamp[ms]'),
        'source': {
            'name': Value('string'),
            'version': Value('string')
        }
    },
    'content': {
        'text': Value('string'),
        'tokens': List(Value('string')),
        'entities': List({
            'start': Value('int32'),
            'end': Value('int32'),
            'label': ClassLabel(names=['PERSON', 'ORG', 'LOC']),
            'confidence': Value('float32')
        })
    },
    'multimedia': {
        'images': List(Image()),
        'audio': Audio(sampling_rate=16000),
        'video': Video(decode=False)
    }
})
```
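
Nested schemas like this one can be flattened into top-level columns with `Features.flatten()` (listed in the Features API block earlier). A brief sketch continuing from the schema above, assuming the usual dotted column naming:

```python
flat = features.flatten()

# Nested dict fields become dotted top-level columns, e.g.
#   'metadata.id', 'metadata.timestamp', 'metadata.source.name', ...
print(sorted(flat))
```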

### Schema Conversion and Serialization

```python
# Convert to Arrow schema
arrow_schema = features.arrow_schema

# Serialize for storage
features_dict = features.to_dict()

# Reconstruct from serialization
reconstructed = Features.from_dict(features_dict)

# Reconstruct from Arrow schema
from_arrow = Features.from_arrow_schema(arrow_schema)
```

### Data Processing Pipeline

```python
# Schema matching the batch below
batch_features = Features({
    'text': Value('string'),
    'labels': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32'))
})

# Batch processing with schema
batch = {
    'text': ['Hello', 'World'],
    'labels': ['positive', 'negative'],
    'embeddings': [[0.1, 0.2], [0.3, 0.4]]
}

# Encode batch for Arrow storage
encoded_batch = batch_features.encode_batch(batch)

# Decode batch for processing
decoded_batch = batch_features.decode_batch(encoded_batch)
```
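
Since every feature maps to an Arrow type, the encoded batch can be written straight into a PyArrow table. A minimal sketch, assuming the `batch_features` schema defined above:

```python
import pyarrow as pa

# The Features object carries the matching Arrow schema (with metadata),
# so the encoded batch columns line up with the table's columns
table = pa.Table.from_pydict(encoded_batch, schema=batch_features.arrow_schema)
print(table.schema)
```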

### Performance Considerations

- **Memory Efficiency**: Prefer fixed-shape array types (e.g. `Array2D`) over nested `List` features for dense numeric data; see the sketch after this list
- **Storage Optimization**: Consider `decode=False` for multimedia when raw bytes are sufficient
- **Type Conversion**: Features handle automatic type conversion and validation
- **Arrow Integration**: All features map to Arrow types for efficient columnar storage
- **Batch Processing**: Use `encode_batch`/`decode_batch` for better performance with large datasets
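
A minimal sketch of the first point, contrasting a fixed-shape `Array2D` column with an equivalent nested `List` definition (the shape and dtype are illustrative assumptions):

```python
from datasets import Features, Array2D, List, Value

# Dense, fixed-shape numeric data: a single Array2D column keeps the
# (50, 768) shape in the schema and stores it efficiently
dense = Features({'hidden_states': Array2D(shape=(50, 768), dtype='float32')})

# The same data expressed as nested variable-length lists loses the shape
# information and is generally less compact for large dense tensors
nested = Features({'hidden_states': List(List(Value('float32')))})
```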