
# Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data. The Features system enables schema validation, data encoding/decoding, and seamless integration with Apache Arrow for efficient data storage.

## Capabilities

### Features Container

The main schema container that defines the internal structure of a dataset as a dictionary mapping column names to feature types.

```python { .api }
class Features(dict):
    """A special dictionary that defines the internal structure of a dataset."""

    def __init__(self, *args, **kwargs): ...

    @classmethod
    def from_arrow_schema(cls, pa_schema) -> "Features": ...

    @classmethod
    def from_dict(cls, dic) -> "Features": ...

    def to_dict(self) -> dict: ...
    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...
    def encode_batch(self, batch: dict) -> dict: ...
    def decode_batch(self, batch: dict) -> dict: ...
    def flatten(self, max_depth: int = 16) -> "Features": ...
    def copy(self) -> "Features": ...
    def reorder_fields_as(self, other: "Features") -> "Features": ...

    # Properties
    @property
    def type(self): ...  # PyArrow DataType representation

    @property
    def arrow_schema(self): ...  # PyArrow Schema with metadata
```

**Usage Examples:**

```python
from datasets import Features, Value, ClassLabel, List

# Define dataset schema
features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32')),
    'metadata': {
        'source': Value('string'),
        'confidence': Value('float64')
    }
})

# Encode data for Arrow storage
example = {'text': 'Hello world', 'label': 'positive', 'embeddings': [0.1, 0.2]}
encoded = features.encode_example(example)

# Decode data with feature-specific logic
decoded = features.decode_example(encoded)
```
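
The schema is usually attached when the dataset is built, so every row is validated and encoded against it. A minimal sketch, assuming the in-memory `Dataset.from_dict` constructor and the column names used above:

```python
from datasets import ClassLabel, Dataset, Features, List, Value

features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32')),
})

# Passing `features` encodes each column against the declared types,
# e.g. the string label 'positive' is stored as the integer index 1.
dataset = Dataset.from_dict(
    {
        'text': ['Hello world', 'Goodbye world'],
        'label': ['positive', 'negative'],
        'embeddings': [[0.1, 0.2], [0.3, 0.4]],
    },
    features=features,
)

print(dataset.features['label'].int2str(dataset[0]['label']))  # 'positive'
```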

### Primitive Value Types

Feature type for scalar values with support for all Arrow data types including numeric, temporal, string, and binary types.

```python { .api }
class Value:
    """Scalar feature value of a particular data type."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...
    def __call__(self): ...  # Returns PyArrow type
    def encode_example(self, value): ...
```

**Supported Data Types:**

- **Numeric:** `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`
- **Floating:** `float16`, `float32`, `float64`
- **Temporal:** `time32[s|ms]`, `time64[us|ns]`, `timestamp[unit]`, `date32`, `date64`, `duration[unit]`
- **Decimal:** `decimal128(precision, scale)`, `decimal256(precision, scale)`
- **Binary:** `binary`, `large_binary`
- **String:** `string`, `large_string`
- **Other:** `null`, `bool`

**Usage Examples:**

```python
# Basic types
text_feature = Value('string')
integer_feature = Value('int64')
float_feature = Value('float32')
boolean_feature = Value('bool')

# Temporal types
timestamp_feature = Value('timestamp[ms]')
date_feature = Value('date32')

# High precision numbers
decimal_feature = Value('decimal128(10, 2)')
```
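
Value types also come into play when changing an existing column's dtype. A small sketch, assuming an in-memory dataset and the `cast_column` method:

```python
from datasets import Dataset, Value

dataset = Dataset.from_dict({'id': [1, 2, 3], 'score': [0.5, 0.75, 1.0]})

# Narrow the default 64-bit dtypes to smaller ones.
dataset = dataset.cast_column('id', Value('int32'))
dataset = dataset.cast_column('score', Value('float32'))

print(dataset.features)  # {'id': Value('int32'), 'score': Value('float32')} (repr may vary by version)
```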

### Categorical Labels

Feature type for integer class labels with automatic string-to-integer conversion and label name management.

```python { .api }
class ClassLabel:
    """Feature type for integer class labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

    def str2int(self, values: Union[str, Iterable]) -> Union[int, Iterable]: ...
    def int2str(self, values: Union[int, Iterable]) -> Union[str, Iterable]: ...
    def encode_example(self, example_data): ...
    def cast_storage(self, storage) -> pa.Int64Array: ...
```

**Usage Examples:**

```python
# Define with explicit names
sentiment = ClassLabel(names=['negative', 'neutral', 'positive'])

# Define with number of classes (creates 0, 1, 2, ...)
digits = ClassLabel(num_classes=10)

# Define from file
categories = ClassLabel(names_file='categories.txt')

# Convert between strings and integers
label_int = sentiment.str2int('positive')  # Returns 2
label_str = sentiment.int2str(2)           # Returns 'positive'

# Batch conversion
labels = sentiment.str2int(['positive', 'negative', 'positive'])  # [2, 0, 2]
```
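
A common workflow is to start from a plain string column and cast it to `ClassLabel`. A small sketch, assuming every column value appears in `names`:

```python
from datasets import ClassLabel, Dataset

dataset = Dataset.from_dict({'text': ['great film', 'dull film'],
                             'label': ['positive', 'negative']})

# Casting a string column to ClassLabel maps each name to its integer index.
dataset = dataset.cast_column('label', ClassLabel(names=['negative', 'positive']))

print(dataset['label'])                                      # [1, 0]
print(dataset.features['label'].int2str(dataset['label']))   # ['positive', 'negative']
```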

### Arrays and Sequences

Feature types for list data with support for both variable-length and fixed-length sequences, including multi-dimensional arrays.

```python { .api }
class List:
    """Feature type for list data with 32-bit offsets."""

    def __init__(
        self,
        feature: Any,             # Child feature type
        length: int = -1,         # Fixed length (-1 = variable)
        id: Optional[str] = None,
    ): ...

class LargeList:
    """Feature type for large list data with 64-bit offsets."""

    def __init__(
        self,
        feature: Any,             # Child feature type
        id: Optional[str] = None,
    ): ...

class Sequence:
    """Utility for TensorFlow Datasets compatibility."""

    def __new__(cls, feature=None, length=-1, **kwargs): ...

class Array2D:
    """Create a two-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array3D:
    """Create a three-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array4D:
    """Create a four-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...

class Array5D:
    """Create a five-dimensional array."""

    def __init__(self, shape: tuple, dtype: str): ...
```

**Usage Examples:**

```python
# Variable-length list of floats
embeddings = List(Value('float32'))

# Fixed-length list of 100 integers
fixed_sequence = List(Value('int32'), length=100)

# List of categorical labels
label_sequence = List(ClassLabel(names=['A', 'B', 'C']))

# Multi-dimensional arrays
image_array = Array3D(shape=(224, 224, 3), dtype='uint8')
feature_matrix = Array2D(shape=(50, 768), dtype='float32')

# Large lists for big data
large_embeddings = LargeList(Value('float64'))
```
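
Fixed-shape array features pair naturally with NumPy data. A minimal sketch, assuming `Dataset.from_dict` and values passed as NumPy arrays (nested lists work as well):

```python
import numpy as np
from datasets import Array2D, Dataset, Features

features = Features({'matrix': Array2D(shape=(2, 3), dtype='float32')})

dataset = Dataset.from_dict(
    {'matrix': [np.zeros((2, 3), dtype='float32'), np.ones((2, 3), dtype='float32')]},
    features=features,
)

# Rows come back as nested lists unless an output format (e.g. 'numpy') is set.
print(np.asarray(dataset[0]['matrix']).shape)  # (2, 3)
```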

### Audio Features

Feature type for audio data with automatic format handling and optional decoding control.

```python { .api }
class Audio:
    """Audio Feature to extract audio data from files."""

    def __init__(
        self,
        sampling_rate: Optional[int] = None,
        decode: bool = True,
        stream_index: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self) -> dict: ...
```

**Input Formats:**

- `str`: Absolute path to audio file
- `dict`: `{"path": str, "bytes": bytes}`
- `dict`: `{"array": ndarray, "sampling_rate": int}`

**Usage Examples:**

```python
# Basic audio feature
audio = Audio()

# Audio with specific sampling rate
speech = Audio(sampling_rate=16000)

# Audio without decoding (store as bytes)
raw_audio = Audio(decode=False)

# Use in dataset features
features = Features({
    'audio': Audio(sampling_rate=22050),
    'transcript': Value('string')
})
```
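
An existing path column can be turned into an `Audio` column with `cast_column`. A small sketch with hypothetical file paths:

```python
from datasets import Audio, Dataset

# 'clip_0.wav' and 'clip_1.wav' are placeholder paths.
dataset = Dataset.from_dict({'audio': ['clip_0.wav', 'clip_1.wav'],
                             'transcript': ['hello', 'world']})

# After the cast, accessing a row decodes the file (resampled to 16 kHz here);
# with Audio(decode=False) only the path/bytes dict is returned.
dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))
```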

### Image Features

Feature type for image data with automatic format handling and optional PIL mode conversion.

```python { .api }
class Image:
    """Image Feature to read image data from files."""

    def __init__(
        self,
        mode: Optional[str] = None,  # PIL mode conversion
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Input Formats:**

- `str`: Absolute path to image file
- `dict`: `{"path": str, "bytes": bytes}`
- `np.ndarray`: NumPy array representing an image
- `PIL.Image.Image`: PIL image object

**Usage Examples:**

```python
# Basic image feature
image = Image()

# Image with mode conversion
rgb_image = Image(mode='RGB')

# Image without decoding (store as bytes)
raw_image = Image(decode=False)

# Use in computer vision dataset
features = Features({
    'image': Image(mode='RGB'),
    'label': ClassLabel(names=['cat', 'dog']),
    'bbox': List(Value('float32'), length=4)
})
```
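
The decode flag matters most for throughput. A small sketch with hypothetical image paths, contrasting decoded and raw access after `cast_column`:

```python
from datasets import Dataset, Image

# 'img_0.png' and 'img_1.png' are placeholder paths.
dataset = Dataset.from_dict({'image': ['img_0.png', 'img_1.png']})

# Decoded access returns PIL images (converted to RGB here).
pil_dataset = dataset.cast_column('image', Image(mode='RGB'))

# decode=False keeps {'path', 'bytes'} dicts, which is cheaper when the raw
# files are only being copied, filtered, or uploaded.
raw_dataset = dataset.cast_column('image', Image(decode=False))
```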

### Video Features

Feature type for video data with TorchCodec integration and flexible decoding options.

```python { .api }
class Video:
    """Video Feature to read video data from files."""

    def __init__(
        self,
        decode: bool = True,
        stream_index: Optional[int] = None,
        dimension_order: str = "NCHW",  # "NCHW" or "NHWC"
        num_ffmpeg_threads: int = 1,
        device: Optional[Union[str, "torch.device"]] = "cpu",
        seek_mode: str = "exact",  # "exact" or "approximate"
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value): ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic video feature
video = Video()

# Video with specific configuration
optimized_video = Video(
    dimension_order="NHWC",
    num_ffmpeg_threads=4,
    device="cuda",
    seek_mode="approximate"
)

# Video without decoding
raw_video = Video(decode=False)
```

### PDF Features

Feature type for PDF document processing with pdfplumber integration.

```python { .api }
class Pdf:
    """Pdf Feature to read PDF documents from files."""

    def __init__(
        self,
        decode: bool = True,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, value) -> dict: ...
    def decode_example(self, value, token_per_repo_id=None): ...
    def cast_storage(self, storage) -> pa.StructArray: ...
    def embed_storage(self, storage) -> pa.StructArray: ...
    def flatten(self): ...
```

**Usage Examples:**

```python
# Basic PDF feature
pdf = Pdf()

# PDF without decoding (store as bytes)
raw_pdf = Pdf(decode=False)

# Use in document processing dataset
features = Features({
    'document': Pdf(),
    'title': Value('string'),
    'summary': Value('string')
})
```

### Translation Features

Feature types for machine translation tasks with support for both fixed and variable language sets.

```python { .api }
class Translation:
    """Feature for translations with fixed languages per example."""

    def __init__(
        self,
        languages: List[str],
        id: Optional[str] = None,
    ): ...

    def flatten(self) -> dict: ...

class TranslationVariableLanguages:
    """Feature for translations with variable languages per example."""

    def __init__(
        self,
        languages: Optional[List] = None,
        num_languages: Optional[int] = None,
        id: Optional[str] = None,
    ): ...

    def encode_example(self, translation_dict): ...
    def flatten(self) -> dict: ...
```

**Usage Examples:**

```python
# Fixed languages translation
translation = Translation(languages=['en', 'fr', 'de'])

# Data format for fixed languages
example = {
    'en': 'the cat',
    'fr': 'le chat',
    'de': 'die katze'
}

# Variable languages translation
var_translation = TranslationVariableLanguages(languages=['en', 'fr', 'de', 'es'])

# Input format (variable number of translations per language)
variable_example = {
    'en': 'the cat',
    'fr': ['le chat', 'la chatte'],
    'de': 'die katze'
}

# Encoded output format
encoded = {
    'language': ['en', 'de', 'fr', 'fr'],
    'translation': ['the cat', 'die katze', 'la chatte', 'le chat']
}
```
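
In a dataset schema, a `Translation` column holds one dictionary of strings per example. A minimal sketch, assuming `Translation` is importable from the top-level `datasets` package:

```python
from datasets import Dataset, Features, Translation, Value

features = Features({
    'id': Value('string'),
    'translation': Translation(languages=['en', 'fr', 'de']),
})

dataset = Dataset.from_dict(
    {
        'id': ['ex-0'],
        'translation': [{'en': 'the cat', 'fr': 'le chat', 'de': 'die Katze'}],
    },
    features=features,
)

print(dataset[0]['translation'])  # one string per declared language
```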

## Advanced Type System Usage

### Nested Schema Definition

```python
# Complex nested schema
features = Features({
    'metadata': {
        'id': Value('string'),
        'timestamp': Value('timestamp[ms]'),
        'source': {
            'name': Value('string'),
            'version': Value('string')
        }
    },
    'content': {
        'text': Value('string'),
        'tokens': List(Value('string')),
        'entities': List({
            'start': Value('int32'),
            'end': Value('int32'),
            'label': ClassLabel(names=['PERSON', 'ORG', 'LOC']),
            'confidence': Value('float32')
        })
    },
    'multimedia': {
        'images': List(Image()),
        'audio': Audio(sampling_rate=16000),
        'video': Video(decode=False)
    }
})
```
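
Nested dictionaries can be expanded into top-level columns with `flatten()`. A short sketch, assuming the `features` object defined above:

```python
flat = features.flatten()

# Nested fields become dotted column names,
# e.g. 'metadata.id', 'metadata.source.name', 'content.text'.
print(sorted(flat.keys()))
```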

### Schema Conversion and Serialization

```python
# Convert to Arrow schema
arrow_schema = features.arrow_schema

# Serialize for storage
features_dict = features.to_dict()

# Reconstruct from serialization
reconstructed = Features.from_dict(features_dict)

# Reconstruct from Arrow schema
from_arrow = Features.from_arrow_schema(arrow_schema)
```

### Data Processing Pipeline

```python
# Schema matching the batch below
features = Features({
    'text': Value('string'),
    'labels': ClassLabel(names=['negative', 'positive']),
    'embeddings': List(Value('float32'))
})

# Batch processing with schema
batch = {
    'text': ['Hello', 'World'],
    'labels': ['positive', 'negative'],
    'embeddings': [[0.1, 0.2], [0.3, 0.4]]
}

# Encode batch for Arrow storage
encoded_batch = features.encode_batch(batch)

# Decode batch for processing
decoded_batch = features.decode_batch(encoded_batch)
```

### Performance Considerations

- **Memory Efficiency**: Use appropriate array types (Array2D vs List) for structured data
- **Storage Optimization**: Consider `decode=False` for multimedia when raw bytes are sufficient
- **Type Conversion**: Features handle automatic type conversion and validation
- **Arrow Integration**: All features map to Arrow types for efficient columnar storage
- **Batch Processing**: Use `encode_batch`/`decode_batch` for better performance with large datasets
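
A short sketch of the storage and batch points above, with placeholder image paths:

```python
from datasets import Dataset, Image

# 'img_0.png' etc. are placeholder paths.
dataset = Dataset.from_dict({'image': [f'img_{i}.png' for i in range(1000)]})

# Storage optimization: keep path/bytes instead of decoding to PIL images.
raw = dataset.cast_column('image', Image(decode=False))

# Batch processing: encode whole columns at once rather than per example.
encoded = raw.features.encode_batch({'image': ['img_1000.png', 'img_1001.png']})
```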