# Deep Lake

Deep Lake is a database for AI powered by a storage format optimized for deep-learning applications. It provides comprehensive dataset management, querying capabilities, and seamless integration with popular ML frameworks, enabling both data storage/retrieval for LLM applications and dataset management for deep learning model training.

## Package Information

- **Package Name**: deeplake
- **Language**: Python
- **Installation**: `pip install deeplake`

## Core Imports

```python
import deeplake
```

Common type imports:

```python
from deeplake import types
from deeplake.types import Image, Text, Embedding, Array
```

Schema template imports:

```python
from deeplake.schemas import TextEmbeddings, COCOImages
```

## Basic Usage

```python
import deeplake

# Create a new dataset
dataset = deeplake.create("./my_dataset")

# Add columns with types
dataset.add_column("images", deeplake.types.Image())
dataset.add_column("labels", deeplake.types.Text())
dataset.add_column("embeddings", deeplake.types.Embedding(size=768))

# Append data
dataset.append({
    "images": "path/to/image.jpg",
    "labels": "cat",
    "embeddings": [0.1, 0.2, 0.3, ...]  # 768-dimensional vector
})

# Commit changes
dataset.commit("Added initial data")

# Query data using TQL (Tensor Query Language); the FROM clause references the dataset path
results = deeplake.query("SELECT * FROM \"./my_dataset\" WHERE labels == 'cat'")
for row in results:
    print(row["labels"].text())

# Open existing dataset
dataset = deeplake.open("./my_dataset")
print(f"Dataset has {len(dataset)} rows")

# Framework integration (my_transform is a user-defined callable)
pytorch_dataloader = dataset.pytorch(transform=my_transform)
tensorflow_dataset = dataset.tensorflow()
```

## Architecture

Deep Lake's architecture centers around datasets as the primary abstraction, with the following key components:

- **Dataset/DatasetView**: Core data containers supporting CRUD operations, version control, and framework integration
- **Column/ColumnView**: Typed columns storing homogeneous data with optional indexing for performance
- **Row/RowView**: Individual record access with dictionary-like interfaces
- **Schema**: Type definitions and column specifications for data validation
- **Type System**: Rich type hierarchy supporting ML data types (Image, Embedding, Video, etc.)
- **Storage Layer**: Multi-cloud storage abstraction with built-in compression and lazy loading
- **Query Engine**: TQL (Tensor Query Language) for complex data filtering and aggregation
- **Version Control**: Git-like branching, tagging, and commit history for dataset evolution

This design enables Deep Lake to handle data of any size in a serverless manner while maintaining unified access through a single API, supporting all data types (embeddings, audio, text, videos, images, PDFs, annotations) with data versioning and lineage capabilities.

## Capabilities

### Dataset Management

Core functionality for creating, opening, deleting, and copying datasets with support for various storage backends and comprehensive lifecycle management.

```python { .api }
def create(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None, schema: Optional[Schema] = None) -> Dataset: ...
def open(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> Dataset: ...
def open_read_only(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> ReadOnlyDataset: ...
def delete(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> None: ...
def exists(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> bool: ...
def copy(src: str, dst: str, src_creds: Optional[Dict[str, str]] = None, dst_creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> None: ...
```
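
A minimal lifecycle sketch using the calls above; paths and credential values are illustrative:

```python
import deeplake

# Create the dataset only if it does not already exist.
if not deeplake.exists("./products"):
    dataset = deeplake.create("./products")

# Mirror the local dataset to object storage (credentials are placeholders).
deeplake.copy(
    "./products",
    "s3://my-bucket/products",
    dst_creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."},
)

# Readers get a read-only handle; writers open the mutable dataset.
readers = deeplake.open_read_only("s3://my-bucket/products")
writers = deeplake.open("./products")

# Remove the local copy once it is no longer needed.
deeplake.delete("./products")
```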

[Dataset Management](./dataset-management.md)

### Data Access and Manipulation

Row- and column-based data access patterns with comprehensive indexing, slicing, and batch operations for efficient data manipulation.

```python { .api }
class Dataset:
    def __getitem__(self, key: Union[int, slice, str]) -> Union[Row, RowRange, Column]: ...
    def append(self, data: Dict[str, Any]) -> None: ...
    def add_column(self, name: str, dtype: Type) -> None: ...
    def remove_column(self, name: str) -> None: ...

class Column:
    def __getitem__(self, key: Union[int, slice, List[int]]) -> Any: ...
    def __setitem__(self, key: Union[int, slice, List[int]], value: Any) -> None: ...
```
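
A short sketch of row- and column-level access built on the interfaces above, assuming the dataset from Basic Usage:

```python
import deeplake

dataset = deeplake.open("./my_dataset")

# Column access: a string key returns a Column; slices read batches of values.
labels = dataset["labels"]
first_ten = labels[0:10]
labels[0] = "dog"  # in-place update of a single value

# Row access: an integer index returns a Row, a slice returns a RowRange.
row = dataset[0]
subset = dataset[10:20]

# Schema evolution at runtime.
dataset.add_column("captions", deeplake.types.Text())
dataset.remove_column("captions")
```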

[Data Access](./data-access.md)

### Query System

TQL (Tensor Query Language) for complex data filtering, aggregation, and transformation with SQL-like syntax optimized for tensor operations.

```python { .api }
def query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> DatasetView: ...
def prepare_query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> Executor: ...
def explain_query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> ExplainQueryResult: ...

class Executor:
    def run_single(self, parameters: Dict[str, Any]) -> DatasetView: ...
    def run_batch(self, parameters: List[Dict[str, Any]]) -> List[DatasetView]: ...
```
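
A sketch of one-off, explained, and prepared queries; the TQL text, dataset path, and the `$label` parameter placeholder are illustrative:

```python
import deeplake

# One-off query; the FROM clause references the dataset by path or URL.
view = deeplake.query("SELECT * FROM \"./my_dataset\" WHERE labels == 'cat' LIMIT 100")
for row in view:
    print(row["labels"].text())

# Inspect the query plan before running an expensive query.
print(deeplake.explain_query("SELECT * FROM \"./my_dataset\" WHERE labels == 'cat'"))

# Compile once, run with different parameters (placeholder syntax is illustrative).
executor = deeplake.prepare_query("SELECT * FROM \"./my_dataset\" WHERE labels == $label")
cats = executor.run_single({"label": "cat"})
views = executor.run_batch([{"label": "cat"}, {"label": "dog"}])
```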

[Query System](./query-system.md)

### Type System

Rich type hierarchy supporting all ML data types including images, embeddings, audio, video, geometric data, and custom structures with compression and indexing options.

```python { .api }
class Image:
    def __init__(self, dtype: str = "uint8", sample_compression: str = "png"): ...

class Embedding:
    def __init__(self, size: Optional[int] = None, dtype: str = "float32", index_type: Optional[IndexType] = None): ...

class Text:
    def __init__(self, index_type: Optional[TextIndexType] = None): ...

class Array:
    def __init__(self, dtype: DataType, dimensions: Optional[int] = None, shape: Optional[List[int]] = None): ...
```
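
A sketch of declaring columns with these types; passing the Array dtype as a plain string is an assumption made for brevity:

```python
import deeplake
from deeplake.types import Image, Embedding, Text, Array

dataset = deeplake.create("./typed_dataset")
dataset.add_column("photo", Image())                    # uint8 samples, png-compressed by default
dataset.add_column("caption", Text())
dataset.add_column("vector", Embedding(size=384))       # fixed-size float32 embedding
dataset.add_column("bbox", Array(dtype="float32", shape=[4]))  # dtype given as a string here
```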

[Type System](./type-system.md)

### Version Control

Git-like version control with branching, tagging, commit history, and merge operations for dataset evolution and collaboration.

```python { .api }
class Dataset:
    def commit(self, message: str = "") -> str: ...
    def branch(self, name: str) -> Branch: ...
    def tag(self, name: str, message: str = "") -> Tag: ...
    def push(self) -> None: ...
    def pull(self) -> None: ...

class Branch:
    def open(self) -> Dataset: ...
    def delete(self) -> None: ...
    def rename(self, new_name: str) -> None: ...
```
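
A sketch of the version-control flow; `push`/`pull` assume the dataset is connected to a remote copy:

```python
import deeplake

dataset = deeplake.open("./my_dataset")

# Record the current state and mark it with a tag.
commit_id = dataset.commit("Clean baseline")
dataset.tag("v1.0", "First curated release")

# Branch off for experimental changes, then synchronize.
branch = dataset.branch("relabeling")
experiment = branch.open()
experiment.commit("Relabeled ambiguous samples")
experiment.push()
dataset.pull()
```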

[Version Control](./version-control.md)

### Storage System

Multi-cloud storage abstraction supporting local filesystem, S3, GCS, and Azure with built-in compression, encryption, and performance optimization.

```python { .api }
class Reader:
    def get(self, path: str) -> bytes: ...
    def list(self, path: str = "") -> List[str]: ...
    def subdir(self, path: str) -> Reader: ...

class Writer:
    def set(self, path: str, data: bytes) -> None: ...
    def remove(self, path: str) -> None: ...
    def subdir(self, path: str) -> Writer: ...
```
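
A sketch written purely against the Reader/Writer interfaces above; how a concrete handle is obtained depends on the storage backend:

```python
def mirror_prefix(reader: "Reader", writer: "Writer", prefix: str) -> None:
    """Copy every object under `prefix` from one store to another."""
    src = reader.subdir(prefix)
    dst = writer.subdir(prefix)
    for path in src.list():
        dst.set(path, src.get(path))

def prune_prefix(reader: "Reader", writer: "Writer", prefix: str, keep: set) -> None:
    """Remove objects under `prefix` that are not in `keep`."""
    src = reader.subdir(prefix)
    dst = writer.subdir(prefix)
    for path in src.list():
        if path not in keep:
            dst.remove(path)
```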

[Storage System](./storage-system.md)

### Data Import and Export

Comprehensive data import/export capabilities supporting various formats including Parquet, CSV, COCO datasets, and custom data ingestion pipelines.

```python { .api }
def from_parquet(url_or_bytes: Union[str, bytes]) -> ReadOnlyDataset: ...
def from_csv(url_or_bytes: Union[str, bytes]) -> ReadOnlyDataset: ...
def from_coco(images_directory: str, annotation_files: List[str], dest: str, dest_creds: Optional[Dict[str, str]] = None) -> Dataset: ...

class DatasetView:
    def to_csv(self, path: str) -> None: ...
```
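
A sketch of the import/export round trip; file paths are illustrative:

```python
import deeplake

# Ingest tabular data; the resulting datasets are read-only.
parquet_ds = deeplake.from_parquet("./reviews.parquet")
csv_ds = deeplake.from_csv("./reviews.csv")

# Convert a COCO-style detection dataset into a Deep Lake dataset.
coco_ds = deeplake.from_coco(
    images_directory="./coco/images",
    annotation_files=["./coco/annotations/instances_train2017.json"],
    dest="./coco_deeplake",
)

# Export a query result back to CSV.
view = deeplake.query("SELECT * FROM \"./coco_deeplake\" LIMIT 1000")
view.to_csv("./sample.csv")
```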

[Data Import/Export](./data-import-export.md)

### Framework Integration

Seamless integration with PyTorch and TensorFlow for training and inference workflows with optimized data loading and transformation pipelines.

```python { .api }
class DatasetView:
    def pytorch(self, transform: Optional[Callable[[Any], Any]] = None) -> Any: ...
    def tensorflow(self) -> Any: ...
    def batches(self, batch_size: int = 1) -> Iterator[Dict[str, Any]]: ...
```
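
A sketch of the three integration paths; the transform assumes each sample is passed as a dict-like row, and the return values are whatever objects the framework adapters produce:

```python
import deeplake

dataset = deeplake.open("./my_dataset")

def to_training_pair(sample):
    # User-defined transform applied per sample before batching.
    return sample["images"], sample["labels"]

# Framework-specific views of the same dataset.
torch_data = dataset.pytorch(transform=to_training_pair)
tf_data = dataset.tensorflow()

# Framework-agnostic iteration in fixed-size batches of column values.
for batch in dataset.batches(batch_size=32):
    print(len(batch["labels"]))
```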

[Framework Integration](./framework-integration.md)

### Error Handling

Comprehensive exception handling for various failure scenarios including authentication, authorization, storage, dataset operations, and data validation, with detailed error information for debugging and recovery.

```python { .api }
class AuthenticationError:
    """Authentication failed or credentials invalid."""

class AuthorizationError:
    """User lacks permissions for requested operation."""

class NotFoundError:
    """Requested dataset or resource not found."""

class StorageAccessDenied:
    """Access denied to storage location."""

class BranchExistsError:
    """Branch with given name already exists."""

class ColumnAlreadyExistsError:
    """Column with given name already exists."""
```
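
A sketch of catching these exceptions, assuming they are raised from (and importable via) the top-level `deeplake` module:

```python
import deeplake

try:
    dataset = deeplake.open("s3://my-bucket/reviews")
except deeplake.NotFoundError:
    # Fall back to creating the dataset when it does not exist yet.
    dataset = deeplake.create("s3://my-bucket/reviews")
except deeplake.AuthenticationError:
    raise SystemExit("Check your Deep Lake token.")
except deeplake.StorageAccessDenied:
    raise SystemExit("The current credentials cannot access this bucket.")

try:
    dataset.add_column("labels", deeplake.types.Text())
except deeplake.ColumnAlreadyExistsError:
    pass  # Column was added in an earlier run; safe to continue.
```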

[Error Handling](./error-handling.md)

### Schema Templates

Pre-defined schema templates for common ML use cases including text embeddings, COCO datasets, and custom schema creation patterns.

```python { .api }
class TextEmbeddings:
    def __init__(self, embedding_size: int, quantize: bool = False): ...

class COCOImages:
    def __init__(self, embedding_size: int, quantize: bool = False, objects: bool = True, keypoints: bool = False, stuffs: bool = False): ...
```
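
A sketch that passes template instances to `deeplake.create`, assuming the templates satisfy the `schema` parameter shown under Dataset Management:

```python
import deeplake
from deeplake.schemas import TextEmbeddings, COCOImages

# Text-plus-embedding store for retrieval workloads.
rag_ds = deeplake.create("./rag_store", schema=TextEmbeddings(embedding_size=768))

# COCO-style image dataset with object annotations and quantized embeddings.
coco_ds = deeplake.create(
    "./coco_store",
    schema=COCOImages(embedding_size=512, quantize=True, objects=True),
)
```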

[Schema Templates](./schema-templates.md)

### Client and Configuration

Client management, telemetry, and configuration utilities for Deep Lake integration and monitoring.

```python { .api }
class Client:
    """Deep Lake client for dataset operations and authentication."""

class TelemetryClient:
    """Telemetry client for usage tracking and analytics."""

def client() -> Client:
    """Get current Deep Lake client instance."""

def telemetry_client() -> TelemetryClient:
    """Get current telemetry client instance."""

def disconnect() -> None:
    """Disconnect from Deep Lake services."""
```

### Utilities and Helpers

Utility functions and helper classes for data generation, caching, and system optimization.

```python { .api }
class Random:
    """Random data generation utilities."""

def random() -> Random:
    """Get random data generator instance."""

def _create_global_cache() -> None:
    """Create global cache for performance optimization."""

def __prepare_atfork() -> None:
    """Prepare Deep Lake for fork-based multiprocessing."""
```

## Types

### Core Dataset Classes

```python { .api }
class Dataset:
    """Primary mutable dataset class for read-write operations."""
    name: str
    description: str
    metadata: Metadata
    schema: Schema
    version: Version
    history: History
    branches: Branches
    tags: Tags

class ReadOnlyDataset:
    """Read-only dataset access."""
    name: str
    description: str
    metadata: ReadOnlyMetadata
    schema: SchemaView
    version: Version
    history: History
    branches: BranchesView
    tags: TagsView

class DatasetView:
    """Query result view of dataset."""
    schema: SchemaView
```

### Schema Classes

```python { .api }
class Schema:
    """Dataset schema management."""
    columns: List[ColumnDefinition]

class ColumnDefinition:
    """Column schema information."""
    name: str
    dtype: Type
```

### Version Control Classes

```python { .api }
class Version:
    """Single version information."""
    id: str
    message: str
    timestamp: str
    client_timestamp: str

class Branch:
    """Dataset branch management."""
    id: str
    name: str
    timestamp: str
    base: str

class Tag:
    """Dataset tag management."""
    id: str
    name: str
    message: str
    version: str
    timestamp: str
```

### Async Classes

```python { .api }
class Future[T]:
    """Asynchronous operation result."""
    def result(self) -> T: ...
    def is_completed(self) -> bool: ...
    def cancel(self) -> bool: ...

class FutureVoid:
    """Asynchronous void operation."""
    def wait(self) -> None: ...
    def is_completed(self) -> bool: ...
    def cancel(self) -> bool: ...
```