# Deep Lake

Deep Lake is a database for AI powered by a storage format optimized for deep-learning applications. It provides comprehensive dataset management, querying capabilities, and seamless integration with popular ML frameworks, enabling both data storage/retrieval for LLM applications and dataset management for deep learning model training.

## Package Information

- **Package Name**: deeplake
- **Language**: Python
- **Installation**: `pip install deeplake`

## Core Imports

```python
import deeplake
```

Common type imports:

```python
from deeplake import types
from deeplake.types import Image, Text, Embedding, Array
```

Schema template imports:

```python
from deeplake.schemas import TextEmbeddings, COCOImages
```

## Basic Usage

```python
import deeplake

# Create a new dataset
dataset = deeplake.create("./my_dataset")

# Add columns with types
dataset.add_column("images", deeplake.types.Image())
dataset.add_column("labels", deeplake.types.Text())
dataset.add_column("embeddings", deeplake.types.Embedding(size=768))

# Append a row of data
dataset.append({
    "images": "path/to/image.jpg",
    "labels": "cat",
    "embeddings": [0.1] * 768,  # placeholder 768-dimensional vector
})

# Commit changes
dataset.commit("Added initial data")

# Query data using TQL (Tensor Query Language)
results = dataset.query("SELECT * WHERE labels == 'cat'")
for row in results:
    print(row["labels"])

# Open existing dataset
dataset = deeplake.open("./my_dataset")
print(f"Dataset has {len(dataset)} rows")

# Framework integration
def my_transform(sample):
    return sample  # replace with real preprocessing

pytorch_dataloader = dataset.pytorch(transform=my_transform)
tensorflow_dataset = dataset.tensorflow()
```

## Architecture

Deep Lake's architecture centers around datasets as the primary abstraction, with the following key components:

- **Dataset/DatasetView**: Core data containers supporting CRUD operations, version control, and framework integration
- **Column/ColumnView**: Typed columns storing homogeneous data with optional indexing for performance
- **Row/RowView**: Individual record access with dictionary-like interfaces
- **Schema**: Type definitions and column specifications for data validation
- **Type System**: Rich type hierarchy supporting ML data types (Image, Embedding, Video, etc.)
- **Storage Layer**: Multi-cloud storage abstraction with built-in compression and lazy loading
- **Query Engine**: TQL (Tensor Query Language) for complex data filtering and aggregation
- **Version Control**: Git-like branching, tagging, and commit history for dataset evolution

This design enables Deep Lake to handle data of any size in a serverless manner while maintaining unified access through a single API, supporting all data types (embeddings, audio, text, videos, images, PDFs, annotations) with data versioning and lineage capabilities.

## Capabilities

### Dataset Management

Core functionality for creating, opening, deleting, and copying datasets with support for various storage backends and comprehensive lifecycle management.

```python { .api }
def create(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None, schema: Optional[Schema] = None) -> Dataset: ...
def open(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> Dataset: ...
def open_read_only(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> ReadOnlyDataset: ...
def delete(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> None: ...
def exists(url: str, creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> bool: ...
def copy(src: str, dst: str, src_creds: Optional[Dict[str, str]] = None, dst_creds: Optional[Dict[str, str]] = None, token: Optional[str] = None) -> None: ...
```

[Dataset Management](./dataset-management.md)

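A minimal lifecycle sketch using only the functions above; the local paths are placeholders.

```python
import deeplake

# Create the dataset only if nothing exists at the target URL yet
if not deeplake.exists("./lifecycle_demo"):
    deeplake.create("./lifecycle_demo")

# Reopen for writing, or read-only for safe shared access
ds = deeplake.open("./lifecycle_demo")
ro = deeplake.open_read_only("./lifecycle_demo")

# Copy to a second location, then remove the original
deeplake.copy("./lifecycle_demo", "./lifecycle_backup")
deeplake.delete("./lifecycle_demo")
```
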
### Data Access and Manipulation

Row- and column-based data access patterns with comprehensive indexing, slicing, and batch operations for efficient data manipulation.

```python { .api }
class Dataset:
    def __getitem__(self, key: Union[int, slice, str]) -> Union[Row, RowRange, Column]: ...
    def append(self, data: Dict[str, Any]) -> None: ...
    def add_column(self, name: str, dtype: Type) -> None: ...
    def remove_column(self, name: str) -> None: ...

class Column:
    def __getitem__(self, key: Union[int, slice, List[int]]) -> Any: ...
    def __setitem__(self, key: Union[int, slice, List[int]], value: Any) -> None: ...
```

[Data Access](./data-access.md)

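The signatures above imply the following indexing patterns; this sketch assumes the dataset created in Basic Usage.

```python
import deeplake

ds = deeplake.open("./my_dataset")

# Row access: an integer returns a Row, a slice returns a RowRange
first_row = ds[0]
first_ten = ds[0:10]

# Column access: a string key returns a Column
labels = ds["labels"]

# Columns accept integer, slice, and list-of-int indexing
print(labels[0])
subset = labels[0:100]
picked = labels[[1, 5, 42]]

# Columns are writable in place
labels[0] = "dog"
```
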
### Query System

TQL (Tensor Query Language) for complex data filtering, aggregation, and transformation with SQL-like syntax optimized for tensor operations.

```python { .api }
def query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> DatasetView: ...
def prepare_query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> Executor: ...
def explain_query(query: str, token: Optional[str] = None, creds: Optional[Dict[str, str]] = None) -> ExplainQueryResult: ...

class Executor:
    def run_single(self, parameters: Dict[str, Any]) -> DatasetView: ...
    def run_batch(self, parameters: List[Dict[str, Any]]) -> List[DatasetView]: ...
```

[Query System](./query-system.md)

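A sketch of the three entry points; referencing a stored dataset by quoted URL inside `FROM` and the empty parameters dict are assumptions for illustration.

```python
import deeplake

# Ad-hoc query; TQL refers to stored datasets by quoted URL
view = deeplake.query('SELECT * FROM "./my_dataset" WHERE labels == \'cat\'')
print(len(view))

# Inspect the plan before running an expensive query
print(deeplake.explain_query('SELECT * FROM "./my_dataset" WHERE labels == \'cat\''))

# Precompile once, execute repeatedly
executor = deeplake.prepare_query('SELECT * FROM "./my_dataset"')
result = executor.run_single({})  # no bound parameters in this sketch
```
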
### Type System

Rich type hierarchy supporting all ML data types including images, embeddings, audio, video, geometric data, and custom structures with compression and indexing options.

```python { .api }
class Image:
    def __init__(self, dtype: str = "uint8", sample_compression: str = "png"): ...

class Embedding:
    def __init__(self, size: Optional[int] = None, dtype: str = "float32", index_type: Optional[IndexType] = None): ...

class Text:
    def __init__(self, index_type: Optional[TextIndexType] = None): ...

class Array:
    def __init__(self, dtype: DataType, dimensions: Optional[int] = None, shape: Optional[List[int]] = None): ...
```

[Type System](./type-system.md)

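Column declarations drawn from the constructors above; the column names, the `"uint8"` dtype string passed to `Array`, and the shape values are placeholders.

```python
import deeplake

ds = deeplake.create("./typed_dataset")

# Image column with per-sample PNG compression (the default shown above)
ds.add_column("photo", deeplake.types.Image(dtype="uint8", sample_compression="png"))

# Fixed-size embedding column; index_type is optional
ds.add_column("vector", deeplake.types.Embedding(size=384))

# Text column; pass index_type to enable text indexing
ds.add_column("caption", deeplake.types.Text())

# Generic n-dimensional array with an explicit shape
ds.add_column("mask", deeplake.types.Array("uint8", shape=[512, 512]))
```
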
### Version Control

Git-like version control with branching, tagging, commit history, and merge operations for dataset evolution and collaboration.

```python { .api }
class Dataset:
    def commit(self, message: str = "") -> str: ...
    def branch(self, name: str) -> Branch: ...
    def tag(self, name: str, message: str = "") -> Tag: ...
    def push(self) -> None: ...
    def pull(self) -> None: ...

class Branch:
    def open(self) -> Dataset: ...
    def delete(self) -> None: ...
    def rename(self, new_name: str) -> None: ...
```

[Version Control](./version-control.md)

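A sketch of the flow the API above suggests; branch and tag names are placeholders, and `push`/`pull` only apply to datasets that have a remote copy.

```python
import deeplake

ds = deeplake.open("./my_dataset")

# Record the current state; commit returns the new version id
version_id = ds.commit("Labeled first batch")

# Branch off to experiment without touching the main history
experiment = ds.branch("relabel-experiment")
ds_experiment = experiment.open()

# Tag a known-good version for later reference
ds.tag("v1.0", "First curated release")

# Synchronize with the remote copy, where one exists
ds.push()
ds.pull()
```
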
### Storage System

Multi-cloud storage abstraction supporting local filesystem, S3, GCS, Azure with built-in compression, encryption, and performance optimization.

```python { .api }
class Reader:
    def get(self, path: str) -> bytes: ...
    def list(self, path: str = "") -> List[str]: ...
    def subdir(self, path: str) -> Reader: ...

class Writer:
    def set(self, path: str, data: bytes) -> None: ...
    def remove(self, path: str) -> None: ...
    def subdir(self, path: str) -> Writer: ...
```

[Storage System](./storage-system.md)

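The section does not show how `Reader`/`Writer` instances are obtained, so this sketch takes them as arguments and only demonstrates the traversal pattern the interfaces define.

```python
def mirror(reader, writer, prefix: str = "") -> None:
    """Copy every object under `prefix` from one store to another.

    `reader` and `writer` are the Reader/Writer interfaces above; how
    they are constructed is backend-specific and not shown here.
    """
    for path in reader.list(prefix):
        writer.set(path, reader.get(path))

def mirror_subtree(reader, writer, subdir: str) -> None:
    """Mirror a single sub-tree using scoped Reader/Writer views."""
    mirror(reader.subdir(subdir), writer.subdir(subdir))
```
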
### Data Import and Export

Comprehensive data import/export capabilities supporting various formats including Parquet, CSV, COCO datasets, and custom data ingestion pipelines.

```python { .api }
def from_parquet(url_or_bytes: Union[str, bytes]) -> ReadOnlyDataset: ...
def from_csv(url_or_bytes: Union[str, bytes]) -> ReadOnlyDataset: ...
def from_coco(images_directory: str, annotation_files: List[str], dest: str, dest_creds: Optional[Dict[str, str]] = None) -> Dataset: ...

class DatasetView:
    def to_csv(self, path: str) -> None: ...
```

[Data Import/Export](./data-import-export.md)

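A sketch of the ingestion helpers; all file paths are placeholders, and calling `to_csv` on the loaded dataset assumes read-only datasets expose the `DatasetView` export surface.

```python
import deeplake

# Columnar files load as read-only datasets
parquet_ds = deeplake.from_parquet("./data/records.parquet")
csv_ds = deeplake.from_csv("./data/records.csv")

# COCO-style ingestion materializes a new dataset at `dest`
coco_ds = deeplake.from_coco(
    images_directory="./coco/images",
    annotation_files=["./coco/annotations/instances_train.json"],
    dest="./coco_dataset",
)

# Export back to CSV (assumes the view export API is available here)
parquet_ds.to_csv("./exports/records.csv")
```
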
### Framework Integration

Seamless integration with PyTorch and TensorFlow for training and inference workflows with optimized data loading and transformation pipelines.

```python { .api }
class DatasetView:
    def pytorch(self, transform: Optional[Callable[[Any], Any]] = None) -> Any: ...
    def tensorflow(self) -> Any: ...
    def batches(self, batch_size: int = 1) -> Iterator[Dict[str, Any]]: ...
```

[Framework Integration](./framework-integration.md)

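A training-oriented sketch; the sample structure handed to `transform` (a dict keyed by column name) is an assumption, as is the column layout of each batch.

```python
import deeplake

ds = deeplake.open("./my_dataset")

# PyTorch: optional per-sample transform applied at load time
def to_pair(sample):
    # assumes samples arrive as dicts keyed by column name
    return sample["images"], sample["labels"]

loader = ds.pytorch(transform=to_pair)

# TensorFlow: hand the dataset to tf.data-style pipelines
tf_ds = ds.tensorflow()

# Framework-agnostic batching: dicts of column -> batch values
for batch in ds.batches(batch_size=32):
    print(len(batch["labels"]))
```
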
### Error Handling

Comprehensive exception handling for various failure scenarios including authentication, authorization, storage, dataset operations, and data validation with detailed error information for debugging and recovery.

```python { .api }
class AuthenticationError:
    """Authentication failed or credentials invalid."""

class AuthorizationError:
    """User lacks permissions for requested operation."""

class NotFoundError:
    """Requested dataset or resource not found."""

class StorageAccessDenied:
    """Access denied to storage location."""

class BranchExistsError:
    """Branch with given name already exists."""

class ColumnAlreadyExistsError:
    """Column with given name already exists."""
```

[Error Handling](./error-handling.md)

### Schema Templates

Pre-defined schema templates for common ML use cases including text embeddings, COCO datasets, and custom schema creation patterns.

```python { .api }
class TextEmbeddings:
    def __init__(self, embedding_size: int, quantize: bool = False): ...

class COCOImages:
    def __init__(self, embedding_size: int, quantize: bool = False, objects: bool = True, keypoints: bool = False, stuffs: bool = False): ...
```

[Schema Templates](./schema-templates.md)

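Templates plug into the `schema` parameter of `deeplake.create` shown under Dataset Management; the paths and embedding size here are placeholders.

```python
import deeplake
from deeplake.schemas import TextEmbeddings

# Create a vector-search-ready dataset from the template
ds = deeplake.create("./rag_store", schema=TextEmbeddings(embedding_size=768))

# Quantization trades some recall for a smaller index
ds_q = deeplake.create("./rag_store_q", schema=TextEmbeddings(embedding_size=768, quantize=True))
```
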
### Client and Configuration

Client management, telemetry, and configuration utilities for Deep Lake integration and monitoring.

```python { .api }
class Client:
    """Deep Lake client for dataset operations and authentication."""

class TelemetryClient:
    """Telemetry client for usage tracking and analytics."""

def client() -> Client:
    """Get current Deep Lake client instance."""

def telemetry_client() -> TelemetryClient:
    """Get current telemetry client instance."""

def disconnect() -> None:
    """Disconnect from Deep Lake services."""
```

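A minimal sketch of the module-level accessors above.

```python
import deeplake

# Handles to the active clients
c = deeplake.client()
t = deeplake.telemetry_client()

# Tear down connections, e.g. at process exit
deeplake.disconnect()
```
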
### Utilities and Helpers

Utility functions and helper classes for data generation, caching, and system optimization.

```python { .api }
class Random:
    """Random data generation utilities."""

def random() -> Random:
    """Get random data generator instance."""

def _create_global_cache() -> None:
    """Create global cache for performance optimization."""

def __prepare_atfork() -> None:
    """Prepare Deep Lake for fork-based multiprocessing."""
```

## Types

### Core Dataset Classes

```python { .api }
class Dataset:
    """Primary mutable dataset class for read-write operations."""
    name: str
    description: str
    metadata: Metadata
    schema: Schema
    version: Version
    history: History
    branches: Branches
    tags: Tags

class ReadOnlyDataset:
    """Read-only dataset access."""
    name: str
    description: str
    metadata: ReadOnlyMetadata
    schema: SchemaView
    version: Version
    history: History
    branches: BranchesView
    tags: TagsView

class DatasetView:
    """Query result view of dataset."""
    schema: SchemaView
```

### Schema Classes

```python { .api }
class Schema:
    """Dataset schema management."""
    columns: List[ColumnDefinition]

class ColumnDefinition:
    """Column schema information."""
    name: str
    dtype: Type
```

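Schema introspection with the classes above, assuming the dataset from Basic Usage.

```python
import deeplake

ds = deeplake.open("./my_dataset")

# Each ColumnDefinition carries a name and a dtype
for column in ds.schema.columns:
    print(f"{column.name}: {column.dtype}")
```
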
### Version Control Classes

```python { .api }
class Version:
    """Single version information."""
    id: str
    message: str
    timestamp: str
    client_timestamp: str

class Branch:
    """Dataset branch management."""
    id: str
    name: str
    timestamp: str
    base: str

class Tag:
    """Dataset tag management."""
    id: str
    name: str
    message: str
    version: str
    timestamp: str
```

### Async Classes

```python { .api }
class Future[T]:
    """Asynchronous operation result."""
    def result(self) -> T: ...
    def is_completed(self) -> bool: ...
    def cancel(self) -> bool: ...

class FutureVoid:
    """Asynchronous void operation."""
    def wait(self) -> None: ...
    def is_completed(self) -> bool: ...
    def cancel(self) -> bool: ...
```

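The stubs do not say which Deep Lake calls return futures, so these helpers only show the completion and cancellation pattern the interfaces define; `futures` stands in for the output of whatever async API produced them.

```python
def drain(futures):
    """Collect results from a list of Future[T] objects, blocking as needed."""
    return [f.result() for f in futures]

def cancel_pending(futures):
    """Cancel any future that has not completed yet."""
    for f in futures:
        if not f.is_completed():
            f.cancel()
```
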