Work with Domino Datasets - high-performance, versioned filesystem storage. Covers dataset creation, snapshots for versioning, sharing across projects, mounting paths (/domino/datasets/), and performance optimization. Use when managing data storage, creating reproducible data versions, or sharing data between projects.
This skill helps users work with Domino Datasets - high-performance, versioned filesystem storage for data science projects.
Activate this skill when users want to:
A Domino Dataset is:
training-data)

from domino import Domino
domino = Domino("project-owner/project-name")
# Create a new dataset
dataset = domino.datasets_create(
name="training-data",
description="Training data for classification model"
)

Dataset paths differ by project type. Domino has two project types, each with its own mount structure.
DFS projects use /domino as the root:
/domino
  |--/datasets
    |--/local <== Local datasets and snapshots
      |--/clapton <== Read-write dataset for owner and editor, read-only for reader
      |--/mingus <== Read-write dataset for owner and editor, read-only for reader
      |--/snapshots <== Snapshot folder organized by dataset
        |--/clapton <== Read-write for owner and editor, read-only for reader
          |--/tag1 <== Mounted under latest tag
          |--/1 <== Always mounted under the snapshot number
          |--/2
        |--/mingus
          |--/tag2
          |--/1
          |--/2
    |--/ella <== Read-write shared dataset for owner and editor, read-only for reader
    |--/davis <== Read-write shared dataset for owner and editor, read-only for reader
    |--/snapshots <== Shared dataset snapshots organized by dataset
      |--/ella <== Read-write for owner and editor, read-only for reader
        |--/tag3 <== Mounted under latest tag
        |--/1 <== Always mounted under the snapshot number
        |--/2
      |--/davis
        |--/tag4
        |--/1
        |--/2

| Dataset Type | Path |
|---|---|
| Local datasets | /domino/datasets/local/{dataset-name}/ |
| Local snapshots | /domino/datasets/local/snapshots/{dataset-name}/{tag-or-number}/ |
| Shared datasets | /domino/datasets/{dataset-name}/ |
| Shared snapshots | /domino/datasets/snapshots/{dataset-name}/{tag-or-number}/ |
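
The snapshot layout above (one folder per dataset, with each snapshot mounted under its number and optionally a tag) can be explored with plain directory listing. The following is a minimal sketch, not part of the Domino SDK: `list_snapshots` is a hypothetical helper, and it assumes that tags are never purely numeric so that numeric folder names can be treated as snapshot numbers. The demo runs against a temporary directory that mimics the layout; inside a DFS execution the root would be `/domino/datasets/local/snapshots`.

```python
import os
import tempfile


def list_snapshots(snapshots_root, dataset_name):
    """Return (numbers, tags) found under <snapshots_root>/<dataset_name>/.

    Hypothetical helper: assumes numeric folder names are snapshot numbers
    and every other folder name is a tag.
    """
    base = os.path.join(snapshots_root, dataset_name)
    numbers, tags = [], []
    for entry in os.listdir(base):
        # Skip anything that is not a snapshot directory.
        if not os.path.isdir(os.path.join(base, entry)):
            continue
        if entry.isdigit():
            numbers.append(int(entry))
        else:
            tags.append(entry)
    return sorted(numbers), sorted(tags)


# Demo against a temporary directory that mimics the snapshot layout.
demo_root = tempfile.mkdtemp()
for name in ("1", "2", "tag1"):
    os.makedirs(os.path.join(demo_root, "clapton", name))

numbers, tags = list_snapshots(demo_root, "clapton")
print(numbers, tags)
```

Passing the root as a parameter keeps the helper usable for both local and shared snapshot folders.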
Git-based projects use /mnt as the root:
/mnt
  |--/data <== Local datasets and snapshots
    |--/clapton <== Read-write dataset for owner and editor, read-only for reader
    |--/mingus <== Read-write dataset for owner and editor, read-only for reader
    |--/snapshots <== Snapshot folder organized by dataset
      |--/clapton <== Read-write for owner and editor, read-only for reader
        |--/tag1 <== Mounted under latest tag
        |--/1 <== Always mounted under the snapshot number
        |--/2
      |--/mingus
        |--/tag2
        |--/1
        |--/2
  |--/imported
    |--/data
      |--/ella <== Read-write shared dataset for owner and editor, read-only for reader
      |--/davis <== Read-write shared dataset for owner and editor, read-only for reader
      |--/snapshots <== Shared dataset snapshots organized by dataset
        |--/ella <== Read-write for owner and editor, read-only for reader
          |--/tag3 <== Mounted under latest tag
          |--/1 <== Always mounted under the snapshot number
          |--/2
        |--/davis
          |--/tag4
          |--/1
          |--/2

| Dataset Type | Path |
|---|---|
| Local datasets | /mnt/data/{dataset-name}/ |
| Local snapshots | /mnt/data/snapshots/{dataset-name}/{tag-or-number}/ |
| Shared datasets | /mnt/imported/data/{dataset-name}/ |
| Shared snapshots | /mnt/imported/data/snapshots/{dataset-name}/{tag-or-number}/ |
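
The two path tables can be folded into one small lookup so that scripts do not hard-code a project type. This is an illustrative sketch only: `ROOTS` and `dataset_path` are hypothetical names, not Domino APIs, and the roots are copied from the tables above.

```python
from pathlib import PurePosixPath

# Mount roots taken from the path tables, keyed by (project type, shared?).
ROOTS = {
    ("dfs", False): "/domino/datasets/local",
    ("dfs", True): "/domino/datasets",
    ("git", False): "/mnt/data",
    ("git", True): "/mnt/imported/data",
}


def dataset_path(project_type, dataset_name, shared=False, snapshot=None):
    """Build the mount path of a dataset, or of one of its snapshots.

    Hypothetical helper. `snapshot` may be a tag (str) or a snapshot number.
    """
    root = PurePosixPath(ROOTS[(project_type, shared)])
    if snapshot is None:
        return str(root / dataset_name)
    return str(root / "snapshots" / dataset_name / str(snapshot))


print(dataset_path("dfs", "training-data"))
print(dataset_path("git", "training-data", snapshot=2))
print(dataset_path("git", "training-data", shared=True, snapshot="v1.0"))
```

An unknown project type raises `KeyError`, which is arguably preferable to silently building a path that does not exist.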
Check which paths exist in your execution:
import os

if os.path.exists("/domino/datasets"):
    print("DFS Project")
    dataset_root = "/domino/datasets/local"
elif os.path.exists("/mnt/data"):
    print("Git-Based Project")
    dataset_root = "/mnt/data"
else:
    # Neither mount point exists, so no dataset is mounted in this execution
    dataset_root = None

Both project types follow the same permission model:
import pandas as pd
# Git-Based Project
df = pd.read_csv("/mnt/data/training-data/customers.csv")
# DFS Project
df = pd.read_csv("/domino/datasets/local/training-data/customers.csv")
# List files
import os
files = os.listdir("/mnt/data/training-data/") # Git-Based
files = os.listdir("/domino/datasets/local/training-data/") # DFS

# For large uploads, use CLI
domino upload /local/path/to/data /mnt/data/training-data/

import shutil
# Copy from local to dataset
shutil.copy("local_file.csv", "/mnt/data/training-data/")
# Write directly
df.to_csv("/mnt/data/training-data/processed.csv", index=False)

A snapshot is a read-only, immutable version of your dataset at a point in time. Use snapshots for:
# Via Python SDK
snapshot = domino.datasets_snapshot(
dataset_name="training-data",
tag="v1.0"
)

Or via UI:
v1.0, production)

# Current (live) dataset contents
df = pd.read_csv("/mnt/data/training-data/data.csv")
# Specific tagged snapshot (Git-based project layout)
df = pd.read_csv("/mnt/data/snapshots/training-data/v1.0/data.csv")

Tags provide friendly names for snapshots:
- production: Current production data
- v1.0, v2.0: Version numbers
- 2024-01-15: Date-based tags

Tags can be moved to different snapshots:
# Move 'production' tag to latest snapshot
domino.datasets_tag(
dataset_name="training-data",
snapshot_id="snapshot-123",
tag="production"
)

# Import dataset from another project
# Configured in project settings
df = pd.read_csv("/mnt/imported/data/shared-dataset/data.csv")

| Data Type | Storage |
|---|---|
| Large training data | Domino Dataset |
| Model artifacts | /mnt/artifacts/ |
| Code | Git/Project files |
| Temporary files | /tmp/ |
/mnt/data/my-dataset/
├── raw/
│   ├── customers.csv
│   └── transactions.csv
├── processed/
│   ├── features.parquet
│   └── labels.parquet
└── metadata/
    └── schema.json

# Parquet for tabular data (faster, smaller)
df.to_parquet("/mnt/data/dataset/data.parquet")
# Feather for pandas DataFrames
df.to_feather("/mnt/data/dataset/data.feather")
# HDF5 for numerical arrays
import h5py
with h5py.File("/mnt/data/dataset/data.h5", "w") as f:
    f.create_dataset("features", data=features)

Include README and schema:
import json

# Write metadata
metadata = {
    "created": "2024-01-15",
    "source": "Customer database",
    "columns": {"id": "int", "name": "string", "value": "float"}
}
with open("/mnt/data/dataset/metadata.json", "w") as f:
    json.dump(metadata, f)

# Create snapshot before processing
domino.datasets_snapshot(
dataset_name="training-data",
tag="pre-processing"
)
# Then modify data
process_data()

# Read in chunks
chunks = pd.read_csv(
"/mnt/data/dataset/large_file.csv",
chunksize=100000
)
for chunk in chunks:
    process(chunk)

import dask.dataframe as dd
# Read without loading into memory
df = dd.read_parquet("/mnt/data/dataset/large_data.parquet")
# Process lazily
result = df.groupby("category").mean().compute()

import numpy as np
# Memory-map large arrays
data = np.memmap(
"/mnt/data/dataset/features.dat",
dtype='float32',
mode='r',
shape=(1000000, 100)
)
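
For environments where NumPy is not available, the same memory-mapping idea can be sketched with the standard library alone. The example below is an assumption-laden illustration, not a replacement for `np.memmap`: `read_row` is a hypothetical helper, the file is assumed to hold row-major little-endian float32 values, and the demo writes a tiny file to a temporary directory instead of a dataset mount.

```python
import mmap
import os
import struct
import tempfile

FLOAT32_SIZE = 4  # bytes per float32 value


def read_row(path, row, n_cols):
    """Read one float32 row from a row-major binary file via mmap.

    Hypothetical helper: assumes little-endian float32 data and a known
    column count. Only the requested row is copied out of the mapping.
    """
    row_bytes = n_cols * FLOAT32_SIZE
    start = row * row_bytes
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if row < 0 or start + row_bytes > len(mm):
                raise IndexError("row out of range")
            return struct.unpack(f"<{n_cols}f", mm[start:start + row_bytes])


# Demo: write a 3 x 4 float32 matrix, then read back only the last row.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.dat")
with open(path, "wb") as f:
    for r in range(3):
        f.write(struct.pack("<4f", *[r * 10.0 + c for c in range(4)]))

row = read_row(path, 2, 4)
print(row)
```

As with `np.memmap`, the caller has to know the element type and shape up front, because the raw file carries no header describing them.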