
# Dataset Operations

The Dataset Operations system provides comprehensive capabilities for working with evaluation datasets, linking them to traces and observations, and running experiments. Datasets are collections of input-output pairs used for systematic evaluation of LLM applications.

## Capabilities

### Get Dataset

Retrieve a dataset by name with all its items, link functions, and experiment functionality.

```typescript { .api }
/**
 * Retrieves a dataset by name with all its items and experiment functionality
 *
 * Fetches a dataset and all its associated items with automatic pagination handling.
 * The returned dataset includes enhanced functionality for linking items to traces
 * and running experiments directly on the dataset.
 *
 * @param name - The name of the dataset to retrieve
 * @param options - Optional configuration for data fetching
 * @returns Promise resolving to enhanced dataset with items and experiment capabilities
 */
async get(
  name: string,
  options?: {
    /** Number of items to fetch per page (default: 50) */
    fetchItemsPageSize?: number;
  }
): Promise<FetchedDataset>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Basic dataset retrieval
const dataset = await langfuse.dataset.get("my-evaluation-dataset");

console.log(`Dataset: ${dataset.name}`);
console.log(`Description: ${dataset.description}`);
console.log(`Items: ${dataset.items.length}`);
console.log(`Metadata:`, dataset.metadata);

// Access dataset items
for (const item of dataset.items) {
  console.log('Input:', item.input);
  console.log('Expected Output:', item.expectedOutput);
  console.log('Metadata:', item.metadata);
}
```

**Handling Large Datasets:**

```typescript
// For large datasets, a larger page size reduces the number of fetch requests
const largeDataset = await langfuse.dataset.get(
  "large-benchmark-dataset",
  { fetchItemsPageSize: 100 }
);

console.log(`Loaded ${largeDataset.items.length} items`);

// Process items in batches
const batchSize = 10;
for (let i = 0; i < largeDataset.items.length; i += batchSize) {
  const batch = largeDataset.items.slice(i, i + batchSize);
  // Process batch...
}
```

**Accessing Dataset Properties:**

```typescript
const dataset = await langfuse.dataset.get("qa-dataset");

// Dataset metadata
console.log(dataset.id);          // Dataset ID
console.log(dataset.name);        // Dataset name
console.log(dataset.description); // Description
console.log(dataset.metadata);    // Custom metadata
console.log(dataset.projectId);   // Project ID
console.log(dataset.createdAt);   // Creation timestamp
console.log(dataset.updatedAt);   // Last update timestamp

// Item properties
const item = dataset.items[0];
console.log(item.id);                  // Item ID
console.log(item.datasetId);           // Parent dataset ID
console.log(item.input);               // Input data
console.log(item.expectedOutput);      // Expected output
console.log(item.metadata);            // Item metadata
console.log(item.sourceTraceId);       // Source trace (if any)
console.log(item.sourceObservationId); // Source observation (if any)
console.log(item.status);              // Status (ACTIVE or ARCHIVED)
```

## Types

### FetchedDataset

Enhanced dataset object with additional methods for linking and experiments.

```typescript { .api }
/**
 * Enhanced dataset with linking and experiment functionality
 *
 * Extends the base Dataset type with:
 * - Array of items with link functions for connecting to traces
 * - runExperiment method for executing experiments directly on the dataset
 *
 * @public
 */
type FetchedDataset = Dataset & {
  /** Dataset items with link functionality for connecting to traces */
  items: (DatasetItem & { link: LinkDatasetItemFunction })[];

  /** Function to run experiments directly on this dataset */
  runExperiment: RunExperimentOnDataset;
};
```

**Properties from Dataset:**

```typescript { .api }
interface Dataset {
  /** Unique identifier for the dataset */
  id: string;

  /** Human-readable name for the dataset */
  name: string;

  /** Optional description explaining the dataset's purpose */
  description?: string | null;

  /** Custom metadata attached to the dataset */
  metadata?: Record<string, any> | null;

  /** Project ID this dataset belongs to */
  projectId: string;

  /** Timestamp when the dataset was created */
  createdAt: string;

  /** Timestamp when the dataset was last updated */
  updatedAt: string;
}
```

### DatasetItem

Individual item within a dataset containing input, expected output, and metadata.

```typescript { .api }
/**
 * Dataset item with input/output pair for evaluation
 *
 * Represents a single test case within a dataset. Each item can contain
 * any type of input and expected output, along with optional metadata
 * and linkage to source traces/observations.
 *
 * @public
 */
interface DatasetItem {
  /** Unique identifier for the dataset item */
  id: string;

  /** ID of the parent dataset */
  datasetId: string;

  /** Name of the parent dataset */
  datasetName: string;

  /** Input data (can be any type: string, object, array, etc.) */
  input?: any;

  /** Expected output for evaluation (can be any type) */
  expectedOutput?: any;

  /** Custom metadata for this item */
  metadata?: Record<string, any> | null;

  /** ID of the trace this item was created from (if applicable) */
  sourceTraceId?: string | null;

  /** ID of the observation this item was created from (if applicable) */
  sourceObservationId?: string | null;

  /** Status of the item (ACTIVE or ARCHIVED) */
  status: "ACTIVE" | "ARCHIVED";

  /** Timestamp when the item was created */
  createdAt: string;

  /** Timestamp when the item was last updated */
  updatedAt: string;
}
```
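
Because each item carries a `status` field, a common pattern is to filter down to `ACTIVE` items before processing. A minimal sketch — the `activeItems` helper and the sample objects are illustrative, not part of the SDK:

```typescript
// Illustrative helper (not part of the SDK): keep only ACTIVE items.
// Only the `status` field is inspected, so a structural type suffices.
interface HasStatus {
  status: "ACTIVE" | "ARCHIVED";
}

function activeItems<T extends HasStatus>(items: T[]): T[] {
  return items.filter(item => item.status === "ACTIVE");
}

// Sample objects shaped like DatasetItem:
const sample = [
  { id: "item-1", status: "ACTIVE" as const },
  { id: "item-2", status: "ARCHIVED" as const },
];

console.log(activeItems(sample).map(i => i.id)); // only "item-1" remains
```

The same helper works directly on `dataset.items`, since those objects are a superset of `HasStatus`.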

### LinkDatasetItemFunction

Function type for linking dataset items to OpenTelemetry spans for tracking experiments.

```typescript { .api }
/**
 * Links dataset items to OpenTelemetry spans
 *
 * Creates a connection between a dataset item and a trace/observation,
 * enabling tracking of which dataset items were used in which experiments.
 * This is essential for creating dataset runs and tracking experiment lineage.
 *
 * @param obj - Object containing the OpenTelemetry span
 * @param obj.otelSpan - The OpenTelemetry span from a Langfuse observation
 * @param runName - Name of the experiment run for grouping related items
 * @param runArgs - Optional configuration for the dataset run
 * @returns Promise resolving to the created dataset run item
 *
 * @public
 */
type LinkDatasetItemFunction = (
  obj: { otelSpan: Span },
  runName: string,
  runArgs?: {
    /** Description of the dataset run */
    description?: string;

    /** Additional metadata for the dataset run */
    metadata?: any;
  }
) => Promise<DatasetRunItem>;
```

### DatasetRunItem

Result of linking a dataset item to a trace execution.

```typescript { .api }
/**
 * Linked dataset run item
 *
 * Represents the connection between a dataset item and a specific
 * trace execution within a dataset run. Used for tracking experiment results.
 *
 * @public
 */
interface DatasetRunItem {
  /** Unique identifier for the run item */
  id: string;

  /** ID of the dataset run this item belongs to */
  datasetRunId: string;

  /** Name of the dataset run this item belongs to */
  datasetRunName: string;

  /** ID of the dataset item */
  datasetItemId: string;

  /** ID of the trace this run item is linked to */
  traceId: string;

  /** Optional ID of the observation this run item is linked to */
  observationId?: string;

  /** Timestamp when the run item was created */
  createdAt: string;

  /** Timestamp when the run item was last updated */
  updatedAt: string;
}
```

### RunExperimentOnDataset

Function type for running experiments directly on fetched datasets.

```typescript { .api }
/**
 * Runs experiments on Langfuse datasets
 *
 * This function type is attached to fetched datasets to enable convenient
 * experiment execution. The data parameter is automatically provided from
 * the dataset items.
 *
 * @param params - Experiment parameters (excluding data)
 * @returns Promise resolving to experiment results
 *
 * @public
 */
type RunExperimentOnDataset = (
  params: Omit<ExperimentParams<any, any, Record<string, any>>, "data">
) => Promise<ExperimentResult<any, any, Record<string, any>>>;
```

## Usage Patterns

### Basic Dataset Retrieval and Exploration

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Fetch dataset
const dataset = await langfuse.dataset.get("customer-support-qa");

console.log(`Dataset: ${dataset.name}`);
console.log(`Total items: ${dataset.items.length}`);

// Explore items
dataset.items.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('  Input:', item.input);
  console.log('  Expected:', item.expectedOutput);

  if (item.metadata) {
    console.log('  Metadata:', item.metadata);
  }
});
```

### Linking Dataset Items to Traces

Link dataset items to trace executions to create dataset runs and track experiment results.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("qa-benchmark");
const runName = "gpt-4-evaluation-v1";

// Process each item and link to traces
for (const item of dataset.items) {
  // Create a trace for this execution
  const span = startObservation("qa-task", {
    input: item.input,
    metadata: { datasetItemId: item.id }
  });

  try {
    // Execute your task
    const output = await runYourTask(item.input);

    // Update trace with output
    span.update({ output });

    // Link dataset item to this trace
    await item.link(span, runName);
  } catch (error) {
    // Handle errors
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    // Still link the item (to track failures)
    await item.link(span, runName);
  } finally {
    span.end();
  }
}

console.log(`Completed dataset run: ${runName}`);
```

### Linking with Run Metadata

Add descriptions and metadata to dataset runs for better organization.

```typescript
const dataset = await langfuse.dataset.get("model-comparison");
const runName = "claude-3-opus-eval";

for (const item of dataset.items) {
  const span = startObservation("evaluation-task", {
    input: item.input
  });

  const output = await evaluateWithClaude(item.input);
  span.update({ output });
  span.end();

  // Link with descriptive metadata
  await item.link(span, runName, {
    description: "Claude 3 Opus evaluation on reasoning tasks",
    metadata: {
      modelVersion: "claude-3-opus-20240229",
      temperature: 0.7,
      maxTokens: 1000,
      timestamp: new Date().toISOString(),
      experimentGroup: "reasoning-tasks"
    }
  });
}
```

405

406

### Linking Nested Observations

407

408

Link dataset items to specific observations within a trace hierarchy.

409

410

```typescript

411

const dataset = await langfuse.dataset.get("translation-dataset");

412

const runName = "translation-pipeline-v2";

413

414

for (const item of dataset.items) {

415

// Create parent trace

416

const trace = startObservation("translation-pipeline", {

417

input: item.input

418

});

419

420

// Create preprocessing observation

421

const preprocessor = trace.startObservation("preprocessing", {

422

input: item.input

423

});

424

const preprocessed = await preprocess(item.input);

425

preprocessor.update({ output: preprocessed });

426

preprocessor.end();

427

428

// Create translation observation (the main task)

429

const translator = trace.startObservation("translation", {

430

input: preprocessed,

431

model: "gpt-4"

432

}, { asType: "generation" });

433

434

const translated = await translate(preprocessed);

435

translator.update({ output: translated });

436

translator.end();

437

438

// Create postprocessing observation

439

const postprocessor = trace.startObservation("postprocessing", {

440

input: translated

441

});

442

const final = await postprocess(translated);

443

postprocessor.update({ output: final });

444

postprocessor.end();

445

446

trace.update({ output: final });

447

trace.end();

448

449

// Link to the specific translation observation

450

await item.link({ otelSpan: translator.otelSpan }, runName, {

451

description: "Translation quality evaluation",

452

metadata: { pipeline: "v2", stage: "translation" }

453

});

454

}

455

```

### Running Experiments on Datasets

Execute experiments directly on datasets with automatic tracing and evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { observeOpenAI } from '@langfuse/openai';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("capital-cities");

// Define task
const task = async ({ input }: { input: string }) => {
  const client = observeOpenAI(new OpenAI());

  const response = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "user", content: `What is the capital of ${input}?` }
    ]
  });

  return response.choices[0].message.content;
};

// Define evaluator
const exactMatchEvaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Run experiment
const result = await dataset.runExperiment({
  name: "Capital Cities Evaluation",
  runName: "gpt-4-baseline",
  description: "Baseline evaluation with GPT-4",
  task,
  evaluators: [exactMatchEvaluator],
  maxConcurrency: 5
});

// View results
console.log(await result.format());
console.log(`Dataset run URL: ${result.datasetRunUrl}`);
```

505

506

### Advanced Experiment with Multiple Evaluators

507

508

```typescript

509

import { LangfuseClient, Evaluator } from '@langfuse/client';

510

import { createEvaluatorFromAutoevals } from '@langfuse/client';

511

import { Levenshtein, Factuality } from 'autoevals';

512

513

const langfuse = new LangfuseClient();

514

const dataset = await langfuse.dataset.get("qa-dataset");

515

516

// Custom evaluator using OpenAI

517

const semanticSimilarityEvaluator: Evaluator = async ({

518

output,

519

expectedOutput

520

}) => {

521

const openai = new OpenAI();

522

523

const response = await openai.chat.completions.create({

524

model: "gpt-4",

525

messages: [

526

{

527

role: "user",

528

content: `Rate the semantic similarity between these two answers on a scale of 0 to 1:

529

530

Answer 1: ${output}

531

Answer 2: ${expectedOutput}

532

533

Respond with just a number between 0 and 1.`

534

}

535

]

536

});

537

538

const score = parseFloat(response.choices[0].message.content || "0");

539

540

return {

541

name: "semantic_similarity",

542

value: score,

543

comment: `Comparison between output and expected output`

544

};

545

};

546

547

// Run experiment with multiple evaluators

548

const result = await dataset.runExperiment({

549

name: "Multi-Evaluator Experiment",

550

runName: "comprehensive-eval-v1",

551

task: myTask,

552

evaluators: [

553

// AutoEvals evaluators

554

createEvaluatorFromAutoevals(Levenshtein),

555

createEvaluatorFromAutoevals(Factuality),

556

557

// Custom evaluator

558

semanticSimilarityEvaluator

559

]

560

});

561

562

// Analyze results

563

console.log(await result.format({ includeItemResults: true }));

564

565

// Access individual scores

566

result.itemResults.forEach((item, index) => {

567

console.log(`\nItem ${index + 1}:`);

568

console.log('Input:', item.input);

569

console.log('Output:', item.output);

570

console.log('Expected:', item.expectedOutput);

571

console.log('Evaluations:');

572

573

item.evaluations.forEach(evaluation => {

574

console.log(` ${evaluation.name}: ${evaluation.value}`);

575

if (evaluation.comment) {

576

console.log(` Comment: ${evaluation.comment}`);

577

}

578

});

579

});

580

```

### Experiment with Run-Level Evaluators

Use run-level evaluators to compute aggregate statistics across all items.

```typescript
import { LangfuseClient, RunEvaluator } from '@langfuse/client';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("benchmark-dataset");

// Define a run-level evaluator for computing averages
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(result => result.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy across ${scores.length} items`
  };
};

// Run experiment (accuracyEvaluator is an item-level evaluator,
// defined elsewhere, that returns scores named "accuracy")
const result = await dataset.runExperiment({
  name: "Accuracy Benchmark",
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageScoreEvaluator]
});

// Check aggregate results
console.log('Run-level evaluations:');
result.runEvaluations.forEach(evaluation => {
  console.log(`${evaluation.name}: ${evaluation.value}`);
  if (evaluation.comment) {
    console.log(`  ${evaluation.comment}`);
  }
});
```

625

626

### Comparing Multiple Models

627

628

Run experiments on the same dataset with different models for comparison.

629

630

```typescript

631

import { LangfuseClient } from '@langfuse/client';

632

import OpenAI from 'openai';

633

634

const langfuse = new LangfuseClient();

635

const dataset = await langfuse.dataset.get("reasoning-tasks");

636

637

const openai = new OpenAI();

638

639

// Define models to compare

640

const models = [

641

"gpt-4",

642

"gpt-3.5-turbo",

643

"gpt-4-turbo-preview"

644

];

645

646

const evaluator = async ({ output, expectedOutput }) => ({

647

name: "correctness",

648

value: evaluateCorrectness(output, expectedOutput)

649

});

650

651

// Run experiment for each model

652

const results = [];

653

654

for (const model of models) {

655

const result = await dataset.runExperiment({

656

name: "Model Comparison",

657

runName: `${model}-evaluation`,

658

description: `Evaluation with ${model}`,

659

metadata: { model },

660

task: async ({ input }) => {

661

const response = await openai.chat.completions.create({

662

model,

663

messages: [{ role: "user", content: input }]

664

});

665

return response.choices[0].message.content;

666

},

667

evaluators: [evaluator],

668

maxConcurrency: 3

669

});

670

671

results.push({ model, result });

672

console.log(`Completed: ${model}`);

673

console.log(await result.format());

674

}

675

676

// Compare results

677

console.log("\n=== Model Comparison Summary ===");

678

results.forEach(({ model, result }) => {

679

const avgScore = result.itemResults

680

.flatMap(r => r.evaluations)

681

.reduce((sum, e) => sum + (e.value as number), 0) / result.itemResults.length;

682

683

console.log(`${model}: ${avgScore.toFixed(3)}`);

684

console.log(` URL: ${result.datasetRunUrl}`);

685

});

686

```

### Incremental Dataset Processing

Process datasets incrementally with checkpointing for long-running experiments.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import * as fs from 'fs';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("large-dataset");
const runName = "incremental-processing-v1";

// Load checkpoint if it exists
const checkpointFile = './checkpoint.json';
let processedIds = new Set<string>();

if (fs.existsSync(checkpointFile)) {
  const checkpoint = JSON.parse(fs.readFileSync(checkpointFile, 'utf-8'));
  processedIds = new Set(checkpoint.processedIds);
  console.log(`Resuming from checkpoint: ${processedIds.size} items processed`);
}

// Process items
for (const [index, item] of dataset.items.entries()) {
  // Skip already processed items
  if (processedIds.has(item.id)) {
    continue;
  }

  console.log(`Processing item ${index + 1}/${dataset.items.length}`);

  try {
    const span = startObservation("processing-task", {
      input: item.input,
      metadata: { itemId: item.id }
    });

    const output = await processItem(item.input);
    span.update({ output });
    span.end();

    await item.link(span, runName, {
      metadata: { batchIndex: Math.floor(index / 100) }
    });

    // Update checkpoint
    processedIds.add(item.id);
    fs.writeFileSync(
      checkpointFile,
      JSON.stringify({ processedIds: Array.from(processedIds) })
    );
  } catch (error) {
    console.error(`Error processing item ${item.id}:`, error);
    // Continue with next item
  }
}

console.log(`Completed processing ${processedIds.size} items`);

// Clean up checkpoint
fs.unlinkSync(checkpointFile);
```

### Parallel Processing with Concurrency Control

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import pLimit from 'p-limit';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("parallel-dataset");
const runName = "parallel-processing-v1";

// Limit concurrent operations
const limit = pLimit(10);

// Process items in parallel with concurrency limit
const tasks = dataset.items.map(item =>
  limit(async () => {
    const span = startObservation("parallel-task", {
      input: item.input
    });

    try {
      const output = await processItem(item.input);
      span.update({ output });

      await item.link(span, runName);

      return { success: true, itemId: item.id };
    } catch (error) {
      span.update({
        output: { error: String(error) },
        level: "ERROR"
      });

      await item.link(span, runName);

      return { success: false, itemId: item.id, error };
    } finally {
      span.end();
    }
  })
);

// Wait for all tasks to complete
const results = await Promise.all(tasks);

// Summarize results
const successful = results.filter(r => r.success).length;
const failed = results.filter(r => !r.success).length;

console.log(`Completed: ${successful} successful, ${failed} failed`);
```

### Integration with LangChain

Use datasets with LangChain applications for systematic evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("langchain-eval");

// Create LangChain components
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to French: {text}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);

const runName = "langchain-translation-eval";

// Process each dataset item
for (const item of dataset.items) {
  // Create trace for this execution
  const span = startObservation("langchain-execution", {
    input: { text: item.input },
    metadata: { chainType: "translation" }
  });

  try {
    // Execute chain
    const result = await chain.invoke({ text: item.input });

    // Update trace with output
    span.update({ output: result });

    // Link dataset item
    await item.link(span, runName, {
      description: "LangChain translation evaluation"
    });

    // Score the result
    langfuse.score.observation(span, {
      name: "translation_quality",
      value: computeQuality(result, item.expectedOutput)
    });
  } catch (error) {
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    await item.link(span, runName);
  }

  span.end();
}

// Flush scores
await langfuse.flush();
```

### Using Dataset Experiments with Custom Data Structures

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset with structured inputs
const dataset = await langfuse.dataset.get("structured-qa");

// Task that handles structured input
const task = async ({ input }) => {
  // Input is an object with specific structure
  const { question, context } = input;

  const response = await callLLM({
    systemPrompt: "Answer questions based on the context.",
    userPrompt: `Context: ${context}\n\nQuestion: ${question}`
  });

  return response;
};

// Evaluator that handles structured output
const evaluator = async ({ input, output, expectedOutput }) => {
  const { question } = input;

  // Complex evaluation logic
  const scores = {
    accuracy: evaluateAccuracy(output, expectedOutput),
    relevance: evaluateRelevance(output, question),
    completeness: evaluateCompleteness(output, expectedOutput)
  };

  // Return multiple evaluations
  return [
    { name: "accuracy", value: scores.accuracy },
    { name: "relevance", value: scores.relevance },
    { name: "completeness", value: scores.completeness },
    {
      name: "overall",
      value: (scores.accuracy + scores.relevance + scores.completeness) / 3,
      metadata: { breakdown: scores }
    }
  ];
};

// Run experiment
const result = await dataset.runExperiment({
  name: "Structured QA Evaluation",
  task,
  evaluators: [evaluator]
});

console.log(await result.format({ includeItemResults: true }));
```

## Best Practices

### Dataset Organization

- **Use descriptive names**: Name datasets clearly to indicate their purpose (e.g., "customer-support-qa-v2", "translation-benchmark-2024")
- **Add metadata**: Include relevant context in dataset and item metadata for filtering and analysis
- **Version datasets**: Create new dataset versions when making significant changes rather than modifying existing ones
- **Document expected outputs**: Always provide expected outputs when available to enable automatic evaluation

### Linking Strategy

- **Consistent run names**: Use consistent naming conventions for dataset runs (e.g., "model-name-YYYY-MM-DD-version")
- **Add descriptions**: Include run descriptions to document the purpose and configuration of each evaluation
- **Use metadata**: Attach relevant metadata (model versions, hyperparameters, etc.) to enable comparison and filtering
- **Link to specific observations**: When evaluating specific steps in a pipeline, link to the relevant observation rather than the root trace
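
A consistent naming convention is easiest to enforce with a small helper. A sketch of the "model-name-YYYY-MM-DD-version" pattern — `buildRunName` is a hypothetical utility, not part of the SDK:

```typescript
// Hypothetical helper (not part of the Langfuse SDK) encoding the
// "model-name-YYYY-MM-DD-version" run-naming convention.
function buildRunName(model: string, version: string, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  return `${model}-${day}-${version}`;
}

// buildRunName("gpt-4", "v1", new Date("2024-05-01")) → "gpt-4-2024-05-01-v1"
```

Centralizing the format this way keeps runs sortable and comparable across experiments.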

### Performance Optimization

- **Adjust page size**: For large datasets, tune `fetchItemsPageSize` based on your network and memory constraints
- **Control concurrency**: Use `maxConcurrency` in experiments to avoid overwhelming APIs or resources
- **Batch processing**: Process large datasets in batches with checkpointing for resilience
- **Parallel execution**: Use parallel processing with concurrency limits for faster evaluation

### Experiment Design

- **Start simple**: Begin with basic evaluators and add complexity as needed
- **Use multiple evaluators**: Combine different evaluation approaches (exact match, semantic similarity, factuality, etc.)
- **Include run-level evaluators**: Compute aggregate statistics to understand overall performance
- **Track metadata**: Include model versions, timestamps, and configuration in experiment metadata
- **Version experiments**: Use versioned run names to track experiment iterations

### Error Handling

- **Handle failures gracefully**: Catch errors during task execution and still link items to track failures
- **Set appropriate timeouts**: Configure reasonable timeouts to prevent hanging on slow operations
- **Log errors**: Record error details in trace metadata for debugging
- **Continue on failure**: Design experiments to continue processing remaining items even if some fail

966

967

### Cost Management

968

969

- **Control concurrency**: Limit concurrent API calls to manage rate limits and costs

970

- **Cache results**: Store experiment results to avoid re-running expensive evaluations

971

- **Sample testing**: Test on a subset of items before running full evaluations

972

- **Monitor usage**: Track token usage and API calls through Langfuse traces
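
Sample testing can be as simple as taking an evenly spaced subset of items for a cheap trial run. A sketch — `sampleItems` is an illustrative helper, not an SDK function; in practice you would pass `dataset.items`:

```typescript
// Illustrative helper: pick n evenly spaced items for a trial run
// before committing to a full (and more expensive) evaluation.
function sampleItems<T>(items: T[], n: number): T[] {
  if (n >= items.length) return items;
  const step = items.length / n;
  return Array.from({ length: n }, (_, i) => items[Math.floor(i * step)]);
}

// sampleItems([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3) → [1, 4, 7]
```

Even spacing keeps the trial subset representative of the whole dataset rather than biased toward its beginning.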

## Integration with Experiments

Datasets integrate seamlessly with the experiment system. For detailed information about experiment execution, evaluators, and result analysis, see the [Experiment Management documentation](./experiments.md).

### Key Integration Points

- **Automatic tracing**: Experiments on datasets automatically create traces and link them to dataset runs
- **Dataset run tracking**: All experiment executions on datasets are tracked as dataset runs in Langfuse
- **Result visualization**: Dataset run results are available in the Langfuse UI with detailed analytics
- **Comparison tools**: Compare multiple dataset runs to track improvements over time

## Related APIs

- **[Experiment Management](./experiments.md)**: Run experiments with tasks and evaluators
- **[Tracing](./tracing.md)**: Create and manage traces and observations
- **[Score Management](./scores.md)**: Add scores to traces and observations
- **[Client](./client.md)**: Initialize and configure the Langfuse client