
# Dataset Operations

The Dataset Operations system provides comprehensive capabilities for working with evaluation datasets, linking them to traces and observations, and running experiments. Datasets are collections of input-output pairs used for systematic evaluation of LLM applications.

## Capabilities

### Get Dataset

Retrieve a dataset by name with all its items, link functions, and experiment functionality.

```typescript { .api }
/**
 * Retrieves a dataset by name with all its items and experiment functionality
 *
 * Fetches a dataset and all its associated items with automatic pagination handling.
 * The returned dataset includes enhanced functionality for linking items to traces
 * and running experiments directly on the dataset.
 *
 * @param name - The name of the dataset to retrieve
 * @param options - Optional configuration for data fetching
 * @returns Promise resolving to enhanced dataset with items and experiment capabilities
 */
async get(
  name: string,
  options?: {
    /** Number of items to fetch per page (default: 50) */
    fetchItemsPageSize?: number;
  }
): Promise<FetchedDataset>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Basic dataset retrieval
const dataset = await langfuse.dataset.get("my-evaluation-dataset");

console.log(`Dataset: ${dataset.name}`);
console.log(`Description: ${dataset.description}`);
console.log(`Items: ${dataset.items.length}`);
console.log(`Metadata:`, dataset.metadata);

// Access dataset items
for (const item of dataset.items) {
  console.log('Input:', item.input);
  console.log('Expected Output:', item.expectedOutput);
  console.log('Metadata:', item.metadata);
}
```

**Handling Large Datasets:**

```typescript
// For large datasets, a larger page size reduces the number of fetch requests
const largeDataset = await langfuse.dataset.get(
  "large-benchmark-dataset",
  { fetchItemsPageSize: 100 }
);

console.log(`Loaded ${largeDataset.items.length} items`);

// Process items in batches
const batchSize = 10;
for (let i = 0; i < largeDataset.items.length; i += batchSize) {
  const batch = largeDataset.items.slice(i, i + batchSize);
  // Process batch...
}
```

**Accessing Dataset Properties:**

```typescript
const dataset = await langfuse.dataset.get("qa-dataset");

// Dataset metadata
console.log(dataset.id);          // Dataset ID
console.log(dataset.name);        // Dataset name
console.log(dataset.description); // Description
console.log(dataset.metadata);    // Custom metadata
console.log(dataset.projectId);   // Project ID
console.log(dataset.createdAt);   // Creation timestamp
console.log(dataset.updatedAt);   // Last update timestamp

// Item properties
const item = dataset.items[0];
console.log(item.id);                  // Item ID
console.log(item.datasetId);           // Parent dataset ID
console.log(item.input);               // Input data
console.log(item.expectedOutput);      // Expected output
console.log(item.metadata);            // Item metadata
console.log(item.sourceTraceId);       // Source trace (if any)
console.log(item.sourceObservationId); // Source observation (if any)
console.log(item.status);              // Status (ACTIVE or ARCHIVED)
```

## Types

### FetchedDataset

Enhanced dataset object with additional methods for linking and experiments.

```typescript { .api }
/**
 * Enhanced dataset with linking and experiment functionality
 *
 * Extends the base Dataset type with:
 * - Array of items with link functions for connecting to traces
 * - runExperiment method for executing experiments directly on the dataset
 *
 * @public
 */
type FetchedDataset = Dataset & {
  /** Dataset items with link functionality for connecting to traces */
  items: (DatasetItem & { link: LinkDatasetItemFunction })[];

  /** Function to run experiments directly on this dataset */
  runExperiment: RunExperimentOnDataset;
};
```

**Properties from Dataset:**

```typescript { .api }
interface Dataset {
  /** Unique identifier for the dataset */
  id: string;

  /** Human-readable name for the dataset */
  name: string;

  /** Optional description explaining the dataset's purpose */
  description?: string | null;

  /** Custom metadata attached to the dataset */
  metadata?: Record<string, any> | null;

  /** Project ID this dataset belongs to */
  projectId: string;

  /** Timestamp when the dataset was created */
  createdAt: string;

  /** Timestamp when the dataset was last updated */
  updatedAt: string;
}
```

### DatasetItem

Individual item within a dataset containing input, expected output, and metadata.

```typescript { .api }
/**
 * Dataset item with input/output pair for evaluation
 *
 * Represents a single test case within a dataset. Each item can contain
 * any type of input and expected output, along with optional metadata
 * and linkage to source traces/observations.
 *
 * @public
 */
interface DatasetItem {
  /** Unique identifier for the dataset item */
  id: string;

  /** ID of the parent dataset */
  datasetId: string;

  /** Name of the parent dataset */
  datasetName: string;

  /** Input data (can be any type: string, object, array, etc.) */
  input?: any;

  /** Expected output for evaluation (can be any type) */
  expectedOutput?: any;

  /** Custom metadata for this item */
  metadata?: Record<string, any> | null;

  /** ID of the trace this item was created from (if applicable) */
  sourceTraceId?: string | null;

  /** ID of the observation this item was created from (if applicable) */
  sourceObservationId?: string | null;

  /** Status of the item (ACTIVE or ARCHIVED) */
  status: "ACTIVE" | "ARCHIVED";

  /** Timestamp when the item was created */
  createdAt: string;

  /** Timestamp when the item was last updated */
  updatedAt: string;
}
```
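
Because each item carries a `status` field, a common pattern is to filter down to `ACTIVE` items before processing. A minimal sketch — the `activeItems` helper and the sample objects are illustrative, not part of the SDK:

```typescript
// Illustrative helper (not part of the SDK): keep only ACTIVE items.
// Only the `status` field is inspected, so a structural type suffices.
interface HasStatus {
  status: "ACTIVE" | "ARCHIVED";
}

function activeItems<T extends HasStatus>(items: T[]): T[] {
  return items.filter(item => item.status === "ACTIVE");
}

// Sample objects shaped like DatasetItem:
const sample = [
  { id: "item-1", status: "ACTIVE" as const },
  { id: "item-2", status: "ARCHIVED" as const },
];

console.log(activeItems(sample).map(i => i.id)); // only "item-1" remains
```

The same helper works directly on `dataset.items`, since those objects are a superset of `HasStatus`.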

### LinkDatasetItemFunction

Function type for linking dataset items to OpenTelemetry spans for tracking experiments.

```typescript { .api }
/**
 * Links dataset items to OpenTelemetry spans
 *
 * Creates a connection between a dataset item and a trace/observation,
 * enabling tracking of which dataset items were used in which experiments.
 * This is essential for creating dataset runs and tracking experiment lineage.
 *
 * @param obj - Object containing the OpenTelemetry span
 * @param obj.otelSpan - The OpenTelemetry span from a Langfuse observation
 * @param runName - Name of the experiment run for grouping related items
 * @param runArgs - Optional configuration for the dataset run
 * @returns Promise resolving to the created dataset run item
 *
 * @public
 */
type LinkDatasetItemFunction = (
  obj: { otelSpan: Span },
  runName: string,
  runArgs?: {
    /** Description of the dataset run */
    description?: string;

    /** Additional metadata for the dataset run */
    metadata?: any;
  }
) => Promise<DatasetRunItem>;
```

### DatasetRunItem

Result of linking a dataset item to a trace execution.

```typescript { .api }
/**
 * Linked dataset run item
 *
 * Represents the connection between a dataset item and a specific
 * trace execution within a dataset run. Used for tracking experiment results.
 *
 * @public
 */
interface DatasetRunItem {
  /** Unique identifier for the run item */
  id: string;

  /** ID of the dataset run this item belongs to */
  datasetRunId: string;

  /** Name of the dataset run this item belongs to */
  datasetRunName: string;

  /** ID of the dataset item */
  datasetItemId: string;

  /** ID of the trace this run item is linked to */
  traceId: string;

  /** Optional ID of the observation this run item is linked to */
  observationId?: string;

  /** Timestamp when the run item was created */
  createdAt: string;

  /** Timestamp when the run item was last updated */
  updatedAt: string;
}
```

### RunExperimentOnDataset

Function type for running experiments directly on fetched datasets.

```typescript { .api }
/**
 * Runs experiments on Langfuse datasets
 *
 * This function type is attached to fetched datasets to enable convenient
 * experiment execution. The data parameter is automatically provided from
 * the dataset items.
 *
 * @param params - Experiment parameters (excluding data)
 * @returns Promise resolving to experiment results
 *
 * @public
 */
type RunExperimentOnDataset = (
  params: Omit<ExperimentParams<any, any, Record<string, any>>, "data">
) => Promise<ExperimentResult<any, any, Record<string, any>>>;
```

## Usage Patterns

### Basic Dataset Retrieval and Exploration

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Fetch dataset
const dataset = await langfuse.dataset.get("customer-support-qa");

console.log(`Dataset: ${dataset.name}`);
console.log(`Total items: ${dataset.items.length}`);

// Explore items
dataset.items.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('  Input:', item.input);
  console.log('  Expected:', item.expectedOutput);

  if (item.metadata) {
    console.log('  Metadata:', item.metadata);
  }
});
```

### Linking Dataset Items to Traces

Link dataset items to trace executions to create dataset runs and track experiment results.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("qa-benchmark");
const runName = "gpt-4-evaluation-v1";

// Process each item and link to traces
for (const item of dataset.items) {
  // Create a trace for this execution
  const span = startObservation("qa-task", {
    input: item.input,
    metadata: { datasetItemId: item.id }
  });

  try {
    // Execute your task
    const output = await runYourTask(item.input);

    // Update trace with output
    span.update({ output });

    // Link dataset item to this trace
    await item.link(span, runName);
  } catch (error) {
    // Handle errors
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    // Still link the item (to track failures)
    await item.link(span, runName);
  } finally {
    span.end();
  }
}

console.log(`Completed dataset run: ${runName}`);
```

### Linking with Run Metadata

Add descriptions and metadata to dataset runs for better organization.

```typescript
const dataset = await langfuse.dataset.get("model-comparison");
const runName = "claude-3-opus-eval";

for (const item of dataset.items) {
  const span = startObservation("evaluation-task", {
    input: item.input
  });

  const output = await evaluateWithClaude(item.input);
  span.update({ output });
  span.end();

  // Link with descriptive metadata
  await item.link(span, runName, {
    description: "Claude 3 Opus evaluation on reasoning tasks",
    metadata: {
      modelVersion: "claude-3-opus-20240229",
      temperature: 0.7,
      maxTokens: 1000,
      timestamp: new Date().toISOString(),
      experimentGroup: "reasoning-tasks"
    }
  });
}
```

405

406

### Linking Nested Observations

407

408

Link dataset items to specific observations within a trace hierarchy.

409

410

```typescript

411

const dataset = await langfuse.dataset.get("translation-dataset");

412

const runName = "translation-pipeline-v2";

413

414

for (const item of dataset.items) {

415

// Create parent trace

416

const trace = startObservation("translation-pipeline", {

417

input: item.input

418

});

419

420

// Create preprocessing observation

421

const preprocessor = trace.startObservation("preprocessing", {

422

input: item.input

423

});

424

const preprocessed = await preprocess(item.input);

425

preprocessor.update({ output: preprocessed });

426

preprocessor.end();

427

428

// Create translation observation (the main task)

429

const translator = trace.startObservation("translation", {

430

input: preprocessed,

431

model: "gpt-4"

432

}, { asType: "generation" });

433

434

const translated = await translate(preprocessed);

435

translator.update({ output: translated });

436

translator.end();

437

438

// Create postprocessing observation

439

const postprocessor = trace.startObservation("postprocessing", {

440

input: translated

441

});

442

const final = await postprocess(translated);

443

postprocessor.update({ output: final });

444

postprocessor.end();

445

446

trace.update({ output: final });

447

trace.end();

448

449

// Link to the specific translation observation

450

await item.link({ otelSpan: translator.otelSpan }, runName, {

451

description: "Translation quality evaluation",

452

metadata: { pipeline: "v2", stage: "translation" }

453

});

454

}

455

```

### Running Experiments on Datasets

Execute experiments directly on datasets with automatic tracing and evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { observeOpenAI } from '@langfuse/openai';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("capital-cities");

// Define task
const task = async ({ input }: { input: string }) => {
  const client = observeOpenAI(new OpenAI());

  const response = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "user", content: `What is the capital of ${input}?` }
    ]
  });

  return response.choices[0].message.content;
};

// Define evaluator
const exactMatchEvaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Run experiment
const result = await dataset.runExperiment({
  name: "Capital Cities Evaluation",
  runName: "gpt-4-baseline",
  description: "Baseline evaluation with GPT-4",
  task,
  evaluators: [exactMatchEvaluator],
  maxConcurrency: 5
});

// View results
console.log(await result.format());
console.log(`Dataset run URL: ${result.datasetRunUrl}`);
```

505

506

### Advanced Experiment with Multiple Evaluators

507

508

```typescript

509

import { LangfuseClient, Evaluator } from '@langfuse/client';

510

import { createEvaluatorFromAutoevals } from '@langfuse/client';

511

import { Levenshtein, Factuality } from 'autoevals';

512

513

const langfuse = new LangfuseClient();

514

const dataset = await langfuse.dataset.get("qa-dataset");

515

516

// Custom evaluator using OpenAI

517

const semanticSimilarityEvaluator: Evaluator = async ({

518

output,

519

expectedOutput

520

}) => {

521

const openai = new OpenAI();

522

523

const response = await openai.chat.completions.create({

524

model: "gpt-4",

525

messages: [

526

{

527

role: "user",

528

content: `Rate the semantic similarity between these two answers on a scale of 0 to 1:

529

530

Answer 1: ${output}

531

Answer 2: ${expectedOutput}

532

533

Respond with just a number between 0 and 1.`

534

}

535

]

536

});

537

538

const score = parseFloat(response.choices[0].message.content || "0");

539

540

return {

541

name: "semantic_similarity",

542

value: score,

543

comment: `Comparison between output and expected output`

544

};

545

};

546

547

// Run experiment with multiple evaluators

548

const result = await dataset.runExperiment({

549

name: "Multi-Evaluator Experiment",

550

runName: "comprehensive-eval-v1",

551

task: myTask,

552

evaluators: [

553

// AutoEvals evaluators

554

createEvaluatorFromAutoevals(Levenshtein),

555

createEvaluatorFromAutoevals(Factuality),

556

557

// Custom evaluator

558

semanticSimilarityEvaluator

559

]

560

});

561

562

// Analyze results

563

console.log(await result.format({ includeItemResults: true }));

564

565

// Access individual scores

566

result.itemResults.forEach((item, index) => {

567

console.log(`\nItem ${index + 1}:`);

568

console.log('Input:', item.input);

569

console.log('Output:', item.output);

570

console.log('Expected:', item.expectedOutput);

571

console.log('Evaluations:');

572

573

item.evaluations.forEach(evaluation => {

574

console.log(` ${evaluation.name}: ${evaluation.value}`);

575

if (evaluation.comment) {

576

console.log(` Comment: ${evaluation.comment}`);

577

}

578

});

579

});

580

```

### Experiment with Run-Level Evaluators

Use run-level evaluators to compute aggregate statistics across all items.

```typescript
import { LangfuseClient, RunEvaluator } from '@langfuse/client';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("benchmark-dataset");

// Define a run-level evaluator for computing averages
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(result => result.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy across ${scores.length} items`
  };
};

// Run experiment (accuracyEvaluator is an item-level evaluator,
// defined elsewhere, that returns scores named "accuracy")
const result = await dataset.runExperiment({
  name: "Accuracy Benchmark",
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageScoreEvaluator]
});

// Check aggregate results
console.log('Run-level evaluations:');
result.runEvaluations.forEach(evaluation => {
  console.log(`${evaluation.name}: ${evaluation.value}`);
  if (evaluation.comment) {
    console.log(`  ${evaluation.comment}`);
  }
});
```

625

626

### Comparing Multiple Models

627

628

Run experiments on the same dataset with different models for comparison.

629

630

```typescript

631

import { LangfuseClient } from '@langfuse/client';

632

import OpenAI from 'openai';

633

634

const langfuse = new LangfuseClient();

635

const dataset = await langfuse.dataset.get("reasoning-tasks");

636

637

const openai = new OpenAI();

638

639

// Define models to compare

640

const models = [

641

"gpt-4",

642

"gpt-3.5-turbo",

643

"gpt-4-turbo-preview"

644

];

645

646

const evaluator = async ({ output, expectedOutput }) => ({

647

name: "correctness",

648

value: evaluateCorrectness(output, expectedOutput)

649

});

650

651

// Run experiment for each model

652

const results = [];

653

654

for (const model of models) {

655

const result = await dataset.runExperiment({

656

name: "Model Comparison",

657

runName: `${model}-evaluation`,

658

description: `Evaluation with ${model}`,

659

metadata: { model },

660

task: async ({ input }) => {

661

const response = await openai.chat.completions.create({

662

model,

663

messages: [{ role: "user", content: input }]

664

});

665

return response.choices[0].message.content;

666

},

667

evaluators: [evaluator],

668

maxConcurrency: 3

669

});

670

671

results.push({ model, result });

672

console.log(`Completed: ${model}`);

673

console.log(await result.format());

674

}

675

676

// Compare results

677

console.log("\n=== Model Comparison Summary ===");

678

results.forEach(({ model, result }) => {

679

const avgScore = result.itemResults

680

.flatMap(r => r.evaluations)

681

.reduce((sum, e) => sum + (e.value as number), 0) / result.itemResults.length;

682

683

console.log(`${model}: ${avgScore.toFixed(3)}`);

684

console.log(` URL: ${result.datasetRunUrl}`);

685

});

686

```

### Incremental Dataset Processing

Process datasets incrementally with checkpointing for long-running experiments.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import * as fs from 'fs';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("large-dataset");
const runName = "incremental-processing-v1";

// Load checkpoint if it exists
const checkpointFile = './checkpoint.json';
let processedIds = new Set<string>();

if (fs.existsSync(checkpointFile)) {
  const checkpoint = JSON.parse(fs.readFileSync(checkpointFile, 'utf-8'));
  processedIds = new Set(checkpoint.processedIds);
  console.log(`Resuming from checkpoint: ${processedIds.size} items processed`);
}

// Process items
for (const [index, item] of dataset.items.entries()) {
  // Skip already processed items
  if (processedIds.has(item.id)) {
    continue;
  }

  console.log(`Processing item ${index + 1}/${dataset.items.length}`);

  try {
    const span = startObservation("processing-task", {
      input: item.input,
      metadata: { itemId: item.id }
    });

    const output = await processItem(item.input);
    span.update({ output });
    span.end();

    await item.link(span, runName, {
      metadata: { batchIndex: Math.floor(index / 100) }
    });

    // Update checkpoint
    processedIds.add(item.id);
    fs.writeFileSync(
      checkpointFile,
      JSON.stringify({ processedIds: Array.from(processedIds) })
    );
  } catch (error) {
    console.error(`Error processing item ${item.id}:`, error);
    // Continue with next item
  }
}

console.log(`Completed processing ${processedIds.size} items`);

// Clean up checkpoint
fs.unlinkSync(checkpointFile);
```

### Parallel Processing with Concurrency Control

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import pLimit from 'p-limit';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("parallel-dataset");
const runName = "parallel-processing-v1";

// Limit concurrent operations
const limit = pLimit(10);

// Process items in parallel with concurrency limit
const tasks = dataset.items.map(item =>
  limit(async () => {
    const span = startObservation("parallel-task", {
      input: item.input
    });

    try {
      const output = await processItem(item.input);
      span.update({ output });

      await item.link(span, runName);

      return { success: true, itemId: item.id };
    } catch (error) {
      span.update({
        output: { error: String(error) },
        level: "ERROR"
      });

      await item.link(span, runName);

      return { success: false, itemId: item.id, error };
    } finally {
      span.end();
    }
  })
);

// Wait for all tasks to complete
const results = await Promise.all(tasks);

// Summarize results
const successful = results.filter(r => r.success).length;
const failed = results.filter(r => !r.success).length;

console.log(`Completed: ${successful} successful, ${failed} failed`);
```

### Integration with LangChain

Use datasets with LangChain applications for systematic evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("langchain-eval");

// Create LangChain components
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to French: {text}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);

const runName = "langchain-translation-eval";

// Process each dataset item
for (const item of dataset.items) {
  // Create trace for this execution
  const span = startObservation("langchain-execution", {
    input: { text: item.input },
    metadata: { chainType: "translation" }
  });

  try {
    // Execute chain
    const result = await chain.invoke({ text: item.input });

    // Update trace with output
    span.update({ output: result });

    // Link dataset item
    await item.link(span, runName, {
      description: "LangChain translation evaluation"
    });

    // Score the result
    langfuse.score.observation(span, {
      name: "translation_quality",
      value: computeQuality(result, item.expectedOutput)
    });
  } catch (error) {
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    await item.link(span, runName);
  }

  span.end();
}

// Flush scores
await langfuse.flush();
```

### Using Dataset Experiments with Custom Data Structures

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset with structured inputs
const dataset = await langfuse.dataset.get("structured-qa");

// Task that handles structured input
const task = async ({ input }) => {
  // Input is an object with specific structure
  const { question, context } = input;

  const response = await callLLM({
    systemPrompt: "Answer questions based on the context.",
    userPrompt: `Context: ${context}\n\nQuestion: ${question}`
  });

  return response;
};

// Evaluator that handles structured output
const evaluator = async ({ input, output, expectedOutput }) => {
  const { question } = input;

  // Complex evaluation logic
  const scores = {
    accuracy: evaluateAccuracy(output, expectedOutput),
    relevance: evaluateRelevance(output, question),
    completeness: evaluateCompleteness(output, expectedOutput)
  };

  // Return multiple evaluations
  return [
    { name: "accuracy", value: scores.accuracy },
    { name: "relevance", value: scores.relevance },
    { name: "completeness", value: scores.completeness },
    {
      name: "overall",
      value: (scores.accuracy + scores.relevance + scores.completeness) / 3,
      metadata: { breakdown: scores }
    }
  ];
};

// Run experiment
const result = await dataset.runExperiment({
  name: "Structured QA Evaluation",
  task,
  evaluators: [evaluator]
});

console.log(await result.format({ includeItemResults: true }));
```

## Best Practices

### Dataset Organization

- **Use descriptive names**: Name datasets clearly to indicate their purpose (e.g., "customer-support-qa-v2", "translation-benchmark-2024")
- **Add metadata**: Include relevant context in dataset and item metadata for filtering and analysis
- **Version datasets**: Create new dataset versions when making significant changes rather than modifying existing ones
- **Document expected outputs**: Always provide expected outputs when available to enable automatic evaluation

### Linking Strategy

- **Consistent run names**: Use consistent naming conventions for dataset runs (e.g., "model-name-YYYY-MM-DD-version")
- **Add descriptions**: Include run descriptions to document the purpose and configuration of each evaluation
- **Use metadata**: Attach relevant metadata (model versions, hyperparameters, etc.) to enable comparison and filtering
- **Link to specific observations**: When evaluating specific steps in a pipeline, link to the relevant observation rather than the root trace
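
A consistent naming convention is easiest to enforce with a small helper. A sketch of the "model-name-YYYY-MM-DD-version" pattern — `buildRunName` is a hypothetical utility, not part of the SDK:

```typescript
// Hypothetical helper (not part of the Langfuse SDK) encoding the
// "model-name-YYYY-MM-DD-version" run-naming convention.
function buildRunName(model: string, version: string, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  return `${model}-${day}-${version}`;
}

// buildRunName("gpt-4", "v1", new Date("2024-05-01")) → "gpt-4-2024-05-01-v1"
```

Centralizing the format this way keeps runs sortable and comparable across experiments.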

### Performance Optimization

- **Adjust page size**: For large datasets, tune `fetchItemsPageSize` based on your network and memory constraints
- **Control concurrency**: Use `maxConcurrency` in experiments to avoid overwhelming APIs or resources
- **Batch processing**: Process large datasets in batches with checkpointing for resilience
- **Parallel execution**: Use parallel processing with concurrency limits for faster evaluation

### Experiment Design

- **Start simple**: Begin with basic evaluators and add complexity as needed
- **Use multiple evaluators**: Combine different evaluation approaches (exact match, semantic similarity, factuality, etc.)
- **Include run-level evaluators**: Compute aggregate statistics to understand overall performance
- **Track metadata**: Include model versions, timestamps, and configuration in experiment metadata
- **Version experiments**: Use versioned run names to track experiment iterations

### Error Handling

- **Handle failures gracefully**: Catch errors during task execution and still link items to track failures
- **Set appropriate timeouts**: Configure reasonable timeouts to prevent hanging on slow operations
- **Log errors**: Record error details in trace metadata for debugging
- **Continue on failure**: Design experiments to continue processing remaining items even if some fail

966

967

### Cost Management

968

969

- **Control concurrency**: Limit concurrent API calls to manage rate limits and costs

970

- **Cache results**: Store experiment results to avoid re-running expensive evaluations

971

- **Sample testing**: Test on a subset of items before running full evaluations

972

- **Monitor usage**: Track token usage and API calls through Langfuse traces
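
Sample testing can be as simple as taking an evenly spaced subset of items for a cheap trial run. A sketch — `sampleItems` is an illustrative helper, not an SDK function; in practice you would pass `dataset.items`:

```typescript
// Illustrative helper: pick n evenly spaced items for a trial run
// before committing to a full (and more expensive) evaluation.
function sampleItems<T>(items: T[], n: number): T[] {
  if (n >= items.length) return items;
  const step = items.length / n;
  return Array.from({ length: n }, (_, i) => items[Math.floor(i * step)]);
}

// sampleItems([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3) → [1, 4, 7]
```

Even spacing keeps the trial subset representative of the whole dataset rather than biased toward its beginning.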

## Integration with Experiments

Datasets integrate seamlessly with the experiment system. For detailed information about experiment execution, evaluators, and result analysis, see the [Experiment Management documentation](./experiments.md).

### Key Integration Points

- **Automatic tracing**: Experiments on datasets automatically create traces and link them to dataset runs
- **Dataset run tracking**: All experiment executions on datasets are tracked as dataset runs in Langfuse
- **Result visualization**: Dataset run results are available in the Langfuse UI with detailed analytics
- **Comparison tools**: Compare multiple dataset runs to track improvements over time

## Related APIs

- **[Experiment Management](./experiments.md)**: Run experiments with tasks and evaluators
- **[Tracing](./tracing.md)**: Create and manage traces and observations
- **[Score Management](./scores.md)**: Add scores to traces and observations
- **[Client](./client.md)**: Initialize and configure the Langfuse client