# Dataset Operations

The Dataset Operations system provides comprehensive capabilities for working with evaluation datasets, linking them to traces and observations, and running experiments. Datasets are collections of input-output pairs used for systematic evaluation of LLM applications.

## Capabilities

### Get Dataset

Retrieve a dataset by name with all its items, link functions, and experiment functionality.

```typescript { .api }
/**
 * Retrieves a dataset by name with all its items and experiment functionality
 *
 * Fetches a dataset and all its associated items with automatic pagination handling.
 * The returned dataset includes enhanced functionality for linking items to traces
 * and running experiments directly on the dataset.
 *
 * @param name - The name of the dataset to retrieve
 * @param options - Optional configuration for data fetching
 * @returns Promise resolving to enhanced dataset with items and experiment capabilities
 */
async get(
  name: string,
  options?: {
    /** Number of items to fetch per page (default: 50) */
    fetchItemsPageSize?: number;
  }
): Promise<FetchedDataset>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Basic dataset retrieval
const dataset = await langfuse.dataset.get("my-evaluation-dataset");

console.log(`Dataset: ${dataset.name}`);
console.log(`Description: ${dataset.description}`);
console.log(`Items: ${dataset.items.length}`);
console.log(`Metadata:`, dataset.metadata);

// Access dataset items
for (const item of dataset.items) {
  console.log('Input:', item.input);
  console.log('Expected Output:', item.expectedOutput);
  console.log('Metadata:', item.metadata);
}
```

**Handling Large Datasets:**

```typescript
// For large datasets, a larger fetchItemsPageSize reduces the number of paginated requests
const largeDataset = await langfuse.dataset.get(
  "large-benchmark-dataset",
  { fetchItemsPageSize: 100 }
);

console.log(`Loaded ${largeDataset.items.length} items`);

// Process items in batches
const batchSize = 10;
for (let i = 0; i < largeDataset.items.length; i += batchSize) {
  const batch = largeDataset.items.slice(i, i + batchSize);
  // Process batch...
}
```

**Accessing Dataset Properties:**

```typescript
const dataset = await langfuse.dataset.get("qa-dataset");

// Dataset metadata
console.log(dataset.id);          // Dataset ID
console.log(dataset.name);        // Dataset name
console.log(dataset.description); // Description
console.log(dataset.metadata);    // Custom metadata
console.log(dataset.projectId);   // Project ID
console.log(dataset.createdAt);   // Creation timestamp
console.log(dataset.updatedAt);   // Last update timestamp

// Item properties
const item = dataset.items[0];
console.log(item.id);                  // Item ID
console.log(item.datasetId);           // Parent dataset ID
console.log(item.input);               // Input data
console.log(item.expectedOutput);      // Expected output
console.log(item.metadata);            // Item metadata
console.log(item.sourceTraceId);       // Source trace (if any)
console.log(item.sourceObservationId); // Source observation (if any)
console.log(item.status);              // Status (ACTIVE or ARCHIVED)
```

## Types

### FetchedDataset

Enhanced dataset object with additional methods for linking and experiments.

```typescript { .api }
/**
 * Enhanced dataset with linking and experiment functionality
 *
 * Extends the base Dataset type with:
 * - Array of items with link functions for connecting to traces
 * - runExperiment method for executing experiments directly on the dataset
 *
 * @public
 */
type FetchedDataset = Dataset & {
  /** Dataset items with link functionality for connecting to traces */
  items: (DatasetItem & { link: LinkDatasetItemFunction })[];

  /** Function to run experiments directly on this dataset */
  runExperiment: RunExperimentOnDataset;
};
```

**Properties from Dataset:**

```typescript { .api }
interface Dataset {
  /** Unique identifier for the dataset */
  id: string;

  /** Human-readable name for the dataset */
  name: string;

  /** Optional description explaining the dataset's purpose */
  description?: string | null;

  /** Custom metadata attached to the dataset */
  metadata?: Record<string, any> | null;

  /** Project ID this dataset belongs to */
  projectId: string;

  /** Timestamp when the dataset was created */
  createdAt: string;

  /** Timestamp when the dataset was last updated */
  updatedAt: string;
}
```

### DatasetItem

Individual item within a dataset containing input, expected output, and metadata.

```typescript { .api }
/**
 * Dataset item with input/output pair for evaluation
 *
 * Represents a single test case within a dataset. Each item can contain
 * any type of input and expected output, along with optional metadata
 * and linkage to source traces/observations.
 *
 * @public
 */
interface DatasetItem {
  /** Unique identifier for the dataset item */
  id: string;

  /** ID of the parent dataset */
  datasetId: string;

  /** Name of the parent dataset */
  datasetName: string;

  /** Input data (can be any type: string, object, array, etc.) */
  input?: any;

  /** Expected output for evaluation (can be any type) */
  expectedOutput?: any;

  /** Custom metadata for this item */
  metadata?: Record<string, any> | null;

  /** ID of the trace this item was created from (if applicable) */
  sourceTraceId?: string | null;

  /** ID of the observation this item was created from (if applicable) */
  sourceObservationId?: string | null;

  /** Status of the item (ACTIVE or ARCHIVED) */
  status: "ACTIVE" | "ARCHIVED";

  /** Timestamp when the item was created */
  createdAt: string;

  /** Timestamp when the item was last updated */
  updatedAt: string;
}
```

### LinkDatasetItemFunction

Function type for linking dataset items to OpenTelemetry spans for tracking experiments.

```typescript { .api }
/**
 * Links dataset items to OpenTelemetry spans
 *
 * Creates a connection between a dataset item and a trace/observation,
 * enabling tracking of which dataset items were used in which experiments.
 * This is essential for creating dataset runs and tracking experiment lineage.
 *
 * @param obj - Object containing the OpenTelemetry span
 * @param obj.otelSpan - The OpenTelemetry span from a Langfuse observation
 * @param runName - Name of the experiment run for grouping related items
 * @param runArgs - Optional configuration for the dataset run
 * @returns Promise resolving to the created dataset run item
 *
 * @public
 */
type LinkDatasetItemFunction = (
  obj: { otelSpan: Span },
  runName: string,
  runArgs?: {
    /** Description of the dataset run */
    description?: string;

    /** Additional metadata for the dataset run */
    metadata?: any;
  }
) => Promise<DatasetRunItem>;
```

### DatasetRunItem

Result of linking a dataset item to a trace execution.

```typescript { .api }
/**
 * Linked dataset run item
 *
 * Represents the connection between a dataset item and a specific
 * trace execution within a dataset run. Used for tracking experiment results.
 *
 * @public
 */
interface DatasetRunItem {
  /** Unique identifier for the run item */
  id: string;

  /** ID of the dataset run this item belongs to */
  datasetRunId: string;

  /** Name of the dataset run this item belongs to */
  datasetRunName: string;

  /** ID of the dataset item */
  datasetItemId: string;

  /** ID of the trace this run item is linked to */
  traceId: string;

  /** Optional ID of the observation this run item is linked to */
  observationId?: string;

  /** Timestamp when the run item was created */
  createdAt: string;

  /** Timestamp when the run item was last updated */
  updatedAt: string;
}
```
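
To illustrate how the linkage fields fit together, a run item can be summarized with a small helper. This is a sketch: `describeRunItem` and the trimmed `DatasetRunItemLike` shape are hypothetical, defined here only for illustration, not part of the SDK.

```typescript
// Trimmed to the fields used here; mirrors the DatasetRunItem interface above
interface DatasetRunItemLike {
  id: string;
  datasetRunName: string;
  datasetItemId: string;
  traceId: string;
  observationId?: string;
}

// Hypothetical helper: one-line summary of where a run item points.
// A run item targets a specific observation when observationId is set,
// otherwise the root trace.
function describeRunItem(runItem: DatasetRunItemLike): string {
  const target = runItem.observationId
    ? `observation ${runItem.observationId}`
    : `trace ${runItem.traceId}`;
  return `run "${runItem.datasetRunName}" linked item ${runItem.datasetItemId} to ${target}`;
}

console.log(describeRunItem({
  id: "ri-1",
  datasetRunName: "gpt-4-baseline",
  datasetItemId: "item-42",
  traceId: "trace-abc",
}));
// → run "gpt-4-baseline" linked item item-42 to trace trace-abc
```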

### RunExperimentOnDataset

Function type for running experiments directly on fetched datasets.

```typescript { .api }
/**
 * Runs experiments on Langfuse datasets
 *
 * This function type is attached to fetched datasets to enable convenient
 * experiment execution. The data parameter is automatically provided from
 * the dataset items.
 *
 * @param params - Experiment parameters (excluding data)
 * @returns Promise resolving to experiment results
 *
 * @public
 */
type RunExperimentOnDataset = (
  params: Omit<ExperimentParams<any, any, Record<string, any>>, "data">
) => Promise<ExperimentResult<any, any, Record<string, any>>>;
```

## Usage Patterns

### Basic Dataset Retrieval and Exploration

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Fetch dataset
const dataset = await langfuse.dataset.get("customer-support-qa");

console.log(`Dataset: ${dataset.name}`);
console.log(`Total items: ${dataset.items.length}`);

// Explore items
dataset.items.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('  Input:', item.input);
  console.log('  Expected:', item.expectedOutput);

  if (item.metadata) {
    console.log('  Metadata:', item.metadata);
  }
});
```

### Linking Dataset Items to Traces

Link dataset items to trace executions to create dataset runs and track experiment results.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("qa-benchmark");
const runName = "gpt-4-evaluation-v1";

// Process each item and link to traces
for (const item of dataset.items) {
  // Create a trace for this execution
  const span = startObservation("qa-task", {
    input: item.input,
    metadata: { datasetItemId: item.id }
  });

  try {
    // Execute your task
    const output = await runYourTask(item.input);

    // Update trace with output
    span.update({ output });

    // Link dataset item to this trace
    await item.link({ otelSpan: span.otelSpan }, runName);

  } catch (error) {
    // Handle errors
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    // Still link the item (to track failures)
    await item.link({ otelSpan: span.otelSpan }, runName);
  } finally {
    span.end();
  }
}

console.log(`Completed dataset run: ${runName}`);
```

### Linking with Run Metadata

Add descriptions and metadata to dataset runs for better organization.

```typescript
const dataset = await langfuse.dataset.get("model-comparison");
const runName = "claude-3-opus-eval";

for (const item of dataset.items) {
  const span = startObservation("evaluation-task", {
    input: item.input
  });

  const output = await evaluateWithClaude(item.input);
  span.update({ output });
  span.end();

  // Link with descriptive metadata
  await item.link({ otelSpan: span.otelSpan }, runName, {
    description: "Claude 3 Opus evaluation on reasoning tasks",
    metadata: {
      modelVersion: "claude-3-opus-20240229",
      temperature: 0.7,
      maxTokens: 1000,
      timestamp: new Date().toISOString(),
      experimentGroup: "reasoning-tasks"
    }
  });
}
```

### Linking Nested Observations

Link dataset items to specific observations within a trace hierarchy.

```typescript
const dataset = await langfuse.dataset.get("translation-dataset");
const runName = "translation-pipeline-v2";

for (const item of dataset.items) {
  // Create parent trace
  const trace = startObservation("translation-pipeline", {
    input: item.input
  });

  // Create preprocessing observation
  const preprocessor = trace.startObservation("preprocessing", {
    input: item.input
  });
  const preprocessed = await preprocess(item.input);
  preprocessor.update({ output: preprocessed });
  preprocessor.end();

  // Create translation observation (the main task)
  const translator = trace.startObservation("translation", {
    input: preprocessed,
    model: "gpt-4"
  }, { asType: "generation" });

  const translated = await translate(preprocessed);
  translator.update({ output: translated });
  translator.end();

  // Create postprocessing observation
  const postprocessor = trace.startObservation("postprocessing", {
    input: translated
  });
  const final = await postprocess(translated);
  postprocessor.update({ output: final });
  postprocessor.end();

  trace.update({ output: final });
  trace.end();

  // Link to the specific translation observation
  await item.link({ otelSpan: translator.otelSpan }, runName, {
    description: "Translation quality evaluation",
    metadata: { pipeline: "v2", stage: "translation" }
  });
}
```

### Running Experiments on Datasets

Execute experiments directly on datasets with automatic tracing and evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { observeOpenAI } from '@langfuse/openai';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("capital-cities");

// Define task
const task = async ({ input }: { input: string }) => {
  const client = observeOpenAI(new OpenAI());

  const response = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "user", content: `What is the capital of ${input}?` }
    ]
  });

  return response.choices[0].message.content;
};

// Define evaluator
const exactMatchEvaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Run experiment
const result = await dataset.runExperiment({
  name: "Capital Cities Evaluation",
  runName: "gpt-4-baseline",
  description: "Baseline evaluation with GPT-4",
  task,
  evaluators: [exactMatchEvaluator],
  maxConcurrency: 5
});

// View results
console.log(await result.format());
console.log(`Dataset run URL: ${result.datasetRunUrl}`);
```

### Advanced Experiment with Multiple Evaluators

```typescript
import { LangfuseClient, Evaluator } from '@langfuse/client';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
import { Levenshtein, Factuality } from 'autoevals';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("qa-dataset");

// Custom evaluator using OpenAI
const semanticSimilarityEvaluator: Evaluator = async ({
  output,
  expectedOutput
}) => {
  const openai = new OpenAI();

  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Rate the semantic similarity between these two answers on a scale of 0 to 1:

Answer 1: ${output}
Answer 2: ${expectedOutput}

Respond with just a number between 0 and 1.`
      }
    ]
  });

  const score = parseFloat(response.choices[0].message.content || "0");

  return {
    name: "semantic_similarity",
    value: score,
    comment: `Comparison between output and expected output`
  };
};

// Run experiment with multiple evaluators
const result = await dataset.runExperiment({
  name: "Multi-Evaluator Experiment",
  runName: "comprehensive-eval-v1",
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(Factuality),

    // Custom evaluator
    semanticSimilarityEvaluator
  ]
});

// Analyze results
console.log(await result.format({ includeItemResults: true }));

// Access individual scores
result.itemResults.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('Input:', item.input);
  console.log('Output:', item.output);
  console.log('Expected:', item.expectedOutput);
  console.log('Evaluations:');

  item.evaluations.forEach(evaluation => {
    console.log(`  ${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`    Comment: ${evaluation.comment}`);
    }
  });
});
```

### Experiment with Run-Level Evaluators

Use run-level evaluators to compute aggregate statistics across all items.

```typescript
import { LangfuseClient, RunEvaluator } from '@langfuse/client';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("benchmark-dataset");

// Item-level evaluator scored on each result
const accuracyEvaluator = async ({ output, expectedOutput }) => ({
  name: "accuracy",
  value: output === expectedOutput ? 1 : 0
});

// Define a run-level evaluator for computing averages
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(result => result.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy across ${scores.length} items`
  };
};

// Run experiment
const result = await dataset.runExperiment({
  name: "Accuracy Benchmark",
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageScoreEvaluator]
});

// Check aggregate results
console.log('Run-level evaluations:');
result.runEvaluations.forEach(evaluation => {
  console.log(`${evaluation.name}: ${evaluation.value}`);
  if (evaluation.comment) {
    console.log(`  ${evaluation.comment}`);
  }
});
```

### Comparing Multiple Models

Run experiments on the same dataset with different models for comparison.

```typescript
import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("reasoning-tasks");

const openai = new OpenAI();

// Define models to compare
const models = [
  "gpt-4",
  "gpt-3.5-turbo",
  "gpt-4-turbo-preview"
];

const evaluator = async ({ output, expectedOutput }) => ({
  name: "correctness",
  value: evaluateCorrectness(output, expectedOutput)
});

// Run experiment for each model
const results = [];

for (const model of models) {
  const result = await dataset.runExperiment({
    name: "Model Comparison",
    runName: `${model}-evaluation`,
    description: `Evaluation with ${model}`,
    metadata: { model },
    task: async ({ input }) => {
      const response = await openai.chat.completions.create({
        model,
        messages: [{ role: "user", content: input }]
      });
      return response.choices[0].message.content;
    },
    evaluators: [evaluator],
    maxConcurrency: 3
  });

  results.push({ model, result });
  console.log(`Completed: ${model}`);
  console.log(await result.format());
}

// Compare results
console.log("\n=== Model Comparison Summary ===");
results.forEach(({ model, result }) => {
  const avgScore = result.itemResults
    .flatMap(r => r.evaluations)
    .reduce((sum, e) => sum + (e.value as number), 0) / result.itemResults.length;

  console.log(`${model}: ${avgScore.toFixed(3)}`);
  console.log(`  URL: ${result.datasetRunUrl}`);
});
```

### Incremental Dataset Processing

Process datasets incrementally with checkpointing for long-running experiments.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import * as fs from 'fs';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("large-dataset");
const runName = "incremental-processing-v1";

// Load checkpoint if exists
const checkpointFile = './checkpoint.json';
let processedIds = new Set<string>();

if (fs.existsSync(checkpointFile)) {
  const checkpoint = JSON.parse(fs.readFileSync(checkpointFile, 'utf-8'));
  processedIds = new Set(checkpoint.processedIds);
  console.log(`Resuming from checkpoint: ${processedIds.size} items processed`);
}

// Process items
for (const [index, item] of dataset.items.entries()) {
  // Skip already processed items
  if (processedIds.has(item.id)) {
    continue;
  }

  console.log(`Processing item ${index + 1}/${dataset.items.length}`);

  try {
    const span = startObservation("processing-task", {
      input: item.input,
      metadata: { itemId: item.id }
    });

    const output = await processItem(item.input);
    span.update({ output });
    span.end();

    await item.link({ otelSpan: span.otelSpan }, runName, {
      metadata: { batchIndex: Math.floor(index / 100) }
    });

    // Update checkpoint
    processedIds.add(item.id);
    fs.writeFileSync(
      checkpointFile,
      JSON.stringify({ processedIds: Array.from(processedIds) })
    );

  } catch (error) {
    console.error(`Error processing item ${item.id}:`, error);
    // Continue with next item
  }
}

console.log(`Completed processing ${processedIds.size} items`);

// Clean up checkpoint
fs.unlinkSync(checkpointFile);
```

### Parallel Processing with Concurrency Control

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import pLimit from 'p-limit';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("parallel-dataset");
const runName = "parallel-processing-v1";

// Limit concurrent operations
const limit = pLimit(10);

// Process items in parallel with concurrency limit
const tasks = dataset.items.map(item =>
  limit(async () => {
    const span = startObservation("parallel-task", {
      input: item.input
    });

    try {
      const output = await processItem(item.input);
      span.update({ output });

      await item.link({ otelSpan: span.otelSpan }, runName);

      return { success: true, itemId: item.id };
    } catch (error) {
      span.update({
        output: { error: String(error) },
        level: "ERROR"
      });

      await item.link({ otelSpan: span.otelSpan }, runName);

      return { success: false, itemId: item.id, error };
    } finally {
      span.end();
    }
  })
);

// Wait for all tasks to complete
const results = await Promise.all(tasks);

// Summarize results
const successful = results.filter(r => r.success).length;
const failed = results.filter(r => !r.success).length;

console.log(`Completed: ${successful} successful, ${failed} failed`);
```

### Integration with LangChain

Use datasets with LangChain applications for systematic evaluation.

```typescript
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("langchain-eval");

// Create LangChain components
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to French: {text}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);

const runName = "langchain-translation-eval";

// Process each dataset item
for (const item of dataset.items) {
  // Create trace for this execution
  const span = startObservation("langchain-execution", {
    input: { text: item.input },
    metadata: { chainType: "translation" }
  });

  try {
    // Execute chain
    const result = await chain.invoke({ text: item.input });

    // Update trace with output
    span.update({ output: result });

    // Link dataset item
    await item.link({ otelSpan: span.otelSpan }, runName, {
      description: "LangChain translation evaluation"
    });

    // Score the result
    langfuse.score.observation(span, {
      name: "translation_quality",
      value: computeQuality(result, item.expectedOutput)
    });

  } catch (error) {
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    await item.link({ otelSpan: span.otelSpan }, runName);
  }

  span.end();
}

// Flush scores
await langfuse.flush();
```

### Using Dataset Experiments with Custom Data Structures

```typescript
import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset with structured inputs
const dataset = await langfuse.dataset.get("structured-qa");

// Task that handles structured input
const task = async ({ input }) => {
  // Input is an object with specific structure
  const { question, context } = input;

  const response = await callLLM({
    systemPrompt: "Answer questions based on the context.",
    userPrompt: `Context: ${context}\n\nQuestion: ${question}`
  });

  return response;
};

// Evaluator that handles structured output
const evaluator = async ({ input, output, expectedOutput }) => {
  const { question } = input;

  // Complex evaluation logic
  const scores = {
    accuracy: evaluateAccuracy(output, expectedOutput),
    relevance: evaluateRelevance(output, question),
    completeness: evaluateCompleteness(output, expectedOutput)
  };

  // Return multiple evaluations
  return [
    { name: "accuracy", value: scores.accuracy },
    { name: "relevance", value: scores.relevance },
    { name: "completeness", value: scores.completeness },
    {
      name: "overall",
      value: (scores.accuracy + scores.relevance + scores.completeness) / 3,
      metadata: { breakdown: scores }
    }
  ];
};

// Run experiment
const result = await dataset.runExperiment({
  name: "Structured QA Evaluation",
  task,
  evaluators: [evaluator]
});

console.log(await result.format({ includeItemResults: true }));
```

## Best Practices

### Dataset Organization

- **Use descriptive names**: Name datasets clearly to indicate their purpose (e.g., "customer-support-qa-v2", "translation-benchmark-2024")
- **Add metadata**: Include relevant context in dataset and item metadata for filtering and analysis
- **Version datasets**: Create new dataset versions when making significant changes rather than modifying existing ones
- **Document expected outputs**: Always provide expected outputs when available to enable automatic evaluation

### Linking Strategy

- **Consistent run names**: Use consistent naming conventions for dataset runs (e.g., "model-name-YYYY-MM-DD-version")
- **Add descriptions**: Include run descriptions to document the purpose and configuration of each evaluation
- **Use metadata**: Attach relevant metadata (model versions, hyperparameters, etc.) to enable comparison and filtering
- **Link to specific observations**: When evaluating specific steps in a pipeline, link to the relevant observation rather than the root trace
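
A run-name convention like the one suggested above can be centralized in a small helper. This is a sketch: `buildRunName` is an illustrative function, not part of the SDK.

```typescript
// Hypothetical helper enforcing the "model-name-YYYY-MM-DD-version" convention
function buildRunName(model: string, version: number, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${model}-${day}-v${version}`;
}

console.log(buildRunName("gpt-4", 2, new Date("2024-03-01")));
// → gpt-4-2024-03-01-v2
```

Generating run names through one function keeps runs sortable and makes filtering by model or date in the UI straightforward.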

### Performance Optimization

- **Adjust page size**: For large datasets, tune `fetchItemsPageSize` based on your network and memory constraints
- **Control concurrency**: Use `maxConcurrency` in experiments to avoid overwhelming APIs or resources
- **Batch processing**: Process large datasets in batches with checkpointing for resilience
- **Parallel execution**: Use parallel processing with concurrency limits for faster evaluation

### Experiment Design

- **Start simple**: Begin with basic evaluators and add complexity as needed
- **Use multiple evaluators**: Combine different evaluation approaches (exact match, semantic similarity, factuality, etc.)
- **Include run-level evaluators**: Compute aggregate statistics to understand overall performance
- **Track metadata**: Include model versions, timestamps, and configuration in experiment metadata
- **Version experiments**: Use versioned run names to track experiment iterations

### Error Handling

- **Handle failures gracefully**: Catch errors during task execution and still link items to track failures
- **Set appropriate timeouts**: Configure reasonable timeouts to prevent hanging on slow operations
- **Log errors**: Record error details in trace metadata for debugging
- **Continue on failure**: Design experiments to continue processing remaining items even if some fail
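
The timeout recommendation above can be implemented as a generic wrapper around any task promise. This is a sketch: `withTimeout` is an illustrative helper, not an SDK function.

```typescript
// Hypothetical helper: reject if a promise takes longer than timeoutMs
async function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    // Whichever settles first wins; the timer is always cleaned up
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer!);
  }
}

// Usage inside a task, e.g.: const output = await withTimeout(runYourTask(item.input), 30_000);
```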

### Cost Management

- **Control concurrency**: Limit concurrent API calls to manage rate limits and costs
- **Cache results**: Store experiment results to avoid re-running expensive evaluations
- **Sample testing**: Test on a subset of items before running full evaluations
- **Monitor usage**: Track token usage and API calls through Langfuse traces
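
The sample-testing advice can be sketched with a small helper that picks an evenly spaced subset of items for a cheap dry run before the full evaluation. `sampleItems` is illustrative, not part of the SDK.

```typescript
// Hypothetical helper: deterministic, evenly spaced sample of dataset items
function sampleItems<T>(items: T[], sampleSize: number): T[] {
  if (sampleSize >= items.length) return [...items];
  const step = Math.floor(items.length / sampleSize);
  return Array.from({ length: sampleSize }, (_, i) => items[i * step]);
}

// Dry-run the task on a handful of items before a full dataset run, e.g.:
// for (const item of sampleItems(dataset.items, 10)) { ... }
```

An evenly spaced sample tends to be more representative than the first N items when datasets are ordered by topic or difficulty.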

## Integration with Experiments

Datasets integrate seamlessly with the experiment system. For detailed information about experiment execution, evaluators, and result analysis, see the [Experiment Management documentation](./experiments.md).

### Key Integration Points

- **Automatic tracing**: Experiments on datasets automatically create traces and link them to dataset runs
- **Dataset run tracking**: All experiment executions on datasets are tracked as dataset runs in Langfuse
- **Result visualization**: Dataset run results are available in the Langfuse UI with detailed analytics
- **Comparison tools**: Compare multiple dataset runs to track improvements over time

## Related APIs

- **[Experiment Management](./experiments.md)**: Run experiments with tasks and evaluators
- **[Tracing](./tracing.md)**: Create and manage traces and observations
- **[Score Management](./scores.md)**: Add scores to traces and observations
- **[Client](./client.md)**: Initialize and configure the Langfuse client