# Experiment Execution

The Experiment Execution system provides a comprehensive framework for running experiments that test models or tasks against datasets, with support for automatic evaluation, scoring, tracing, and result analysis. It enables systematic testing, comparison, and evaluation of AI models and prompts.

## Capabilities

### Run Experiment

Execute an experiment by running a task on each data item and evaluating the results with full tracing integration.

```typescript { .api }
/**
 * Executes an experiment by running a task on each data item and evaluating the results
 *
 * This method orchestrates the complete experiment lifecycle:
 * 1. Executes the task function on each data item with proper tracing
 * 2. Runs item-level evaluators on each task output
 * 3. Executes run-level evaluators on the complete result set
 * 4. Links results to dataset runs (for Langfuse datasets)
 * 5. Stores all scores and traces in Langfuse
 *
 * @param config - The experiment configuration
 * @returns Promise that resolves to experiment results including itemResults, runEvaluations, and format function
 */
run<Input = any, ExpectedOutput = any, Metadata extends Record<string, any> = Record<string, any>>(
  config: ExperimentParams<Input, ExpectedOutput, Metadata>
): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Basic experiment with custom data
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  description: "Testing model knowledge of world capitals",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});

console.log(await result.format());

// Experiment on Langfuse dataset
const dataset = await langfuse.dataset.get("qa-dataset");

const datasetResult = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Testing GPT-4 on our QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      const exact = output === expectedOutput;
      const caseInsensitive = output.toLowerCase() === expectedOutput.toLowerCase();
      return {
        name: "accuracy",
        value: caseInsensitive ? 1 : 0,
        comment: exact ? "Exact match" : caseInsensitive ? "Case-insensitive match" : "No match"
      };
    }
  ]
});

// Multiple evaluators
const multiEvalResult = await langfuse.experiment.run({
  name: "Translation Quality Test",
  data: [
    { input: "Hello world", expectedOutput: "Hola mundo" },
    { input: "Good morning", expectedOutput: "Buenos días" }
  ],
  task: async ({ input }) => translateText(input, 'es'),
  evaluators: [
    // Evaluator 1: Exact match
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    }),
    // Evaluator 2: BLEU score
    async ({ output, expectedOutput }) => ({
      name: "bleu_score",
      value: calculateBleuScore(output, expectedOutput),
      comment: "Translation quality metric"
    }),
    // Evaluator 3: Length similarity
    async ({ output, expectedOutput }) => ({
      name: "length_similarity",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    })
  ]
});

// Run-level evaluators for aggregate analysis
const aggregateResult = await langfuse.experiment.run({
  name: "Sentiment Classification",
  data: sentimentDataset,
  task: classifySentiment,
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    })
  ],
  runEvaluators: [
    // Average accuracy across all items
    async ({ itemResults }) => {
      const accuracyScores = itemResults
        .flatMap(r => r.evaluations)
        .filter(e => e.name === "accuracy")
        .map(e => e.value as number);

      const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;

      return {
        name: "average_accuracy",
        value: average,
        comment: `Overall accuracy: ${(average * 100).toFixed(1)}%`
      };
    },
    // Precision calculation
    async ({ itemResults }) => {
      let truePositives = 0;
      let falsePositives = 0;

      for (const result of itemResults) {
        if (result.output === "positive") {
          if (result.expectedOutput === "positive") {
            truePositives++;
          } else {
            falsePositives++;
          }
        }
      }

      const precision = truePositives / (truePositives + falsePositives);

      return {
        name: "precision",
        value: precision,
        comment: `Precision for positive class: ${(precision * 100).toFixed(1)}%`
      };
    }
  ]
});

// Concurrency control with maxConcurrency
const largeScaleResult = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  description: "Processing 1000 items with rate limiting",
  data: largeDataset,
  task: expensiveModelCall,
  maxConcurrency: 5, // Process max 5 items simultaneously
  evaluators: [accuracyEvaluator]
});

// Custom run name
const customRunResult = await langfuse.experiment.run({
  name: "Model Comparison",
  runName: "gpt-4-turbo-2024-01-15",
  description: "Testing latest GPT-4 Turbo model",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// With metadata
const metadataResult = await langfuse.experiment.run({
  name: "Parameter Sweep",
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    max_tokens: 1000,
    experiment_version: "v2.1"
  },
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Formatting results
const formattedResult = await langfuse.experiment.run({
  name: "Test Run",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Format summary only (default)
console.log(await formattedResult.format());

// Format with detailed item results
console.log(await formattedResult.format({ includeItemResults: true }));

// Access raw results
console.log(`Processed ${formattedResult.itemResults.length} items`);
console.log(`Run evaluations:`, formattedResult.runEvaluations);
console.log(`Dataset run URL:`, formattedResult.datasetRunUrl);
```

**OpenTelemetry Integration:**

The experiment system automatically integrates with OpenTelemetry for distributed tracing:

```typescript
import { LangfuseClient } from '@langfuse/client';

// Ensure OpenTelemetry is configured
const langfuse = new LangfuseClient();

// Experiments automatically create traces for each task execution
const result = await langfuse.experiment.run({
  name: "Traced Experiment",
  data: testData,
  task: async ({ input }) => {
    // This task execution is automatically wrapped in a trace
    // with name "experiment-item-run"
    const output = await processInput(input);
    return output;
  },
  evaluators: [myEvaluator]
});

// Each item result includes trace information
for (const itemResult of result.itemResults) {
  console.log(`Trace ID: ${itemResult.traceId}`);
  const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
  console.log(`View trace: ${traceUrl}`);
}

// Warning if OpenTelemetry is not set up
// The system will log:
// "OpenTelemetry has not been set up. Traces will not be sent to Langfuse."
```
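
Exporting these traces requires an OpenTelemetry provider with a Langfuse span processor registered. A minimal setup sketch, assuming the `@langfuse/otel` package's `LangfuseSpanProcessor` and the standard OpenTelemetry Node SDK (verify names against your installed versions):

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

// Register the Langfuse span processor so experiment spans are exported.
// Credentials are typically read from the LANGFUSE_PUBLIC_KEY,
// LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL environment variables.
const sdk = new NodeSDK({
  spanProcessors: [new LangfuseSpanProcessor()],
});

sdk.start();
```

Run this initialization before constructing `LangfuseClient`, so experiment task executions have an active tracer provider to attach to.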

**Error Handling:**

```typescript
// Task errors are caught and logged
const resilientResult = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: async ({ input }) => {
    try {
      return await riskyOperation(input);
    } catch (error) {
      // Task errors are caught, logged, and the item is skipped
      throw error;
    }
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      try {
        return {
          name: "score",
          value: calculateScore(output, expectedOutput)
        };
      } catch (error) {
        // Evaluator errors are caught and logged
        // Other evaluators continue to run
        throw error;
      }
    }
  ]
});

// Result contains only successfully processed items
console.log(`Successfully processed: ${resilientResult.itemResults.length} items`);

// Run evaluators also handle errors gracefully
const robustResult = await langfuse.experiment.run({
  name: "Robust Experiment",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      try {
        return {
          name: "aggregate_metric",
          value: calculateAggregate(itemResults)
        };
      } catch (error) {
        // Run evaluator errors are caught and logged
        // Other run evaluators continue to run
        throw error;
      }
    }
  ]
});
```

## Type Definitions

### ExperimentParams

Configuration parameters for experiment execution.

```typescript { .api }
type ExperimentParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Human-readable name for the experiment.
   *
   * This name will appear in the Langfuse UI and experiment results.
   * Choose a descriptive name that identifies the experiment's purpose.
   */
  name: string;

  /**
   * Optional exact name for the experiment run.
   *
   * If provided, this will be used as the exact dataset run name if the data
   * contains Langfuse dataset items. If not provided, this will default to
   * the experiment name appended with an ISO timestamp.
   */
  runName?: string;

  /**
   * Optional description explaining the experiment's purpose.
   *
   * Provide context about what you're testing, methodology, or goals.
   * This helps with experiment tracking and result interpretation.
   */
  description?: string;

  /**
   * Optional metadata to attach to the experiment run.
   *
   * Store additional context like model versions, hyperparameters,
   * or any other relevant information for analysis and comparison.
   */
  metadata?: Record<string, any>;

  /**
   * Array of data items to process.
   *
   * Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse.
   * Each item should contain input data and optionally expected output.
   */
  data: ExperimentItem<Input, ExpectedOutput, Metadata>[];

  /**
   * The task function to execute on each data item.
   *
   * This function receives input data and produces output that will be evaluated.
   * It should encapsulate the model or system being tested.
   */
  task: ExperimentTask<Input, ExpectedOutput, Metadata>;

  /**
   * Optional array of evaluator functions to assess each item's output.
   *
   * Each evaluator receives input, output, and expected output (if available)
   * and returns evaluation results. Multiple evaluators enable comprehensive assessment.
   */
  evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Optional array of run-level evaluators to assess the entire experiment.
   *
   * These evaluators receive all item results and can perform aggregate analysis
   * like calculating averages, detecting patterns, or statistical analysis.
   */
  runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Maximum number of concurrent task executions (default: Infinity).
   *
   * Controls parallelism to manage resource usage and API rate limits.
   * Set lower values for expensive operations or rate-limited services.
   */
  maxConcurrency?: number;
};
```

**Usage Examples:**

```typescript
import type { ExperimentParams } from '@langfuse/client';

// Type-safe experiment configuration
const config: ExperimentParams<string, string> = {
  name: "Capital Cities",
  description: "Testing geography knowledge",
  metadata: {
    model: "gpt-4",
    version: "v1.0"
  },
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" }
  ],
  task: async ({ input }) => {
    return await getCapital(input);
  },
  evaluators: [exactMatchEvaluator],
  runEvaluators: [averageScoreEvaluator],
  maxConcurrency: 3
};

await langfuse.experiment.run(config);

// Generic types for complex data
interface CustomInput {
  question: string;
  context: string[];
}

interface CustomOutput {
  answer: string;
  confidence: number;
}

interface CustomMetadata {
  category: string;
  difficulty: "easy" | "medium" | "hard";
}

const typedConfig: ExperimentParams<CustomInput, CustomOutput, CustomMetadata> = {
  name: "QA with Context",
  data: [
    {
      input: {
        question: "What is AI?",
        context: ["AI stands for Artificial Intelligence"]
      },
      expectedOutput: {
        answer: "Artificial Intelligence",
        confidence: 0.95
      },
      metadata: {
        category: "technology",
        difficulty: "easy"
      }
    }
  ],
  task: async ({ input }) => {
    // Type-safe input and output
    return await qaModel(input.question, input.context);
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      return {
        name: "accuracy",
        value: output.answer === expectedOutput?.answer ? 1 : 0
      };
    }
  ]
};
```
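
To illustrate what `maxConcurrency` configures, here is a hedged sketch of a bounded-parallelism helper (a hypothetical `mapWithConcurrency`, not part of the SDK): it runs at most `limit` tasks at once while preserving result order, which is essentially the behavior the option controls.

```typescript
// Hypothetical helper illustrating bounded parallelism, as configured by
// maxConcurrency. Not part of the Langfuse SDK.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  // Start at most `limit` workers; results keep the input order.
  const workers = Array.from(
    { length: Math.max(1, Math.min(limit, items.length)) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With `limit: 5` this mirrors `maxConcurrency: 5` in the earlier example: a sixth task only starts once one of the five in-flight tasks settles.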

### ExperimentTask

Function type for experiment tasks that process input data and return output.

```typescript { .api }
/**
 * Function type for experiment tasks that process input data and return output
 *
 * The task function is the core component being tested in an experiment.
 * It receives either an ExperimentItem or DatasetItem and produces output
 * that will be evaluated.
 *
 * @param params - Either an ExperimentItem or DatasetItem containing input and metadata
 * @returns Promise resolving to the task's output (any type)
 */
type ExperimentTask<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: ExperimentTaskParams<Input, ExpectedOutput, Metadata>
) => Promise<any>;

type ExperimentTaskParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = ExperimentItem<Input, ExpectedOutput, Metadata>;
```

**Usage Examples:**

```typescript
import type { ExperimentTask } from '@langfuse/client';

// Simple task function
const simpleTask: ExperimentTask = async ({ input }) => {
  return await processInput(input);
};

// Task with type safety
const typedTask: ExperimentTask<string, string> = async ({ input, metadata }) => {
  // input is typed as string
  // metadata is typed as Record<string, any>
  return await processString(input);
};

// Task accessing expected output (for reference)
const referenceTask: ExperimentTask = async ({ input, expectedOutput }) => {
  // Can access expectedOutput for context (but shouldn't use it for cheating!)
  console.log(`Processing input, expecting: ${expectedOutput}`);
  return await myModel(input);
};

// Task with custom types
interface QuestionInput {
  question: string;
  context: string;
}

const qaTask: ExperimentTask<QuestionInput, string> = async ({ input, metadata }) => {
  const { question, context } = input;
  return await answerQuestion(question, context);
};

// Task handling both ExperimentItem and DatasetItem
const universalTask: ExperimentTask = async (item) => {
  // Works with both types
  const input = item.input;
  const meta = item.metadata || {};

  // Check if it's a DatasetItem (has id and datasetId)
  if ('id' in item && 'datasetId' in item) {
    console.log(`Processing dataset item: ${item.id}`);
  }

  return await process(input, meta);
};

// Task with error handling
const robustTask: ExperimentTask = async ({ input }) => {
  try {
    return await riskyOperation(input);
  } catch (error) {
    console.error(`Task failed for input:`, input, error);
    throw error; // Re-throw to skip this item
  }
};

// Task with nested tracing
const tracedTask: ExperimentTask = async ({ input }) => {
  // Nested operations are automatically traced
  const step1 = await preprocessInput(input);
  const step2 = await modelInference(step1);
  const step3 = await postprocess(step2);
  return step3;
};
```

### ExperimentItem

Data item type for experiment inputs, supporting both custom items and Langfuse dataset items.

```typescript { .api }
/**
 * Experiment data item or dataset item
 *
 * Can be either a custom item with input/expectedOutput/metadata
 * or a DatasetItem from Langfuse
 */
type ExperimentItem<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> =
  | {
      /**
       * The input data to pass to the task function.
       *
       * Can be any type - string, object, array, etc. This data will be passed
       * to your task function as the `input` parameter.
       */
      input?: Input;

      /**
       * The expected output for evaluation purposes.
       *
       * Optional ground truth or reference output for this input.
       * Used by evaluators to assess task performance.
       */
      expectedOutput?: ExpectedOutput;

      /**
       * Optional metadata to attach to the experiment item.
       *
       * Store additional context, tags, or custom data related to this specific item.
       * This metadata will be available in traces and evaluators.
       */
      metadata?: Metadata;
    }
  | DatasetItem;
```

**Usage Examples:**

```typescript
import type { ExperimentItem } from '@langfuse/client';

// Simple string items
const stringItems: ExperimentItem<string, string>[] = [
  { input: "Hello", expectedOutput: "Hola" },
  { input: "Goodbye", expectedOutput: "Adiós" }
];

// Complex structured items
interface QAInput {
  question: string;
  context: string;
}

const qaItems: ExperimentItem<QAInput, string>[] = [
  {
    input: {
      question: "What is AI?",
      context: "AI stands for Artificial Intelligence..."
    },
    expectedOutput: "Artificial Intelligence"
  }
];

// Items with metadata
const itemsWithMetadata: ExperimentItem<string, string, { category: string }>[] = [
  {
    input: "Test input",
    expectedOutput: "Expected output",
    metadata: {
      category: "technology"
    }
  }
];

// Items without expected output (evaluation based on output only)
const noExpectedOutput: ExperimentItem<string, never>[] = [
  { input: "Generate creative text" }
  // No expectedOutput - evaluators won't have ground truth
];

// Mixed with Langfuse dataset items
const dataset = await langfuse.dataset.get("my-dataset");
const mixedItems: ExperimentItem[] = [
  // Custom items
  { input: "Custom input", expectedOutput: "Custom output" },
  // Dataset items
  ...dataset.items
];

// Accessing item properties in task
const task: ExperimentTask = async (item) => {
  if ('id' in item && 'datasetId' in item) {
    // It's a DatasetItem
    console.log(`Dataset item ID: ${item.id}`);
    console.log(`Dataset ID: ${item.datasetId}`);
  }

  return await process(item.input);
};
```

### Evaluator

Function type for item-level evaluators that assess individual task outputs.

```typescript { .api }
/**
 * Evaluator function for item-level evaluation
 *
 * Receives input, output, expected output, and metadata,
 * and returns evaluation results as Evaluation object(s).
 *
 * @param params - Parameters including input, output, expectedOutput, and metadata
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type Evaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: EvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type EvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original input data passed to the task.
   *
   * Use this for context-aware evaluations or input-output relationship analysis.
   */
  input: Input;

  /**
   * The output produced by the task.
   *
   * This is the actual result returned by your task function.
   */
  output: any;

  /**
   * The expected output for comparison (optional).
   *
   * This is the ground truth or expected result for the given input.
   */
  expectedOutput?: ExpectedOutput;

  /**
   * Metadata from the experiment item (optional).
   *
   * Available for metadata-aware evaluations such as per-category scoring.
   */
  metadata?: Metadata;
};
```

**Usage Examples:**

```typescript
import type { Evaluator, Evaluation } from '@langfuse/client';

// Simple exact match evaluator
const exactMatch: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Case-insensitive match with comment
const caseInsensitiveMatch: Evaluator = async ({ output, expectedOutput }) => {
  const match = output.toLowerCase() === expectedOutput?.toLowerCase();
  return {
    name: "case_insensitive_match",
    value: match ? 1 : 0,
    comment: match ? "Perfect match" : "No match"
  };
};

// Evaluator returning multiple scores
const comprehensiveEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  return [
    {
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "length_match",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    },
    {
      name: "similarity",
      value: calculateSimilarity(output, expectedOutput),
      comment: "Cosine similarity score"
    }
  ];
};

// Type-safe evaluator
const typedEvaluator: Evaluator<string, string> = async ({ input, output, expectedOutput }) => {
  // All parameters are typed
  return {
    name: "accuracy",
    value: output === expectedOutput ? 1 : 0,
    metadata: { input_length: input.length }
  };
};

// Evaluator using input context
const contextAwareEvaluator: Evaluator = async ({ input, output }) => {
  const isValid = validateOutput(output, input);
  return {
    name: "validity",
    value: isValid ? 1 : 0,
    comment: isValid ? "Output valid for input" : "Output invalid"
  };
};

// Evaluator with metadata
const categoryEvaluator: Evaluator<any, any, { category: string }> = async ({
  output,
  expectedOutput,
  metadata
}) => {
  const score = calculateScore(output, expectedOutput);
  return {
    name: "category_score",
    value: score,
    metadata: {
      category: metadata?.category,
      timestamp: new Date().toISOString()
    }
  };
};

// Evaluator with different data types
const numericEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const error = Math.abs(output - expectedOutput);
  return {
    name: "absolute_error",
    value: error,
    dataType: "numeric",
    comment: `Error: ${error.toFixed(2)}`
  };
};

// Boolean evaluator
const booleanEvaluator: Evaluator = async ({ output }) => {
  return {
    name: "is_valid",
    value: validateFormat(output),
    dataType: "boolean"
  };
};

// Evaluator with error handling
const robustEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    const score = complexCalculation(output, expectedOutput);
    return {
      name: "complex_score",
      value: score
    };
  } catch (error) {
    console.error("Evaluator failed:", error);
    throw error; // Will be caught and logged by experiment system
  }
};

// LLM-as-judge evaluator
const llmJudgeEvaluator: Evaluator = async ({ input, output, expectedOutput }) => {
  const judgmentPrompt = `
    Input: ${input}
    Expected: ${expectedOutput}
    Actual: ${output}

    Rate the quality from 0 to 1:
  `;

  const judgment = await llm.evaluate(judgmentPrompt);

  return {
    name: "llm_judgment",
    value: parseFloat(judgment),
    comment: `LLM evaluation of output quality`
  };
};
```
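
The `calculateSimilarity` helper referenced above is a placeholder. As one hedged stand-in, a token-overlap (Jaccard) similarity fits in a few lines; a production evaluator would more likely use embeddings or an n-gram metric:

```typescript
// Illustrative token-overlap (Jaccard) similarity in [0, 1].
// A stand-in sketch for the hypothetical calculateSimilarity helper;
// not a cosine similarity and not part of the SDK.
function tokenSimilarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (tokensA.size === 0 && tokensB.size === 0) return 1;

  let intersection = 0;
  for (const t of tokensA) {
    if (tokensB.has(t)) intersection++;
  }
  const union = new Set([...tokensA, ...tokensB]).size;
  return intersection / union;
}
```

Because the result is already in `[0, 1]`, it can be returned directly as an `Evaluation` value with `dataType: "numeric"`.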

### RunEvaluator

Function type for run-level evaluators that assess the entire experiment.

```typescript { .api }
/**
 * Evaluator function for run-level evaluation
 *
 * Receives all item results and performs aggregate analysis
 * across the entire experiment run.
 *
 * @param params - Parameters including all itemResults
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type RunEvaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: RunEvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type RunEvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Results from all processed experiment items.
   *
   * Each item contains the input, output, evaluations, and metadata from
   * processing a single data item. Use this for aggregate analysis,
   * statistical calculations, and cross-item comparisons.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];
};
```

**Usage Examples:**

```typescript
import type { RunEvaluator } from '@langfuse/client';

// Average score evaluator
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((a, b) => a + b, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy: ${(average * 100).toFixed(1)}%`
  };
};

// Multiple run-level metrics
const comprehensiveRunEvaluator: RunEvaluator = async ({ itemResults }) => {
  const accuracyScores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;
  const min = Math.min(...accuracyScores);
  const max = Math.max(...accuracyScores);
  const stdDev = calculateStdDev(accuracyScores);

  return [
    {
      name: "average_accuracy",
      value: average
    },
    {
      name: "min_accuracy",
      value: min
    },
    {
      name: "max_accuracy",
      value: max
    },
    {
      name: "std_dev_accuracy",
      value: stdDev,
      comment: "Standard deviation of accuracy scores"
    }
  ];
};

// Precision and recall
const precisionRecallEvaluator: RunEvaluator = async ({ itemResults }) => {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const result of itemResults) {
    if (result.output === "positive") {
      if (result.expectedOutput === "positive") {
        truePositives++;
      } else {
        falsePositives++;
      }
    } else if (result.expectedOutput === "positive") {
      falseNegatives++;
    }
  }

  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = 2 * (precision * recall) / (precision + recall);

  return [
    {
      name: "precision",
      value: precision,
      comment: `Precision: ${(precision * 100).toFixed(1)}%`
    },
    {
      name: "recall",
      value: recall,
      comment: `Recall: ${(recall * 100).toFixed(1)}%`
    },
    {
      name: "f1_score",
      value: f1,
      comment: `F1 Score: ${(f1 * 100).toFixed(1)}%`
    }
  ];
};

// Category-based analysis
const categoryAnalysisEvaluator: RunEvaluator<any, any, { category: string }> = async ({
  itemResults
}) => {
  const categories = new Map<string, number[]>();

  for (const result of itemResults) {
    const category = result.item.metadata?.category || "unknown";
    const accuracy = result.evaluations.find(e => e.name === "accuracy")?.value as number;

    if (!categories.has(category)) {
      categories.set(category, []);
    }
    categories.get(category)!.push(accuracy);
  }

  const evaluations: Evaluation[] = [];

  for (const [category, scores] of categories) {
    const average = scores.reduce((a, b) => a + b, 0) / scores.length;
    evaluations.push({
      name: `accuracy_${category}`,
      value: average,
      comment: `Average accuracy for ${category}: ${(average * 100).toFixed(1)}%`
    });
  }

  return evaluations;
};

// Percentile analysis
const percentileEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "score")
    .map(e => e.value as number)
    .sort((a, b) => a - b);

  const p50 = scores[Math.floor(scores.length * 0.5)];
  const p90 = scores[Math.floor(scores.length * 0.9)];
  const p95 = scores[Math.floor(scores.length * 0.95)];

  return [
    { name: "p50_score", value: p50, comment: "Median score" },
    { name: "p90_score", value: p90, comment: "90th percentile" },
    { name: "p95_score", value: p95, comment: "95th percentile" }
  ];
};

// Failure analysis
const failureAnalysisEvaluator: RunEvaluator = async ({ itemResults }) => {
  const failures = itemResults.filter(r => {
    const accuracy = r.evaluations.find(e => e.name === "accuracy")?.value;
    return accuracy === 0;
  });

  const failureRate = failures.length / itemResults.length;

  return {
    name: "failure_rate",
    value: failureRate,
    comment: `${failures.length} of ${itemResults.length} items failed (${(failureRate * 100).toFixed(1)}%)`
  };
};

// Cross-item consistency
const consistencyEvaluator: RunEvaluator = async ({ itemResults }) => {
  // Check if similar inputs produce similar outputs
  const consistency = analyzeConsistency(itemResults);

  return {
    name: "consistency_score",
    value: consistency,
    comment: "Consistency across similar inputs"
  };
};
```
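
The `calculateStdDev` helper used in the aggregate example is likewise a placeholder; a minimal population standard deviation might look like this:

```typescript
// Population standard deviation; a minimal sketch of the hypothetical
// calculateStdDev helper referenced above.
function calculateStdDev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```

Use the sample variant (dividing by `n - 1`) instead if the experiment items are treated as a sample of a larger population.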

### Evaluation

Result type for evaluations returned by evaluator functions.

```typescript { .api }
/**
 * Evaluation result from an evaluator
 *
 * Contains the score name, value, and optional metadata/comment.
 * Defined as a subset of ScoreBody:
 *
 *   type Evaluation = Pick<
 *     ScoreBody,
 *     "name" | "value" | "comment" | "metadata" | "dataType"
 *   >;
 *
 * Expanded, the type has the following shape:
 */
interface Evaluation {
  /**
   * Name of the evaluation metric
   *
   * Should be descriptive and unique within the evaluator set.
   */
  name: string;

  /**
   * Numeric or boolean value of the evaluation
   *
   * Typically 0-1 for accuracy/similarity scores, but can be any numeric value.
   */
  value: number | boolean;

  /**
   * Optional human-readable comment about the evaluation
   *
   * Useful for explaining the score or providing context.
   */
  comment?: string;

  /**
   * Optional metadata about the evaluation
   *
   * Store additional context or debugging information.
   */
  metadata?: Record<string, any>;

  /**
   * Optional data type specification
   *
   * Specifies how the value should be interpreted.
   */
  dataType?: "numeric" | "boolean" | "categorical";
}
```
1133
1134
**Usage Examples:**

```typescript
import type { Evaluation } from '@langfuse/client';

// Simple numeric evaluation
const simpleEval: Evaluation = {
  name: "accuracy",
  value: 0.85
};

// Boolean evaluation
const booleanEval: Evaluation = {
  name: "passed",
  value: true,
  dataType: "boolean"
};

// Evaluation with comment
const commentedEval: Evaluation = {
  name: "similarity",
  value: 0.92,
  comment: "High similarity between output and expected"
};

// Evaluation with metadata
const metadataEval: Evaluation = {
  name: "response_quality",
  value: 0.88,
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    tokens: 150
  },
  comment: "Quality assessment using LLM judge"
};

// Multiple evaluation types
const multiEval: Evaluation[] = [
  {
    name: "exact_match",
    value: 1,
    dataType: "boolean"
  },
  {
    name: "similarity",
    value: 0.95,
    dataType: "numeric",
    comment: "Cosine similarity"
  },
  {
    name: "category",
    value: 0,
    dataType: "categorical",
    metadata: { predicted: "A", actual: "B" }
  }
];
```

### ExperimentResult

Complete result structure returned by the run() method.

```typescript { .api }
/**
 * Complete result of an experiment execution
 *
 * Contains all results from processing the experiment data,
 * including individual item results, run-level evaluations,
 * and utilities for result visualization.
 */
type ExperimentResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The experiment run name.
   *
   * Either the provided runName parameter or a generated name (experiment name + timestamp).
   */
  runName: string;

  /**
   * ID of the dataset run in Langfuse (only for experiments on Langfuse datasets).
   *
   * Use this ID to access the dataset run via the Langfuse API or UI.
   */
  datasetRunId?: string;

  /**
   * URL to the dataset run in the Langfuse UI (only for experiments on Langfuse datasets).
   *
   * Direct link to view the complete dataset run in the Langfuse web interface.
   */
  datasetRunUrl?: string;

  /**
   * Results from processing each individual data item.
   *
   * Contains the complete results for every item in your experiment data.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];

  /**
   * Results from run-level evaluators that assessed the entire experiment.
   *
   * Contains aggregate evaluations that analyze the complete experiment.
   */
  runEvaluations: Evaluation[];

  /**
   * Function to format experiment results in a human-readable format.
   *
   * @param options - Formatting options
   * @param options.includeItemResults - Whether to include individual item details (default: false)
   * @returns Promise resolving to formatted string representation
   */
  format: (options?: { includeItemResults?: boolean }) => Promise<string>;
};
```

**Usage Examples:**

```typescript
import fs from 'node:fs/promises';
import type { ExperimentResult } from '@langfuse/client';

// Run experiment and access results
const result: ExperimentResult = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageEvaluator]
});

// Access run name
console.log(`Run name: ${result.runName}`);
// "Test Experiment - 2024-01-15T10:30:00.000Z"

// Access individual item results
console.log(`Processed ${result.itemResults.length} items`);
for (const itemResult of result.itemResults) {
  console.log(`Input: ${itemResult.input}`);
  console.log(`Output: ${itemResult.output}`);
  console.log(`Evaluations:`, itemResult.evaluations);
}

// Access run-level evaluations
console.log(`Run evaluations:`, result.runEvaluations);
const avgAccuracy = result.runEvaluations.find(e => e.name === "average_accuracy");
console.log(`Average accuracy: ${avgAccuracy?.value}`);

// Format results (summary only)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (10 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
10 items
Evaluations:
• accuracy

Average Scores:
• accuracy: 0.850

Run Evaluations:
• average_accuracy: 0.850
  💬 Average accuracy: 85.0%
*/

// Format with detailed results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is AI?
   Expected: Artificial Intelligence
   Actual: Artificial Intelligence
   Scores:
   • accuracy: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
...

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
...
*/

// Access dataset run information (if applicable)
if (result.datasetRunId) {
  console.log(`Dataset run ID: ${result.datasetRunId}`);
  console.log(`View in UI: ${result.datasetRunUrl}`);
}

// Calculate custom metrics from results
const successRate = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 1)
).length / result.itemResults.length;
console.log(`Success rate: ${(successRate * 100).toFixed(1)}%`);

// Export results for further analysis
const exportData = result.itemResults.map(r => ({
  input: r.input,
  output: r.output,
  expectedOutput: r.expectedOutput,
  scores: Object.fromEntries(
    r.evaluations.map(e => [e.name, e.value])
  )
}));
await fs.writeFile('results.json', JSON.stringify(exportData, null, 2));
```

### ExperimentItemResult

Result structure for individual item processing within an experiment.

```typescript { .api }
/**
 * Result from processing one experiment item
 *
 * Contains the input, output, evaluations, and trace information
 * for a single data item.
 */
type ExperimentItemResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original experiment or dataset item that was processed.
   *
   * Contains the complete original item data.
   */
  item: ExperimentItem<Input, ExpectedOutput, Metadata>;

  /**
   * The input data (extracted from item for convenience)
   */
  input?: Input;

  /**
   * The expected output (extracted from item for convenience)
   */
  expectedOutput?: ExpectedOutput;

  /**
   * The actual output produced by the task.
   *
   * This is the result returned by your task function for this specific input.
   */
  output: any;

  /**
   * Results from all evaluators that ran on this item.
   *
   * Contains evaluation scores, comments, and metadata from each evaluator.
   */
  evaluations: Evaluation[];

  /**
   * Langfuse trace ID for this item's execution.
   *
   * Use this ID to view detailed execution traces in the Langfuse UI.
   */
  traceId?: string;

  /**
   * Dataset run ID if this item was part of a Langfuse dataset.
   *
   * Links this item result to a specific dataset run.
   */
  datasetRunId?: string;
};
```

**Usage Examples:**

```typescript
import type { ExperimentItemResult } from '@langfuse/client';

// Process experiment results
const result = await langfuse.experiment.run(config);

for (const itemResult of result.itemResults) {
  // itemResult is typed as ExperimentItemResult
  // Access item data
  console.log(`Processing item:`, itemResult.item);
  console.log(`Input:`, itemResult.input);
  console.log(`Expected:`, itemResult.expectedOutput);
  console.log(`Actual:`, itemResult.output);

  // Access evaluations
  for (const evaluation of itemResult.evaluations) {
    console.log(`${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`  Comment: ${evaluation.comment}`);
    }
  }

  // Access trace information
  if (itemResult.traceId) {
    const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
    console.log(`View trace: ${traceUrl}`);
  }

  // Access dataset run information
  if (itemResult.datasetRunId) {
    console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  }
}

// Filter failed items
const failedItems = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 0)
);
console.log(`Failed items: ${failedItems.length}`);

// Group by score
const highScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) >= 0.8)
);
const lowScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) < 0.5)
);

// Analyze patterns
const errorPatterns = failedItems.map(r => ({
  input: r.input,
  output: r.output,
  expected: r.expectedOutput
}));
console.log("Error patterns:", errorPatterns);
```

## Integration with AutoEvals

Create Langfuse-compatible evaluators from AutoEvals library evaluators.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter handles parameter mapping and result formatting automatically.
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 *
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;
```

**Usage Examples:**

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Basic AutoEvals integration
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

await langfuse.experiment.run({
  name: "AutoEvals Integration Test",
  data: myDataset,
  task: myTask,
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

// With additional parameters
const customFactualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' } // Additional params for AutoEvals
);

await langfuse.experiment.run({
  name: "Factuality Test",
  data: testData,
  task: myTask,
  evaluators: [customFactualityEvaluator]
});

// Multiple AutoEvals evaluators
const closedQAEvaluator = createEvaluatorFromAutoevals(ClosedQA, {
  model: 'gpt-4',
  useCoT: true
});

const comprehensiveEvaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein),
  closedQAEvaluator
];

await langfuse.experiment.run({
  name: "Comprehensive Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: comprehensiveEvaluators
});

// Mixing AutoEvals and custom evaluators
await langfuse.experiment.run({
  name: "Mixed Evaluators",
  data: dataset,
  task: task,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluator
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});
```
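
Under the hood the adapter's job is just parameter and result mapping. The sketch below is an illustrative reimplementation, not the SDK's actual code: it assumes AutoEvals scorers accept `{ input, output, expected, ...params }` and return `{ name, score, metadata? }`, and `exactMatchScorer` is a toy stand-in for a real AutoEvals evaluator.

```typescript
// Score shape returned by AutoEvals scorers (assumption for this sketch)
type AutoevalsScore = { name: string; score: number | null; metadata?: Record<string, any> };

// Langfuse-style evaluator arguments and result
type EvaluatorArgs = { input: any; output: any; expectedOutput?: any };
type LangfuseEvaluation = { name: string; value: number; metadata?: Record<string, any> };

// Hypothetical adapter mirroring what createEvaluatorFromAutoevals does
function adaptAutoevals(
  scorer: (args: Record<string, any>) => AutoevalsScore | Promise<AutoevalsScore>,
  params: Record<string, any> = {}
) {
  return async ({ input, output, expectedOutput }: EvaluatorArgs): Promise<LangfuseEvaluation> => {
    // Parameter mapping: Langfuse `expectedOutput` becomes AutoEvals `expected`
    const score = await scorer({ input, output, expected: expectedOutput, ...params });

    // Result mapping: AutoEvals `score` becomes Langfuse `value` (null treated as 0)
    return { name: score.name, value: score.score ?? 0, metadata: score.metadata };
  };
}

// Toy scorer standing in for an AutoEvals evaluator
const exactMatchScorer = ({ output, expected }: Record<string, any>): AutoevalsScore => ({
  name: "ExactMatch",
  score: output === expected ? 1 : 0
});

const exactMatchEvaluator = adaptAutoevals(exactMatchScorer);
```

An evaluator produced this way plugs into `evaluators` like any hand-written one.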

## Advanced Usage

### Type Safety with Generics

Use TypeScript generics for full type safety across the experiment pipeline.

```typescript
// Define your types
interface QuestionInput {
  question: string;
  context: string[];
}

interface AnswerOutput {
  answer: string;
  confidence: number;
  sources: string[];
}

interface ItemMetadata {
  category: "science" | "history" | "literature";
  difficulty: number;
  tags: string[];
}

// Type-safe experiment configuration
const result = await langfuse.experiment.run<
  QuestionInput,
  AnswerOutput,
  ItemMetadata
>({
  name: "Typed QA Experiment",
  data: [
    {
      input: {
        question: "What is photosynthesis?",
        context: ["Photosynthesis is the process..."]
      },
      expectedOutput: {
        answer: "A process where plants convert light to energy",
        confidence: 0.9,
        sources: ["biology textbook"]
      },
      metadata: {
        category: "science",
        difficulty: 5,
        tags: ["biology", "plants"]
      }
    }
  ],
  task: async ({ input, metadata }) => {
    // input is typed as QuestionInput
    // metadata is typed as ItemMetadata
    const { question, context } = input;
    const difficulty = metadata?.difficulty || 5;

    return await qaModel(question, context, difficulty);
    // Return type should match AnswerOutput
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      // input: QuestionInput
      // output: any (task output)
      // expectedOutput: AnswerOutput | undefined

      return {
        name: "answer_quality",
        value: output.confidence
      };
    }
  ]
});

// Result is typed as ExperimentResult<QuestionInput, AnswerOutput, ItemMetadata>
for (const itemResult of result.itemResults) {
  // itemResult.input is QuestionInput | undefined
  // itemResult.output is any
  // itemResult.expectedOutput is AnswerOutput | undefined
  console.log(itemResult.input?.question);
  console.log(itemResult.expectedOutput?.confidence);
}
```

### Parallel vs Sequential Execution

Control experiment execution parallelism with maxConcurrency.

```typescript
// Fully parallel (default)
const parallelResult = await langfuse.experiment.run({
  name: "Parallel Execution",
  data: largeDataset,
  task: fastTask,
  evaluators: [evaluator]
  // maxConcurrency: Infinity (default)
});

// Sequential execution
const sequentialResult = await langfuse.experiment.run({
  name: "Sequential Execution",
  data: dataset,
  task: task,
  maxConcurrency: 1 // Process one item at a time
});

// Controlled parallelism
const controlledResult = await langfuse.experiment.run({
  name: "Rate Limited Execution",
  data: dataset,
  task: expensiveAPICall,
  maxConcurrency: 5 // Max 5 concurrent API calls
});

// Batched processing
const batchSize = 10;
const batchedResult = await langfuse.experiment.run({
  name: "Batched Processing",
  data: veryLargeDataset,
  task: task,
  maxConcurrency: batchSize // Process in batches of 10
});
```

### Dataset Integration

Run experiments directly on Langfuse datasets with automatic linking.

```typescript
// Get dataset
const dataset = await langfuse.dataset.get("my-dataset");

// Run experiment on dataset (automatic data parameter)
const result = await dataset.runExperiment({
  name: "GPT-4 Evaluation",
  task: async ({ input }) => {
    // Process dataset item
    return await model(input);
  },
  evaluators: [evaluator],
  runEvaluators: [averageEvaluator]
});

// Results are automatically linked to dataset run
console.log(`Dataset run ID: ${result.datasetRunId}`);
console.log(`View in UI: ${result.datasetRunUrl}`);

// Each item result is linked
for (const itemResult of result.itemResults) {
  console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  console.log(`Trace ID: ${itemResult.traceId}`);
}

// Compare multiple runs on same dataset
const run1 = await dataset.runExperiment({
  name: "Model A",
  runName: "model-a-run-1",
  task: modelA,
  evaluators: [evaluator]
});

const run2 = await dataset.runExperiment({
  name: "Model B",
  runName: "model-b-run-1",
  task: modelB,
  evaluators: [evaluator]
});

// Compare results
console.log("Model A avg:", run1.runEvaluations[0].value);
console.log("Model B avg:", run2.runEvaluations[0].value);
```
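
To turn side-by-side logging like the above into an actual comparison, a small helper can diff the two runs' run-level scores by name. `diffRunScores` and the `RunLike` shape below are hypothetical helpers, assuming numeric evaluation values:

```typescript
// Minimal shape of an experiment result for comparison purposes
type RunLike = { runEvaluations: { name: string; value: number | boolean }[] };

// Returns, per shared metric name, how much run `a` outscored run `b`
function diffRunScores(a: RunLike, b: RunLike): Record<string, number> {
  const diffs: Record<string, number> = {};
  for (const evalA of a.runEvaluations) {
    const evalB = b.runEvaluations.find(e => e.name === evalA.name);
    if (evalB !== undefined) {
      // Positive means run `a` scored higher on this metric
      diffs[evalA.name] = Number(evalA.value) - Number(evalB.value);
    }
  }
  return diffs;
}
```

Metrics present in only one run are skipped, so the helper works even when the two runs used different evaluator sets.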

### Result Formatting

Use the format() function to generate human-readable result summaries.

```typescript
import fs from 'node:fs/promises';

const result = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: task,
  evaluators: [evaluator],
  runEvaluators: [runEvaluator]
});

// Format summary (default)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (50 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
50 items
Evaluations:
• accuracy
• f1_score

Average Scores:
• accuracy: 0.850
• f1_score: 0.823

Run Evaluations:
• average_accuracy: 0.850
  💬 Average accuracy: 85.0%
• precision: 0.875
  💬 Precision: 87.5%

🔗 Dataset Run:
https://cloud.langfuse.com/project/xxx/datasets/yyy/runs/zzz
*/

// Format with detailed item results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is the capital of France?
   Expected: Paris
   Actual: Paris
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Dataset Item:
   https://cloud.langfuse.com/project/xxx/datasets/yyy/items/123

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
   Input: What is 2+2?
   Expected: 4
   Actual: 4
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/def456

... (50 items total)

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
... (summary as above)
*/

// Save formatted results to file
const formatted = await result.format({ includeItemResults: true });
await fs.writeFile('experiment-results.txt', formatted);

// Use in CI/CD: fail the pipeline when a run-level score drops below a threshold
if (result.runEvaluations.some(e => e.name === "average_accuracy" && (e.value as number) < 0.8)) {
  throw new Error("Experiment failed: accuracy below threshold");
}
```

### Error Handling Strategies

Implement robust error handling for production experiments.

```typescript
import type { ExperimentTask, Evaluator } from '@langfuse/client';

// Task with retry logic
const resilientTask: ExperimentTask = async ({ input }) => {
  let lastError;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await apiCall(input);
    } catch (error) {
      lastError = error;
      await new Promise(resolve => setTimeout(resolve, 1000 * (attempt + 1)));
    }
  }
  throw lastError;
};

// Task with fallback
const fallbackTask: ExperimentTask = async ({ input }) => {
  try {
    return await primaryModel(input);
  } catch (error) {
    console.warn("Primary model failed, using fallback");
    return await fallbackModel(input);
  }
};

// Task with timeout
const timeoutTask: ExperimentTask = async ({ input }) => {
  return await Promise.race([
    modelCall(input),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), 30000)
    )
  ]);
};

// Evaluator with validation
const validatingEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    if (typeof output !== 'string' || typeof expectedOutput !== 'string') {
      throw new Error("Invalid output types");
    }

    return {
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    };
  } catch (error: any) {
    console.error("Evaluator validation failed:", error);
    return {
      name: "accuracy",
      value: 0,
      comment: `Validation error: ${error.message}`
    };
  }
};

// Run experiment with error tracking
const result = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: resilientTask,
  evaluators: [validatingEvaluator]
});

// Check for failures (failed items are skipped, so counts can differ)
const successCount = result.itemResults.length;
const totalCount = testData.length;
const failureCount = totalCount - successCount;

if (failureCount > 0) {
  console.warn(`${failureCount} items failed during experiment`);
}
```

## Best Practices

### Experiment Organization

```typescript
// ✅ Good: Descriptive naming
await langfuse.experiment.run({
  name: "GPT-4 vs GPT-3.5 on QA Dataset",
  runName: "gpt-4-2024-01-15-temp-0.7",
  description: "Comparing model performance with temperature 0.7",
  metadata: {
    model_version: "gpt-4-0125-preview",
    temperature: 0.7,
    dataset_version: "v2.1"
  }
});

// ❌ Bad: Generic naming
await langfuse.experiment.run({
  name: "Test",
  data: data,
  task: task
});
```

### Evaluator Design

```typescript
// ✅ Good: Multiple focused evaluators
const evaluators = [
  // Simple binary check
  async ({ output, expectedOutput }) => ({
    name: "exact_match",
    value: output === expectedOutput ? 1 : 0
  }),
  // Similarity score
  async ({ output, expectedOutput }) => ({
    name: "cosine_similarity",
    value: calculateCosineSimilarity(output, expectedOutput)
  }),
  // Format validation
  async ({ output }) => ({
    name: "format_valid",
    value: validateFormat(output) ? 1 : 0
  })
];

// ❌ Bad: One complex evaluator doing everything
const badEvaluator = async ({ output, expectedOutput }) => ({
  name: "score",
  value: complexCalculation(output, expectedOutput)
  // Unclear what this represents
});
```

### Concurrency Management

```typescript
// ✅ Good: Appropriate concurrency limits
await langfuse.experiment.run({
  name: "Rate-Limited API Experiment",
  data: largeDataset,
  task: expensiveAPICall,
  maxConcurrency: 5, // Respect API rate limits
  evaluators: [evaluator]
});

// ✅ Good: High concurrency for local operations
await langfuse.experiment.run({
  name: "Local Model Experiment",
  data: dataset,
  task: localModelInference,
  maxConcurrency: 50, // Local model can handle high concurrency
  evaluators: [evaluator]
});

// ❌ Bad: No concurrency control for rate-limited API
await langfuse.experiment.run({
  name: "Uncontrolled Experiment",
  data: largeDataset,
  task: rateLimitedAPI
  // Will likely hit rate limits
});
```

### Type Safety

```typescript
// ✅ Good: Explicit types
interface Input {
  question: string;
  context: string;
}

interface Output {
  answer: string;
  confidence: number;
}

const result = await langfuse.experiment.run<Input, Output>({
  name: "Typed Experiment",
  data: [
    {
      input: { question: "...", context: "..." },
      expectedOutput: { answer: "...", confidence: 0.9 }
    }
  ],
  task: async ({ input }) => {
    // input is typed as Input
    return await processTyped(input);
  }
});

// ❌ Bad: Implicit any types
const untypedResult = await langfuse.experiment.run({
  name: "Untyped Experiment",
  data: [{ input: someData }],
  task: async ({ input }) => {
    // input is any
    return await process(input);
  }
});
```

### Result Analysis

```typescript
// ✅ Good: Use run evaluators for aggregates
await langfuse.experiment.run({
  name: "Analysis Experiment",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      // Calculate aggregate metrics
      const avg = calculateAverage(itemResults);
      const stdDev = calculateStdDev(itemResults);

      return [
        { name: "average", value: avg },
        { name: "std_dev", value: stdDev }
      ];
    }
  ]
});

// ❌ Bad: Manual aggregation after experiment
const result = await langfuse.experiment.run({
  name: "Manual Analysis",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator]
});

// Manually calculating aggregates (should use run evaluators)
const scores = result.itemResults.map(r => r.evaluations[0].value);
const avg = scores.reduce((a, b) => a + b) / scores.length;
```
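
The `calculateAverage` and `calculateStdDev` helpers in the good example are left abstract; for a single metric they might look like this (`ItemLike` is a local stand-in for `ExperimentItemResult`):

```typescript
// Minimal item shape for aggregation
type ItemLike = { evaluations: { name: string; value: number | boolean }[] };

// Run-evaluator-style aggregate: mean and population standard deviation of one metric
function aggregateMetric(itemResults: ItemLike[], metric: string) {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === metric)
    .map(e => Number(e.value));

  // No items produced this metric: return no evaluations rather than NaN
  if (scores.length === 0) return [];

  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - avg) ** 2, 0) / scores.length;

  return [
    { name: `${metric}_avg`, value: avg },
    { name: `${metric}_std_dev`, value: Math.sqrt(variance) }
  ];
}
```

Returning an array of evaluations matches the run-evaluator contract shown earlier, so the helper can be used directly inside `runEvaluators`.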

## Performance Considerations

### Batching and Concurrency

- Use `maxConcurrency` to control parallelism and avoid overwhelming external APIs
- Default `maxConcurrency: Infinity` is suitable for local operations
- Set `maxConcurrency: 1` for sequential processing when order matters
- Typical values: 3-10 for API calls, 20-100 for local operations
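
How a concurrency cap behaves can be illustrated with a small worker-pool sketch (an illustration of the idea, not the SDK's actual implementation):

```typescript
// Runs `fn` over `items`, never exceeding `maxConcurrency` in-flight calls,
// and preserves input order in the results array.
async function mapWithConcurrency<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  maxConcurrency: number
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index
  const workers = Array.from({ length: Math.min(maxConcurrency, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });

  await Promise.all(workers);
  return results;
}
```

With `maxConcurrency: 1` this degrades to strictly sequential processing, matching the bullet above; with `Infinity` every item starts at once.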

### Memory Management

- Large datasets are processed in batches based on `maxConcurrency`
- Each batch is processed completely before moving to the next
- Failed items are logged and skipped, not stored in memory
- Consider breaking very large experiments into multiple smaller runs
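
Splitting an oversized experiment into smaller runs only needs a generic slicing helper; the loop in the comment mirrors the `experiment.run` API shown earlier (names like `veryLargeDataset` are placeholders):

```typescript
// Splits an array into consecutive slices of at most `size` elements
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Each slice can then become its own run, e.g.:
// for (const [i, part] of chunk(veryLargeDataset, 500).entries()) {
//   await langfuse.experiment.run({ name: "Big Eval", runName: `big-eval-part-${i}`, data: part, task, evaluators });
// }
```

Using a shared `name` with distinct `runName` values keeps the parts grouped under one experiment for later comparison.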

### Tracing Overhead

- OpenTelemetry tracing adds minimal overhead (~1-5ms per item)
- Traces are sent asynchronously and don't block experiment execution
- Disable tracing for maximum performance (though not recommended)
- Use `flush()` to ensure all traces are sent before shutdown

### Evaluator Performance

- Item-level evaluators run in parallel with task execution
- Failed evaluators don't block other evaluators
- LLM-as-judge evaluators can be slow; use `maxConcurrency` to control them
- Run-level evaluators execute sequentially after all items complete
2095