# Experiment Execution

The Experiment Execution system provides a comprehensive framework for running experiments that test models or tasks against datasets, with support for automatic evaluation, scoring, tracing, and result analysis. It enables systematic testing, comparison, and evaluation of AI models and prompts.

## Capabilities

### Run Experiment

Execute an experiment by running a task on each data item and evaluating the results with full tracing integration.

```typescript { .api }
/**
 * Executes an experiment by running a task on each data item and evaluating the results
 *
 * This method orchestrates the complete experiment lifecycle:
 * 1. Executes the task function on each data item with proper tracing
 * 2. Runs item-level evaluators on each task output
 * 3. Executes run-level evaluators on the complete result set
 * 4. Links results to dataset runs (for Langfuse datasets)
 * 5. Stores all scores and traces in Langfuse
 *
 * @param config - The experiment configuration
 * @returns Promise that resolves to experiment results including itemResults, runEvaluations, and format function
 */
run<Input = any, ExpectedOutput = any, Metadata extends Record<string, any> = Record<string, any>>(
  config: ExperimentParams<Input, ExpectedOutput, Metadata>
): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Basic experiment with custom data
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  description: "Testing model knowledge of world capitals",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});

console.log(await result.format());

// Experiment on Langfuse dataset
const dataset = await langfuse.dataset.get("qa-dataset");

const datasetResult = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Testing GPT-4 on our QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      const exact = output === expectedOutput;
      const caseInsensitive = output.toLowerCase() === expectedOutput.toLowerCase();
      return {
        name: "accuracy",
        value: caseInsensitive ? 1 : 0,
        comment: exact ? "Exact match" : caseInsensitive ? "Case-insensitive match" : "No match"
      };
    }
  ]
});

// Multiple evaluators
const multiEvalResult = await langfuse.experiment.run({
  name: "Translation Quality Test",
  data: [
    { input: "Hello world", expectedOutput: "Hola mundo" },
    { input: "Good morning", expectedOutput: "Buenos días" }
  ],
  task: async ({ input }) => translateText(input, 'es'),
  evaluators: [
    // Evaluator 1: Exact match
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    }),
    // Evaluator 2: BLEU score
    async ({ output, expectedOutput }) => ({
      name: "bleu_score",
      value: calculateBleuScore(output, expectedOutput),
      comment: "Translation quality metric"
    }),
    // Evaluator 3: Length similarity
    async ({ output, expectedOutput }) => ({
      name: "length_similarity",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    })
  ]
});

// Run-level evaluators for aggregate analysis
const aggregateResult = await langfuse.experiment.run({
  name: "Sentiment Classification",
  data: sentimentDataset,
  task: classifySentiment,
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    })
  ],
  runEvaluators: [
    // Average accuracy across all items
    async ({ itemResults }) => {
      const accuracyScores = itemResults
        .flatMap(r => r.evaluations)
        .filter(e => e.name === "accuracy")
        .map(e => e.value as number);

      const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;

      return {
        name: "average_accuracy",
        value: average,
        comment: `Overall accuracy: ${(average * 100).toFixed(1)}%`
      };
    },
    // Precision calculation
    async ({ itemResults }) => {
      let truePositives = 0;
      let falsePositives = 0;

      for (const result of itemResults) {
        if (result.output === "positive") {
          if (result.expectedOutput === "positive") {
            truePositives++;
          } else {
            falsePositives++;
          }
        }
      }

      const precision = truePositives / (truePositives + falsePositives);

      return {
        name: "precision",
        value: precision,
        comment: `Precision for positive class: ${(precision * 100).toFixed(1)}%`
      };
    }
  ]
});

// Concurrency control with maxConcurrency
const largeScaleResult = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  description: "Processing 1000 items with rate limiting",
  data: largeDataset,
  task: expensiveModelCall,
  maxConcurrency: 5, // Process max 5 items simultaneously
  evaluators: [accuracyEvaluator]
});

// Custom run name
const customRunResult = await langfuse.experiment.run({
  name: "Model Comparison",
  runName: "gpt-4-turbo-2024-01-15",
  description: "Testing latest GPT-4 Turbo model",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// With metadata
const metadataResult = await langfuse.experiment.run({
  name: "Parameter Sweep",
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    max_tokens: 1000,
    experiment_version: "v2.1"
  },
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Formatting results
const formattedResult = await langfuse.experiment.run({
  name: "Test Run",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Format summary only (default)
console.log(await formattedResult.format());

// Format with detailed item results
console.log(await formattedResult.format({ includeItemResults: true }));

// Access raw results
console.log(`Processed ${formattedResult.itemResults.length} items`);
console.log(`Run evaluations:`, formattedResult.runEvaluations);
console.log(`Dataset run URL:`, formattedResult.datasetRunUrl);
```

**OpenTelemetry Integration:**

The experiment system automatically integrates with OpenTelemetry for distributed tracing:

```typescript
import { LangfuseClient } from '@langfuse/client';

// Ensure OpenTelemetry is configured
const langfuse = new LangfuseClient();

// Experiments automatically create traces for each task execution
const result = await langfuse.experiment.run({
  name: "Traced Experiment",
  data: testData,
  task: async ({ input }) => {
    // This task execution is automatically wrapped in a trace
    // with name "experiment-item-run"
    const output = await processInput(input);
    return output;
  },
  evaluators: [myEvaluator]
});

// Each item result includes trace information
for (const itemResult of result.itemResults) {
  console.log(`Trace ID: ${itemResult.traceId}`);
  const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
  console.log(`View trace: ${traceUrl}`);
}

// Warning if OpenTelemetry is not set up
// The system will log:
// "OpenTelemetry has not been set up. Traces will not be sent to Langfuse."
```
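
Exporting these traces requires an OpenTelemetry provider with a Langfuse span processor registered. A minimal setup sketch, assuming the `@langfuse/otel` package's `LangfuseSpanProcessor` and the standard OpenTelemetry Node SDK (verify names against your installed versions):

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

// Register the Langfuse span processor so experiment spans are exported.
// Credentials are typically read from the LANGFUSE_PUBLIC_KEY,
// LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL environment variables.
const sdk = new NodeSDK({
  spanProcessors: [new LangfuseSpanProcessor()],
});

sdk.start();
```

Run this initialization before constructing `LangfuseClient`, so experiment task executions have an active tracer provider to attach to.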

**Error Handling:**

```typescript
// Task errors are caught and logged
const resilientResult = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: async ({ input }) => {
    try {
      return await riskyOperation(input);
    } catch (error) {
      // Task errors are caught, logged, and the item is skipped
      throw error;
    }
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      try {
        return {
          name: "score",
          value: calculateScore(output, expectedOutput)
        };
      } catch (error) {
        // Evaluator errors are caught and logged
        // Other evaluators continue to run
        throw error;
      }
    }
  ]
});

// Result contains only successfully processed items
console.log(`Successfully processed: ${resilientResult.itemResults.length} items`);

// Run evaluators also handle errors gracefully
const robustResult = await langfuse.experiment.run({
  name: "Robust Experiment",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      try {
        return {
          name: "aggregate_metric",
          value: calculateAggregate(itemResults)
        };
      } catch (error) {
        // Run evaluator errors are caught and logged
        // Other run evaluators continue to run
        throw error;
      }
    }
  ]
});
```

## Type Definitions

### ExperimentParams

Configuration parameters for experiment execution.

```typescript { .api }
type ExperimentParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Human-readable name for the experiment.
   *
   * This name will appear in the Langfuse UI and experiment results.
   * Choose a descriptive name that identifies the experiment's purpose.
   */
  name: string;

  /**
   * Optional exact name for the experiment run.
   *
   * If provided, this will be used as the exact dataset run name if the data
   * contains Langfuse dataset items. If not provided, this will default to
   * the experiment name appended with an ISO timestamp.
   */
  runName?: string;

  /**
   * Optional description explaining the experiment's purpose.
   *
   * Provide context about what you're testing, methodology, or goals.
   * This helps with experiment tracking and result interpretation.
   */
  description?: string;

  /**
   * Optional metadata to attach to the experiment run.
   *
   * Store additional context like model versions, hyperparameters,
   * or any other relevant information for analysis and comparison.
   */
  metadata?: Record<string, any>;

  /**
   * Array of data items to process.
   *
   * Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse.
   * Each item should contain input data and optionally expected output.
   */
  data: ExperimentItem<Input, ExpectedOutput, Metadata>[];

  /**
   * The task function to execute on each data item.
   *
   * This function receives input data and produces output that will be evaluated.
   * It should encapsulate the model or system being tested.
   */
  task: ExperimentTask<Input, ExpectedOutput, Metadata>;

  /**
   * Optional array of evaluator functions to assess each item's output.
   *
   * Each evaluator receives input, output, and expected output (if available)
   * and returns evaluation results. Multiple evaluators enable comprehensive assessment.
   */
  evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Optional array of run-level evaluators to assess the entire experiment.
   *
   * These evaluators receive all item results and can perform aggregate analysis
   * like calculating averages, detecting patterns, or statistical analysis.
   */
  runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Maximum number of concurrent task executions (default: Infinity).
   *
   * Controls parallelism to manage resource usage and API rate limits.
   * Set lower values for expensive operations or rate-limited services.
   */
  maxConcurrency?: number;
};
```

**Usage Examples:**

```typescript
import type { ExperimentParams } from '@langfuse/client';

// Type-safe experiment configuration
const config: ExperimentParams<string, string> = {
  name: "Capital Cities",
  description: "Testing geography knowledge",
  metadata: {
    model: "gpt-4",
    version: "v1.0"
  },
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" }
  ],
  task: async ({ input }) => {
    return await getCapital(input);
  },
  evaluators: [exactMatchEvaluator],
  runEvaluators: [averageScoreEvaluator],
  maxConcurrency: 3
};

await langfuse.experiment.run(config);

// Generic types for complex data
interface CustomInput {
  question: string;
  context: string[];
}

interface CustomOutput {
  answer: string;
  confidence: number;
}

interface CustomMetadata {
  category: string;
  difficulty: "easy" | "medium" | "hard";
}

const typedConfig: ExperimentParams<CustomInput, CustomOutput, CustomMetadata> = {
  name: "QA with Context",
  data: [
    {
      input: {
        question: "What is AI?",
        context: ["AI stands for Artificial Intelligence"]
      },
      expectedOutput: {
        answer: "Artificial Intelligence",
        confidence: 0.95
      },
      metadata: {
        category: "technology",
        difficulty: "easy"
      }
    }
  ],
  task: async ({ input }) => {
    // Type-safe input and output
    return await qaModel(input.question, input.context);
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      return {
        name: "accuracy",
        value: output.answer === expectedOutput?.answer ? 1 : 0
      };
    }
  ]
};
```
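
To illustrate what `maxConcurrency` configures, here is a hedged sketch of a bounded-parallelism helper (a hypothetical `mapWithConcurrency`, not part of the SDK): it runs at most `limit` tasks at once while preserving result order, which is essentially the behavior the option controls.

```typescript
// Hypothetical helper illustrating bounded parallelism, as configured by
// maxConcurrency. Not part of the Langfuse SDK.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  // Start at most `limit` workers; results keep the input order.
  const workers = Array.from(
    { length: Math.max(1, Math.min(limit, items.length)) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With `limit: 5` this mirrors `maxConcurrency: 5` in the earlier example: a sixth task only starts once one of the five in-flight tasks settles.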

### ExperimentTask

Function type for experiment tasks that process input data and return output.

```typescript { .api }
/**
 * Function type for experiment tasks that process input data and return output
 *
 * The task function is the core component being tested in an experiment.
 * It receives either an ExperimentItem or DatasetItem and produces output
 * that will be evaluated.
 *
 * @param params - Either an ExperimentItem or DatasetItem containing input and metadata
 * @returns Promise resolving to the task's output (any type)
 */
type ExperimentTask<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: ExperimentTaskParams<Input, ExpectedOutput, Metadata>
) => Promise<any>;

type ExperimentTaskParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = ExperimentItem<Input, ExpectedOutput, Metadata>;
```

**Usage Examples:**

```typescript
import type { ExperimentTask } from '@langfuse/client';

// Simple task function
const simpleTask: ExperimentTask = async ({ input }) => {
  return await processInput(input);
};

// Task with type safety
const typedTask: ExperimentTask<string, string> = async ({ input, metadata }) => {
  // input is typed as string
  // metadata is typed as Record<string, any>
  return await processString(input);
};

// Task accessing expected output (for reference)
const referenceTask: ExperimentTask = async ({ input, expectedOutput }) => {
  // Can access expectedOutput for context (but shouldn't use it for cheating!)
  console.log(`Processing input, expecting: ${expectedOutput}`);
  return await myModel(input);
};

// Task with custom types
interface QuestionInput {
  question: string;
  context: string;
}

const qaTask: ExperimentTask<QuestionInput, string> = async ({ input, metadata }) => {
  const { question, context } = input;
  return await answerQuestion(question, context);
};

// Task handling both ExperimentItem and DatasetItem
const universalTask: ExperimentTask = async (item) => {
  // Works with both types
  const input = item.input;
  const meta = item.metadata || {};

  // Check if it's a DatasetItem (has id and datasetId)
  if ('id' in item && 'datasetId' in item) {
    console.log(`Processing dataset item: ${item.id}`);
  }

  return await process(input, meta);
};

// Task with error handling
const robustTask: ExperimentTask = async ({ input }) => {
  try {
    return await riskyOperation(input);
  } catch (error) {
    console.error(`Task failed for input:`, input, error);
    throw error; // Re-throw to skip this item
  }
};

// Task with nested tracing
const tracedTask: ExperimentTask = async ({ input }) => {
  // Nested operations are automatically traced
  const step1 = await preprocessInput(input);
  const step2 = await modelInference(step1);
  const step3 = await postprocess(step2);
  return step3;
};
```

### ExperimentItem

Data item type for experiment inputs, supporting both custom items and Langfuse dataset items.

```typescript { .api }
/**
 * Experiment data item or dataset item
 *
 * Can be either a custom item with input/expectedOutput/metadata
 * or a DatasetItem from Langfuse
 */
type ExperimentItem<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> =
  | {
      /**
       * The input data to pass to the task function.
       *
       * Can be any type - string, object, array, etc. This data will be passed
       * to your task function as the `input` parameter.
       */
      input?: Input;

      /**
       * The expected output for evaluation purposes.
       *
       * Optional ground truth or reference output for this input.
       * Used by evaluators to assess task performance.
       */
      expectedOutput?: ExpectedOutput;

      /**
       * Optional metadata to attach to the experiment item.
       *
       * Store additional context, tags, or custom data related to this specific item.
       * This metadata will be available in traces and evaluators.
       */
      metadata?: Metadata;
    }
  | DatasetItem;
```

**Usage Examples:**

```typescript
import type { ExperimentItem } from '@langfuse/client';

// Simple string items
const stringItems: ExperimentItem<string, string>[] = [
  { input: "Hello", expectedOutput: "Hola" },
  { input: "Goodbye", expectedOutput: "Adiós" }
];

// Complex structured items
interface QAInput {
  question: string;
  context: string;
}

const qaItems: ExperimentItem<QAInput, string>[] = [
  {
    input: {
      question: "What is AI?",
      context: "AI stands for Artificial Intelligence..."
    },
    expectedOutput: "Artificial Intelligence"
  }
];

// Items with metadata
const itemsWithMetadata: ExperimentItem<string, string, { category: string }>[] = [
  {
    input: "Test input",
    expectedOutput: "Expected output",
    metadata: {
      category: "technology"
    }
  }
];

// Items without expected output (evaluation based on output only)
const noExpectedOutput: ExperimentItem<string, never>[] = [
  { input: "Generate creative text" }
  // No expectedOutput - evaluators won't have ground truth
];

// Mixed with Langfuse dataset items
const dataset = await langfuse.dataset.get("my-dataset");
const mixedItems: ExperimentItem[] = [
  // Custom items
  { input: "Custom input", expectedOutput: "Custom output" },
  // Dataset items
  ...dataset.items
];

// Accessing item properties in task
const task: ExperimentTask = async (item) => {
  if ('id' in item && 'datasetId' in item) {
    // It's a DatasetItem
    console.log(`Dataset item ID: ${item.id}`);
    console.log(`Dataset ID: ${item.datasetId}`);
  }

  return await process(item.input);
};
```

### Evaluator

Function type for item-level evaluators that assess individual task outputs.

```typescript { .api }
/**
 * Evaluator function for item-level evaluation
 *
 * Receives input, output, expected output, and metadata,
 * and returns evaluation results as Evaluation object(s).
 *
 * @param params - Parameters including input, output, expectedOutput, and metadata
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type Evaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: EvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type EvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original input data passed to the task.
   *
   * Use this for context-aware evaluations or input-output relationship analysis.
   */
  input: Input;

  /**
   * The output produced by the task.
   *
   * This is the actual result returned by your task function.
   */
  output: any;

  /**
   * The expected output for comparison (optional).
   *
   * This is the ground truth or expected result for the given input.
   */
  expectedOutput?: ExpectedOutput;

  /**
   * Metadata from the experiment item (optional).
   *
   * Available for metadata-aware evaluations such as per-category scoring.
   */
  metadata?: Metadata;
};
```

**Usage Examples:**

```typescript
import type { Evaluator, Evaluation } from '@langfuse/client';

// Simple exact match evaluator
const exactMatch: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Case-insensitive match with comment
const caseInsensitiveMatch: Evaluator = async ({ output, expectedOutput }) => {
  const match = output.toLowerCase() === expectedOutput?.toLowerCase();
  return {
    name: "case_insensitive_match",
    value: match ? 1 : 0,
    comment: match ? "Perfect match" : "No match"
  };
};

// Evaluator returning multiple scores
const comprehensiveEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  return [
    {
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "length_match",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    },
    {
      name: "similarity",
      value: calculateSimilarity(output, expectedOutput),
      comment: "Cosine similarity score"
    }
  ];
};

// Type-safe evaluator
const typedEvaluator: Evaluator<string, string> = async ({ input, output, expectedOutput }) => {
  // All parameters are typed
  return {
    name: "accuracy",
    value: output === expectedOutput ? 1 : 0,
    metadata: { input_length: input.length }
  };
};

// Evaluator using input context
const contextAwareEvaluator: Evaluator = async ({ input, output }) => {
  const isValid = validateOutput(output, input);
  return {
    name: "validity",
    value: isValid ? 1 : 0,
    comment: isValid ? "Output valid for input" : "Output invalid"
  };
};

// Evaluator with metadata
const categoryEvaluator: Evaluator<any, any, { category: string }> = async ({
  output,
  expectedOutput,
  metadata
}) => {
  const score = calculateScore(output, expectedOutput);
  return {
    name: "category_score",
    value: score,
    metadata: {
      category: metadata?.category,
      timestamp: new Date().toISOString()
    }
  };
};

// Evaluator with different data types
const numericEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const error = Math.abs(output - expectedOutput);
  return {
    name: "absolute_error",
    value: error,
    dataType: "numeric",
    comment: `Error: ${error.toFixed(2)}`
  };
};

// Boolean evaluator
const booleanEvaluator: Evaluator = async ({ output }) => {
  return {
    name: "is_valid",
    value: validateFormat(output),
    dataType: "boolean"
  };
};

// Evaluator with error handling
const robustEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    const score = complexCalculation(output, expectedOutput);
    return {
      name: "complex_score",
      value: score
    };
  } catch (error) {
    console.error("Evaluator failed:", error);
    throw error; // Will be caught and logged by experiment system
  }
};

// LLM-as-judge evaluator
const llmJudgeEvaluator: Evaluator = async ({ input, output, expectedOutput }) => {
  const judgmentPrompt = `
    Input: ${input}
    Expected: ${expectedOutput}
    Actual: ${output}

    Rate the quality from 0 to 1:
  `;

  const judgment = await llm.evaluate(judgmentPrompt);

  return {
    name: "llm_judgment",
    value: parseFloat(judgment),
    comment: `LLM evaluation of output quality`
  };
};
```
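
The `calculateSimilarity` helper referenced above is a placeholder. As one hedged stand-in, a token-overlap (Jaccard) similarity fits in a few lines; a production evaluator would more likely use embeddings or an n-gram metric:

```typescript
// Illustrative token-overlap (Jaccard) similarity in [0, 1].
// A stand-in sketch for the hypothetical calculateSimilarity helper;
// not a cosine similarity and not part of the SDK.
function tokenSimilarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (tokensA.size === 0 && tokensB.size === 0) return 1;

  let intersection = 0;
  for (const t of tokensA) {
    if (tokensB.has(t)) intersection++;
  }
  const union = new Set([...tokensA, ...tokensB]).size;
  return intersection / union;
}
```

Because the result is already in `[0, 1]`, it can be returned directly as an `Evaluation` value with `dataType: "numeric"`.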

### RunEvaluator

Function type for run-level evaluators that assess the entire experiment.

```typescript { .api }
/**
 * Evaluator function for run-level evaluation
 *
 * Receives all item results and performs aggregate analysis
 * across the entire experiment run.
 *
 * @param params - Parameters including all itemResults
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type RunEvaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: RunEvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type RunEvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Results from all processed experiment items.
   *
   * Each item contains the input, output, evaluations, and metadata from
   * processing a single data item. Use this for aggregate analysis,
   * statistical calculations, and cross-item comparisons.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];
};
```

**Usage Examples:**

```typescript
import type { RunEvaluator } from '@langfuse/client';

// Average score evaluator
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((a, b) => a + b, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy: ${(average * 100).toFixed(1)}%`
  };
};

// Multiple run-level metrics
const comprehensiveRunEvaluator: RunEvaluator = async ({ itemResults }) => {
  const accuracyScores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;
  const min = Math.min(...accuracyScores);
  const max = Math.max(...accuracyScores);
  const stdDev = calculateStdDev(accuracyScores);

  return [
    {
      name: "average_accuracy",
      value: average
    },
    {
      name: "min_accuracy",
      value: min
    },
    {
      name: "max_accuracy",
      value: max
    },
    {
      name: "std_dev_accuracy",
      value: stdDev,
      comment: "Standard deviation of accuracy scores"
    }
  ];
};

// Precision and recall
const precisionRecallEvaluator: RunEvaluator = async ({ itemResults }) => {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const result of itemResults) {
    if (result.output === "positive") {
      if (result.expectedOutput === "positive") {
        truePositives++;
      } else {
        falsePositives++;
      }
    } else if (result.expectedOutput === "positive") {
      falseNegatives++;
    }
  }

  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = 2 * (precision * recall) / (precision + recall);

  return [
    {
      name: "precision",
      value: precision,
      comment: `Precision: ${(precision * 100).toFixed(1)}%`
    },
    {
      name: "recall",
      value: recall,
      comment: `Recall: ${(recall * 100).toFixed(1)}%`
    },
    {
      name: "f1_score",
      value: f1,
      comment: `F1 Score: ${(f1 * 100).toFixed(1)}%`
    }
  ];
};

// Category-based analysis
const categoryAnalysisEvaluator: RunEvaluator<any, any, { category: string }> = async ({
  itemResults
}) => {
  const categories = new Map<string, number[]>();

  for (const result of itemResults) {
    const category = result.item.metadata?.category || "unknown";
    const accuracy = result.evaluations.find(e => e.name === "accuracy")?.value as number;

    if (!categories.has(category)) {
      categories.set(category, []);
    }
    categories.get(category)!.push(accuracy);
  }

  const evaluations: Evaluation[] = [];

  for (const [category, scores] of categories) {
    const average = scores.reduce((a, b) => a + b, 0) / scores.length;
    evaluations.push({
      name: `accuracy_${category}`,
      value: average,
      comment: `Average accuracy for ${category}: ${(average * 100).toFixed(1)}%`
    });
  }

  return evaluations;
};

// Percentile analysis
const percentileEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "score")
    .map(e => e.value as number)
    .sort((a, b) => a - b);

  const p50 = scores[Math.floor(scores.length * 0.5)];
  const p90 = scores[Math.floor(scores.length * 0.9)];
  const p95 = scores[Math.floor(scores.length * 0.95)];

  return [
    { name: "p50_score", value: p50, comment: "Median score" },
    { name: "p90_score", value: p90, comment: "90th percentile" },
    { name: "p95_score", value: p95, comment: "95th percentile" }
  ];
};

// Failure analysis
const failureAnalysisEvaluator: RunEvaluator = async ({ itemResults }) => {
  const failures = itemResults.filter(r => {
    const accuracy = r.evaluations.find(e => e.name === "accuracy")?.value;
    return accuracy === 0;
  });

  const failureRate = failures.length / itemResults.length;

  return {
    name: "failure_rate",
    value: failureRate,
    comment: `${failures.length} of ${itemResults.length} items failed (${(failureRate * 100).toFixed(1)}%)`
  };
};

// Cross-item consistency
const consistencyEvaluator: RunEvaluator = async ({ itemResults }) => {
  // Check if similar inputs produce similar outputs
  const consistency = analyzeConsistency(itemResults);

  return {
    name: "consistency_score",
    value: consistency,
    comment: "Consistency across similar inputs"
  };
};
```
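
The `calculateStdDev` helper used in the aggregate example is likewise a placeholder; a minimal population standard deviation might look like this:

```typescript
// Population standard deviation; a minimal sketch of the hypothetical
// calculateStdDev helper referenced above.
function calculateStdDev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```

Use the sample variant (dividing by `n - 1`) instead if the experiment items are treated as a sample of a larger population.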

### Evaluation

Result type for evaluations returned by evaluator functions.

```typescript { .api }
/**
 * Evaluation result from an evaluator
 *
 * Contains the score name, value, and optional metadata/comment.
 * Defined as a subset of ScoreBody:
 *
 *   type Evaluation = Pick<
 *     ScoreBody,
 *     "name" | "value" | "comment" | "metadata" | "dataType"
 *   >;
 *
 * Expanded, the type has the following shape:
 */
interface Evaluation {
  /**
   * Name of the evaluation metric
   *
   * Should be descriptive and unique within the evaluator set.
   */
  name: string;

  /**
   * Numeric or boolean value of the evaluation
   *
   * Typically 0-1 for accuracy/similarity scores, but can be any numeric value.
   */
  value: number | boolean;

  /**
   * Optional human-readable comment about the evaluation
   *
   * Useful for explaining the score or providing context.
   */
  comment?: string;

  /**
   * Optional metadata about the evaluation
   *
   * Store additional context or debugging information.
   */
  metadata?: Record<string, any>;

  /**
   * Optional data type specification
   *
   * Specifies how the value should be interpreted.
   */
  dataType?: "numeric" | "boolean" | "categorical";
}
```
1133
1134
**Usage Examples:**

```typescript
import type { Evaluation } from '@langfuse/client';

// Simple numeric evaluation
const simpleEval: Evaluation = {
  name: "accuracy",
  value: 0.85
};

// Boolean evaluation
const booleanEval: Evaluation = {
  name: "passed",
  value: true,
  dataType: "boolean"
};

// Evaluation with comment
const commentedEval: Evaluation = {
  name: "similarity",
  value: 0.92,
  comment: "High similarity between output and expected"
};

// Evaluation with metadata
const metadataEval: Evaluation = {
  name: "response_quality",
  value: 0.88,
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    tokens: 150
  },
  comment: "Quality assessment using LLM judge"
};

// Multiple evaluation types
const multiEval: Evaluation[] = [
  {
    name: "exact_match",
    value: 1,
    dataType: "boolean"
  },
  {
    name: "similarity",
    value: 0.95,
    dataType: "numeric",
    comment: "Cosine similarity"
  },
  {
    name: "category",
    value: 0,
    dataType: "categorical",
    metadata: { predicted: "A", actual: "B" }
  }
];
```

### ExperimentResult

Complete result structure returned by the run() method.

```typescript { .api }
/**
 * Complete result of an experiment execution
 *
 * Contains all results from processing the experiment data,
 * including individual item results, run-level evaluations,
 * and utilities for result visualization.
 */
type ExperimentResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The experiment run name.
   *
   * Either the provided runName parameter or a generated name (experiment name + timestamp).
   */
  runName: string;

  /**
   * ID of the dataset run in Langfuse (only for experiments on Langfuse datasets).
   *
   * Use this ID to access the dataset run via the Langfuse API or UI.
   */
  datasetRunId?: string;

  /**
   * URL to the dataset run in the Langfuse UI (only for experiments on Langfuse datasets).
   *
   * Direct link to view the complete dataset run in the Langfuse web interface.
   */
  datasetRunUrl?: string;

  /**
   * Results from processing each individual data item.
   *
   * Contains the complete results for every item in your experiment data.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];

  /**
   * Results from run-level evaluators that assessed the entire experiment.
   *
   * Contains aggregate evaluations that analyze the complete experiment.
   */
  runEvaluations: Evaluation[];

  /**
   * Function to format experiment results in a human-readable format.
   *
   * @param options - Formatting options
   * @param options.includeItemResults - Whether to include individual item details (default: false)
   * @returns Promise resolving to formatted string representation
   */
  format: (options?: { includeItemResults?: boolean }) => Promise<string>;
};
```

**Usage Examples:**

```typescript
import fs from 'node:fs/promises';
import type { ExperimentResult } from '@langfuse/client';

// Run experiment and access results
const result: ExperimentResult = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageEvaluator]
});

// Access run name
console.log(`Run name: ${result.runName}`);
// "Test Experiment - 2024-01-15T10:30:00.000Z"

// Access individual item results
console.log(`Processed ${result.itemResults.length} items`);
for (const itemResult of result.itemResults) {
  console.log(`Input: ${itemResult.input}`);
  console.log(`Output: ${itemResult.output}`);
  console.log(`Evaluations:`, itemResult.evaluations);
}

// Access run-level evaluations
console.log(`Run evaluations:`, result.runEvaluations);
const avgAccuracy = result.runEvaluations.find(e => e.name === "average_accuracy");
console.log(`Average accuracy: ${avgAccuracy?.value}`);

// Format results (summary only)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (10 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
10 items
Evaluations:
• accuracy

Average Scores:
• accuracy: 0.850

Run Evaluations:
• average_accuracy: 0.850
  💬 Average accuracy: 85.0%
*/

// Format with detailed results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is AI?
   Expected: Artificial Intelligence
   Actual: Artificial Intelligence
   Scores:
   • accuracy: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
...

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
...
*/

// Access dataset run information (if applicable)
if (result.datasetRunId) {
  console.log(`Dataset run ID: ${result.datasetRunId}`);
  console.log(`View in UI: ${result.datasetRunUrl}`);
}

// Calculate custom metrics from results
const successRate = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 1)
).length / result.itemResults.length;
console.log(`Success rate: ${(successRate * 100).toFixed(1)}%`);

// Export results for further analysis
const exportData = result.itemResults.map(r => ({
  input: r.input,
  output: r.output,
  expectedOutput: r.expectedOutput,
  scores: Object.fromEntries(
    r.evaluations.map(e => [e.name, e.value])
  )
}));
await fs.writeFile('results.json', JSON.stringify(exportData, null, 2));
```

### ExperimentItemResult

Result structure for individual item processing within an experiment.

```typescript { .api }
/**
 * Result from processing one experiment item
 *
 * Contains the input, output, evaluations, and trace information
 * for a single data item.
 */
type ExperimentItemResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original experiment or dataset item that was processed.
   *
   * Contains the complete original item data.
   */
  item: ExperimentItem<Input, ExpectedOutput, Metadata>;

  /**
   * The input data (extracted from item for convenience)
   */
  input?: Input;

  /**
   * The expected output (extracted from item for convenience)
   */
  expectedOutput?: ExpectedOutput;

  /**
   * The actual output produced by the task.
   *
   * This is the result returned by your task function for this specific input.
   */
  output: any;

  /**
   * Results from all evaluators that ran on this item.
   *
   * Contains evaluation scores, comments, and metadata from each evaluator.
   */
  evaluations: Evaluation[];

  /**
   * Langfuse trace ID for this item's execution.
   *
   * Use this ID to view detailed execution traces in the Langfuse UI.
   */
  traceId?: string;

  /**
   * Dataset run ID if this item was part of a Langfuse dataset.
   *
   * Links this item result to a specific dataset run.
   */
  datasetRunId?: string;
};
```

**Usage Examples:**

```typescript
import type { ExperimentItemResult } from '@langfuse/client';

// Process experiment results
const result = await langfuse.experiment.run(config);

for (const itemResult of result.itemResults) {
  // itemResult is typed as ExperimentItemResult
  // Access item data
  console.log(`Processing item:`, itemResult.item);
  console.log(`Input:`, itemResult.input);
  console.log(`Expected:`, itemResult.expectedOutput);
  console.log(`Actual:`, itemResult.output);

  // Access evaluations
  for (const evaluation of itemResult.evaluations) {
    console.log(`${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`  Comment: ${evaluation.comment}`);
    }
  }

  // Access trace information
  if (itemResult.traceId) {
    const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
    console.log(`View trace: ${traceUrl}`);
  }

  // Access dataset run information
  if (itemResult.datasetRunId) {
    console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  }
}

// Filter failed items
const failedItems = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 0)
);
console.log(`Failed items: ${failedItems.length}`);

// Group by score
const highScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) >= 0.8)
);
const lowScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) < 0.5)
);

// Analyze patterns
const errorPatterns = failedItems.map(r => ({
  input: r.input,
  output: r.output,
  expected: r.expectedOutput
}));
console.log("Error patterns:", errorPatterns);
```

## Integration with AutoEvals

Create Langfuse-compatible evaluators from AutoEvals library evaluators.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter handles parameter mapping and result formatting automatically.
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 *
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;
```

**Usage Examples:**

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Basic AutoEvals integration
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

await langfuse.experiment.run({
  name: "AutoEvals Integration Test",
  data: myDataset,
  task: myTask,
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

// With additional parameters
const customFactualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' } // Additional params for AutoEvals
);

await langfuse.experiment.run({
  name: "Factuality Test",
  data: testData,
  task: myTask,
  evaluators: [customFactualityEvaluator]
});

// Multiple AutoEvals evaluators
const closedQAEvaluator = createEvaluatorFromAutoevals(ClosedQA, {
  model: 'gpt-4',
  useCoT: true
});

const comprehensiveEvaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein),
  closedQAEvaluator
];

await langfuse.experiment.run({
  name: "Comprehensive Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: comprehensiveEvaluators
});

// Mixing AutoEvals and custom evaluators
await langfuse.experiment.run({
  name: "Mixed Evaluators",
  data: dataset,
  task: task,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluator
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});
```
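
Under the hood the adapter's job is just parameter and result mapping. The sketch below is an illustrative reimplementation, not the SDK's actual code: it assumes AutoEvals scorers accept `{ input, output, expected, ...params }` and return `{ name, score, metadata? }`, and `exactMatchScorer` is a toy stand-in for a real AutoEvals evaluator.

```typescript
// Score shape returned by AutoEvals scorers (assumption for this sketch)
type AutoevalsScore = { name: string; score: number | null; metadata?: Record<string, any> };

// Langfuse-style evaluator arguments and result
type EvaluatorArgs = { input: any; output: any; expectedOutput?: any };
type LangfuseEvaluation = { name: string; value: number; metadata?: Record<string, any> };

// Hypothetical adapter mirroring what createEvaluatorFromAutoevals does
function adaptAutoevals(
  scorer: (args: Record<string, any>) => AutoevalsScore | Promise<AutoevalsScore>,
  params: Record<string, any> = {}
) {
  return async ({ input, output, expectedOutput }: EvaluatorArgs): Promise<LangfuseEvaluation> => {
    // Parameter mapping: Langfuse `expectedOutput` becomes AutoEvals `expected`
    const score = await scorer({ input, output, expected: expectedOutput, ...params });

    // Result mapping: AutoEvals `score` becomes Langfuse `value` (null treated as 0)
    return { name: score.name, value: score.score ?? 0, metadata: score.metadata };
  };
}

// Toy scorer standing in for an AutoEvals evaluator
const exactMatchScorer = ({ output, expected }: Record<string, any>): AutoevalsScore => ({
  name: "ExactMatch",
  score: output === expected ? 1 : 0
});

const exactMatchEvaluator = adaptAutoevals(exactMatchScorer);
```

An evaluator produced this way plugs into `evaluators` like any hand-written one.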

## Advanced Usage

### Type Safety with Generics

Use TypeScript generics for full type safety across the experiment pipeline.

```typescript
// Define your types
interface QuestionInput {
  question: string;
  context: string[];
}

interface AnswerOutput {
  answer: string;
  confidence: number;
  sources: string[];
}

interface ItemMetadata {
  category: "science" | "history" | "literature";
  difficulty: number;
  tags: string[];
}

// Type-safe experiment configuration
const result = await langfuse.experiment.run<
  QuestionInput,
  AnswerOutput,
  ItemMetadata
>({
  name: "Typed QA Experiment",
  data: [
    {
      input: {
        question: "What is photosynthesis?",
        context: ["Photosynthesis is the process..."]
      },
      expectedOutput: {
        answer: "A process where plants convert light to energy",
        confidence: 0.9,
        sources: ["biology textbook"]
      },
      metadata: {
        category: "science",
        difficulty: 5,
        tags: ["biology", "plants"]
      }
    }
  ],
  task: async ({ input, metadata }) => {
    // input is typed as QuestionInput
    // metadata is typed as ItemMetadata
    const { question, context } = input;
    const difficulty = metadata?.difficulty || 5;

    return await qaModel(question, context, difficulty);
    // Return type should match AnswerOutput
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      // input: QuestionInput
      // output: any (task output)
      // expectedOutput: AnswerOutput | undefined

      return {
        name: "answer_quality",
        value: output.confidence
      };
    }
  ]
});

// Result is typed as ExperimentResult<QuestionInput, AnswerOutput, ItemMetadata>
for (const itemResult of result.itemResults) {
  // itemResult.input is QuestionInput | undefined
  // itemResult.output is any
  // itemResult.expectedOutput is AnswerOutput | undefined
  console.log(itemResult.input?.question);
  console.log(itemResult.expectedOutput?.confidence);
}
```

### Parallel vs Sequential Execution

Control experiment execution parallelism with maxConcurrency.

```typescript
// Fully parallel (default)
const parallelResult = await langfuse.experiment.run({
  name: "Parallel Execution",
  data: largeDataset,
  task: fastTask,
  evaluators: [evaluator]
  // maxConcurrency: Infinity (default)
});

// Sequential execution
const sequentialResult = await langfuse.experiment.run({
  name: "Sequential Execution",
  data: dataset,
  task: task,
  maxConcurrency: 1 // Process one item at a time
});

// Controlled parallelism
const controlledResult = await langfuse.experiment.run({
  name: "Rate Limited Execution",
  data: dataset,
  task: expensiveAPICall,
  maxConcurrency: 5 // Max 5 concurrent API calls
});

// Batched processing
const batchSize = 10;
const batchedResult = await langfuse.experiment.run({
  name: "Batched Processing",
  data: veryLargeDataset,
  task: task,
  maxConcurrency: batchSize // Process in batches of 10
});
```

### Dataset Integration

Run experiments directly on Langfuse datasets with automatic linking.

```typescript
// Get dataset
const dataset = await langfuse.dataset.get("my-dataset");

// Run experiment on dataset (automatic data parameter)
const result = await dataset.runExperiment({
  name: "GPT-4 Evaluation",
  task: async ({ input }) => {
    // Process dataset item
    return await model(input);
  },
  evaluators: [evaluator],
  runEvaluators: [averageEvaluator]
});

// Results are automatically linked to dataset run
console.log(`Dataset run ID: ${result.datasetRunId}`);
console.log(`View in UI: ${result.datasetRunUrl}`);

// Each item result is linked
for (const itemResult of result.itemResults) {
  console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  console.log(`Trace ID: ${itemResult.traceId}`);
}

// Compare multiple runs on same dataset
const run1 = await dataset.runExperiment({
  name: "Model A",
  runName: "model-a-run-1",
  task: modelA,
  evaluators: [evaluator]
});

const run2 = await dataset.runExperiment({
  name: "Model B",
  runName: "model-b-run-1",
  task: modelB,
  evaluators: [evaluator]
});

// Compare results
console.log("Model A avg:", run1.runEvaluations[0].value);
console.log("Model B avg:", run2.runEvaluations[0].value);
```
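
To turn side-by-side logging like the above into an actual comparison, a small helper can diff the two runs' run-level scores by name. `diffRunScores` and the `RunLike` shape below are hypothetical helpers, assuming numeric evaluation values:

```typescript
// Minimal shape of an experiment result for comparison purposes
type RunLike = { runEvaluations: { name: string; value: number | boolean }[] };

// Returns, per shared metric name, how much run `a` outscored run `b`
function diffRunScores(a: RunLike, b: RunLike): Record<string, number> {
  const diffs: Record<string, number> = {};
  for (const evalA of a.runEvaluations) {
    const evalB = b.runEvaluations.find(e => e.name === evalA.name);
    if (evalB !== undefined) {
      // Positive means run `a` scored higher on this metric
      diffs[evalA.name] = Number(evalA.value) - Number(evalB.value);
    }
  }
  return diffs;
}
```

Metrics present in only one run are skipped, so the helper works even when the two runs used different evaluator sets.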

### Result Formatting

Use the format() function to generate human-readable result summaries.

```typescript
import fs from 'node:fs/promises';

const result = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: task,
  evaluators: [evaluator],
  runEvaluators: [runEvaluator]
});

// Format summary (default)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (50 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
50 items
Evaluations:
• accuracy
• f1_score

Average Scores:
• accuracy: 0.850
• f1_score: 0.823

Run Evaluations:
• average_accuracy: 0.850
  💬 Average accuracy: 85.0%
• precision: 0.875
  💬 Precision: 87.5%

🔗 Dataset Run:
https://cloud.langfuse.com/project/xxx/datasets/yyy/runs/zzz
*/

// Format with detailed item results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is the capital of France?
   Expected: Paris
   Actual: Paris
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Dataset Item:
   https://cloud.langfuse.com/project/xxx/datasets/yyy/items/123

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
   Input: What is 2+2?
   Expected: 4
   Actual: 4
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/def456

... (50 items total)

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
... (summary as above)
*/

// Save formatted results to file
const formatted = await result.format({ includeItemResults: true });
await fs.writeFile('experiment-results.txt', formatted);

// Use in CI/CD: fail the pipeline when a run-level score drops below a threshold
if (result.runEvaluations.some(e => e.name === "average_accuracy" && (e.value as number) < 0.8)) {
  throw new Error("Experiment failed: accuracy below threshold");
}
```

### Error Handling Strategies

Implement robust error handling for production experiments.

```typescript
import type { ExperimentTask, Evaluator } from '@langfuse/client';

// Task with retry logic
const resilientTask: ExperimentTask = async ({ input }) => {
  let lastError;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await apiCall(input);
    } catch (error) {
      lastError = error;
      await new Promise(resolve => setTimeout(resolve, 1000 * (attempt + 1)));
    }
  }
  throw lastError;
};

// Task with fallback
const fallbackTask: ExperimentTask = async ({ input }) => {
  try {
    return await primaryModel(input);
  } catch (error) {
    console.warn("Primary model failed, using fallback");
    return await fallbackModel(input);
  }
};

// Task with timeout
const timeoutTask: ExperimentTask = async ({ input }) => {
  return await Promise.race([
    modelCall(input),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), 30000)
    )
  ]);
};

// Evaluator with validation
const validatingEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    if (typeof output !== 'string' || typeof expectedOutput !== 'string') {
      throw new Error("Invalid output types");
    }

    return {
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    };
  } catch (error: any) {
    console.error("Evaluator validation failed:", error);
    return {
      name: "accuracy",
      value: 0,
      comment: `Validation error: ${error.message}`
    };
  }
};

// Run experiment with error tracking
const result = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: resilientTask,
  evaluators: [validatingEvaluator]
});

// Check for failures (failed items are skipped, so counts can differ)
const successCount = result.itemResults.length;
const totalCount = testData.length;
const failureCount = totalCount - successCount;

if (failureCount > 0) {
  console.warn(`${failureCount} items failed during experiment`);
}
```

## Best Practices

### Experiment Organization

```typescript
// ✅ Good: Descriptive naming
await langfuse.experiment.run({
  name: "GPT-4 vs GPT-3.5 on QA Dataset",
  runName: "gpt-4-2024-01-15-temp-0.7",
  description: "Comparing model performance with temperature 0.7",
  metadata: {
    model_version: "gpt-4-0125-preview",
    temperature: 0.7,
    dataset_version: "v2.1"
  }
});

// ❌ Bad: Generic naming
await langfuse.experiment.run({
  name: "Test",
  data: data,
  task: task
});
```

### Evaluator Design

```typescript
// ✅ Good: Multiple focused evaluators
const evaluators = [
  // Simple binary check
  async ({ output, expectedOutput }) => ({
    name: "exact_match",
    value: output === expectedOutput ? 1 : 0
  }),
  // Similarity score
  async ({ output, expectedOutput }) => ({
    name: "cosine_similarity",
    value: calculateCosineSimilarity(output, expectedOutput)
  }),
  // Format validation
  async ({ output }) => ({
    name: "format_valid",
    value: validateFormat(output) ? 1 : 0
  })
];

// ❌ Bad: One complex evaluator doing everything
const badEvaluator = async ({ output, expectedOutput }) => ({
  name: "score",
  value: complexCalculation(output, expectedOutput)
  // Unclear what this represents
});
```

### Concurrency Management

```typescript
// ✅ Good: Appropriate concurrency limits
await langfuse.experiment.run({
  name: "Rate-Limited API Experiment",
  data: largeDataset,
  task: expensiveAPICall,
  maxConcurrency: 5, // Respect API rate limits
  evaluators: [evaluator]
});

// ✅ Good: High concurrency for local operations
await langfuse.experiment.run({
  name: "Local Model Experiment",
  data: dataset,
  task: localModelInference,
  maxConcurrency: 50, // Local model can handle high concurrency
  evaluators: [evaluator]
});

// ❌ Bad: No concurrency control for rate-limited API
await langfuse.experiment.run({
  name: "Uncontrolled Experiment",
  data: largeDataset,
  task: rateLimitedAPI
  // Will likely hit rate limits
});
```

### Type Safety

```typescript
// ✅ Good: Explicit types
interface Input {
  question: string;
  context: string;
}

interface Output {
  answer: string;
  confidence: number;
}

const result = await langfuse.experiment.run<Input, Output>({
  name: "Typed Experiment",
  data: [
    {
      input: { question: "...", context: "..." },
      expectedOutput: { answer: "...", confidence: 0.9 }
    }
  ],
  task: async ({ input }) => {
    // input is typed as Input
    return await processTyped(input);
  }
});

// ❌ Bad: Implicit any types
const untypedResult = await langfuse.experiment.run({
  name: "Untyped Experiment",
  data: [{ input: someData }],
  task: async ({ input }) => {
    // input is any
    return await process(input);
  }
});
```

### Result Analysis

```typescript
// ✅ Good: Use run evaluators for aggregates
await langfuse.experiment.run({
  name: "Analysis Experiment",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      // Calculate aggregate metrics
      const avg = calculateAverage(itemResults);
      const stdDev = calculateStdDev(itemResults);

      return [
        { name: "average", value: avg },
        { name: "std_dev", value: stdDev }
      ];
    }
  ]
});

// ❌ Bad: Manual aggregation after experiment
const result = await langfuse.experiment.run({
  name: "Manual Analysis",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator]
});

// Manually calculating aggregates (should use run evaluators)
const scores = result.itemResults.map(r => r.evaluations[0].value);
const avg = scores.reduce((a, b) => a + b) / scores.length;
```
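
The `calculateAverage` and `calculateStdDev` helpers in the good example are left abstract; for a single metric they might look like this (`ItemLike` is a local stand-in for `ExperimentItemResult`):

```typescript
// Minimal item shape for aggregation
type ItemLike = { evaluations: { name: string; value: number | boolean }[] };

// Run-evaluator-style aggregate: mean and population standard deviation of one metric
function aggregateMetric(itemResults: ItemLike[], metric: string) {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === metric)
    .map(e => Number(e.value));

  // No items produced this metric: return no evaluations rather than NaN
  if (scores.length === 0) return [];

  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - avg) ** 2, 0) / scores.length;

  return [
    { name: `${metric}_avg`, value: avg },
    { name: `${metric}_std_dev`, value: Math.sqrt(variance) }
  ];
}
```

Returning an array of evaluations matches the run-evaluator contract shown earlier, so the helper can be used directly inside `runEvaluators`.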

## Performance Considerations

### Batching and Concurrency

- Use `maxConcurrency` to control parallelism and avoid overwhelming external APIs
- Default `maxConcurrency: Infinity` is suitable for local operations
- Set `maxConcurrency: 1` for sequential processing when order matters
- Typical values: 3-10 for API calls, 20-100 for local operations
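
How a concurrency cap behaves can be illustrated with a small worker-pool sketch (an illustration of the idea, not the SDK's actual implementation):

```typescript
// Runs `fn` over `items`, never exceeding `maxConcurrency` in-flight calls,
// and preserves input order in the results array.
async function mapWithConcurrency<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  maxConcurrency: number
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index
  const workers = Array.from({ length: Math.min(maxConcurrency, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });

  await Promise.all(workers);
  return results;
}
```

With `maxConcurrency: 1` this degrades to strictly sequential processing, matching the bullet above; with `Infinity` every item starts at once.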

### Memory Management

- Large datasets are processed in batches based on `maxConcurrency`
- Each batch is processed completely before moving to the next
- Failed items are logged and skipped, not stored in memory
- Consider breaking very large experiments into multiple smaller runs
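
Splitting an oversized experiment into smaller runs only needs a generic slicing helper; the loop in the comment mirrors the `experiment.run` API shown earlier (names like `veryLargeDataset` are placeholders):

```typescript
// Splits an array into consecutive slices of at most `size` elements
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Each slice can then become its own run, e.g.:
// for (const [i, part] of chunk(veryLargeDataset, 500).entries()) {
//   await langfuse.experiment.run({ name: "Big Eval", runName: `big-eval-part-${i}`, data: part, task, evaluators });
// }
```

Using a shared `name` with distinct `runName` values keeps the parts grouped under one experiment for later comparison.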

### Tracing Overhead

- OpenTelemetry tracing adds minimal overhead (~1-5ms per item)
- Traces are sent asynchronously and don't block experiment execution
- Disable tracing for maximum performance (though not recommended)
- Use `flush()` to ensure all traces are sent before shutdown

### Evaluator Performance

- Item-level evaluators run in parallel with task execution
- Failed evaluators don't block other evaluators
- LLM-as-judge evaluators can be slow; use `maxConcurrency` to control them
- Run-level evaluators execute sequentially after all items complete
2095