# AutoEvals Integration

The AutoEvals Integration provides a seamless adapter for using evaluators from the [AutoEvals library](https://github.com/braintrustdata/autoevals) with Langfuse experiments. This adapter handles parameter mapping and result formatting automatically, allowing you to leverage battle-tested evaluation metrics without writing custom evaluation code.

## Capabilities

### createEvaluatorFromAutoevals

Convert AutoEvals evaluators to Langfuse-compatible evaluator functions with automatic parameter mapping.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter function bridges the gap between AutoEvals library evaluators
 * and Langfuse experiment evaluators, handling parameter mapping and result
 * formatting automatically.
 *
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 * This function handles the parameter name mapping transparently.
 *
 * The adapter also transforms the AutoEvals result format (with `name`, `score`,
 * and `metadata` fields) to the Langfuse evaluation format (with `name`, `value`,
 * and `metadata` fields).
 *
 * @template E - Type of the AutoEvals evaluator function
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;

/**
 * Utility type to extract parameter types from AutoEvals evaluator functions
 *
 * This type helper extracts the parameter type from an AutoEvals evaluator
 * and omits the standard parameters (input, output, expected) that are
 * handled by the adapter, leaving only the additional configuration parameters.
 *
 * @template E - The AutoEvals evaluator function type
 */
type Params<E> = Parameters<
  E extends (...args: any[]) => any ? E : never
>[0] extends infer P
  ? Omit<P, "input" | "output" | "expected">
  : never;
```

## Parameter Mapping

The adapter automatically handles the parameter name differences between AutoEvals and Langfuse:

| AutoEvals Parameter | Langfuse Parameter | Description |
|---------------------|--------------------|-------------|
| `input` | `input` | The input data passed to the task |
| `output` | `output` | The output produced by the task |
| `expected` | `expectedOutput` | The expected/ground-truth output |

Additional parameters specified in the `params` argument are passed through to the AutoEvals evaluator without modification.

## Result Transformation

The adapter transforms AutoEvals results to the Langfuse evaluation format:

```typescript
// AutoEvals result format
{
  name: string;
  score: number;
  metadata?: Record<string, any>;
}

// Transformed to Langfuse format
{
  name: string;
  value: number; // mapped from score; defaults to 0 if undefined
  metadata?: Record<string, any>;
}
```
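The mapping and transformation described above can be sketched as a small wrapper. This is a hypothetical re-implementation for illustration only, not the actual `@langfuse/client` source; `sketchAdapter` and its local types are made-up names:

```typescript
// Illustrative sketch of the adapter's behavior (not the real implementation):
// map Langfuse's expectedOutput to AutoEvals' expected on the way in,
// and AutoEvals' score to Langfuse's value on the way out.
type AutoevalsResult = { name: string; score?: number; metadata?: Record<string, unknown> };
type LangfuseEvaluation = { name: string; value: number; metadata?: Record<string, unknown> };

function sketchAdapter(
  autoevalEvaluator: (args: Record<string, unknown>) => AutoevalsResult | Promise<AutoevalsResult>,
  params: Record<string, unknown> = {}
) {
  return async (args: {
    input?: unknown;
    output?: unknown;
    expectedOutput?: unknown;
  }): Promise<LangfuseEvaluation> => {
    const result = await autoevalEvaluator({
      input: args.input,
      output: args.output,
      expected: args.expectedOutput, // parameter name mapping
      ...params,                     // extra config passed through unchanged
    });
    return {
      name: result.name,
      value: result.score ?? 0, // score -> value, defaulting to 0 if undefined
      metadata: result.metadata,
    };
  };
}
```

Wrapping a trivial exact-match function with `sketchAdapter` and calling it with `{ output: "4", expectedOutput: "4" }` would yield `{ name: "exact", value: 1 }`, which is the shape Langfuse experiments expect from an evaluator.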

## Usage Examples

### Basic Usage

Use AutoEvals evaluators directly with Langfuse experiments:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Create wrapped evaluators
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Use in experiment (assumes an initialized `openai` client is in scope)
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

console.log(await result.format());
```

### With Additional Parameters

Pass configuration parameters to AutoEvals evaluators:

```typescript
import { Factuality, ClosedQA, Battle } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure Factuality evaluator with custom model
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Configure ClosedQA with model and chain-of-thought
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  {
    model: 'gpt-4-turbo',
    useCoT: true // Enable chain-of-thought reasoning
  }
);

// Configure Battle evaluator for model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  {
    model: 'gpt-4',
    instructions: 'Compare which response is more accurate and helpful'
  }
);

await langfuse.experiment.run({
  name: "Configured Evaluators Test",
  data: qaDataset,
  task: myTask,
  evaluators: [
    factualityEvaluator,
    closedQAEvaluator,
    battleEvaluator
  ]
});
```

### Common AutoEvals Evaluators

Examples using popular AutoEvals evaluators:

```typescript
import {
  Factuality,
  Levenshtein,
  ClosedQA,
  Battle,
  Humor,
  Security,
  Sql,
  ValidJson,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Text similarity and accuracy
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Factuality checking (requires OpenAI)
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Closed-domain QA evaluation
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  { model: 'gpt-4o' }
);

// Model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  { model: 'gpt-4' }
);

// Humor detection
const humorEvaluator = createEvaluatorFromAutoevals(
  Humor,
  { model: 'gpt-4o' }
);

// Security checking
const securityEvaluator = createEvaluatorFromAutoevals(
  Security,
  { model: 'gpt-4o' }
);

// SQL validation
const sqlEvaluator = createEvaluatorFromAutoevals(Sql);

// JSON validation
const jsonEvaluator = createEvaluatorFromAutoevals(ValidJson);

// Answer relevancy
const relevancyEvaluator = createEvaluatorFromAutoevals(
  AnswerRelevancy,
  { model: 'gpt-4o' }
);

// Use multiple evaluators for comprehensive assessment
await langfuse.experiment.run({
  name: "Comprehensive QA Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: [
    levenshteinEvaluator,
    factualityEvaluator,
    closedQAEvaluator,
    relevancyEvaluator
  ]
});
```

### With Langfuse Datasets

Use AutoEvals evaluators when running experiments on Langfuse datasets:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset from Langfuse
const dataset = await langfuse.dataset.get("qa-evaluation-dataset");

// Run experiment with AutoEvals evaluators
const result = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Evaluating GPT-4 performance on QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

console.log(`Dataset Run URL: ${result.datasetRunUrl}`);
console.log(await result.format());
```

### Combining AutoEvals and Custom Evaluators

Mix AutoEvals evaluators with your own custom evaluation logic:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

// Custom evaluator
const exactMatchEvaluator: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0,
  comment: output === expectedOutput ? "Perfect match" : "No match"
});

// Custom evaluator with metadata
const lengthEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const outputLen = output?.length || 0;
  const expectedLen = expectedOutput?.length || 0;
  const lengthDiff = Math.abs(outputLen - expectedLen);

  return {
    name: "length_similarity",
    value: 1 - (lengthDiff / Math.max(outputLen, expectedLen, 1)),
    metadata: {
      outputLength: outputLen,
      expectedLength: expectedLen,
      difference: lengthDiff
    }
  };
};

// Custom evaluator returning multiple evaluations
const comprehensiveCustomEvaluator: Evaluator = async ({
  input,
  output,
  expectedOutput
}) => {
  return [
    {
      name: "contains_expected",
      value: output.includes(expectedOutput) ? 1 : 0
    },
    {
      name: "case_sensitive_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "case_insensitive_match",
      value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0
    }
  ];
};

// Combine everything
await langfuse.experiment.run({
  name: "Mixed Evaluators Experiment",
  data: dataset,
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluators
    exactMatchEvaluator,
    lengthEvaluator,
    comprehensiveCustomEvaluator
  ]
});
```

### Advanced: Domain-Specific Evaluations

Configure AutoEvals evaluators for specific domains:

```typescript
import {
  Factuality,
  ClosedQA,
  Security,
  Sql,
  ValidJson,
  Humor,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Medical QA evaluation
const medicalQAEvaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o'
    // Additional context can be provided through metadata
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Medical QA Evaluation",
  description: "Evaluating medical question answering accuracy",
  data: medicalQADataset,
  task: medicalQATask,
  evaluators: medicalQAEvaluators
});

// Code generation evaluation
const codeGenerationEvaluators = [
  createEvaluatorFromAutoevals(Security, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ValidJson), // If generating JSON
  createEvaluatorFromAutoevals(Sql) // If generating SQL
];

await langfuse.experiment.run({
  name: "Code Generation Quality",
  description: "Evaluating generated code for security and validity",
  data: codeGenDataset,
  task: codeGenTask,
  evaluators: codeGenerationEvaluators
});

// Creative writing evaluation
const creativeWritingEvaluators = [
  createEvaluatorFromAutoevals(Humor, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(AnswerRelevancy, { model: 'gpt-4o' })
];

await langfuse.experiment.run({
  name: "Creative Writing Assessment",
  description: "Evaluating creative writing quality",
  data: writingPromptsDataset,
  task: writingTask,
  evaluators: creativeWritingEvaluators
});
```

### Parallel Evaluation with Concurrency Control

Run experiments with AutoEvals evaluators and concurrency limits:

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  data: largeDataset, // 1000+ items
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
  ],
  maxConcurrency: 10 // Limit concurrent task executions
});

// Evaluators run in parallel for each item,
// but only 10 items are processed concurrently
console.log(`Processed ${result.itemResults.length} items`);
console.log(await result.format());
```

## Integration Patterns

### Pattern 1: Standard AutoEvals Integration

The most common pattern for using AutoEvals evaluators:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Step 1: Wrap AutoEvals evaluators
const evaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein)
];

// Step 2: Run experiment
const result = await langfuse.experiment.run({
  name: "My Experiment",
  data: myData,
  task: myTask,
  evaluators
});

// Step 3: Review results
console.log(await result.format());
```

### Pattern 2: Configured AutoEvals Integration

Use when you need to pass custom parameters to AutoEvals:

```typescript
import { Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure evaluators with custom parameters
const evaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o'
    // model will be passed to the AutoEvals Factuality evaluator
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Configured Evaluation",
  data: myData,
  task: myTask,
  evaluators
});
```

### Pattern 3: Hybrid Evaluation Strategy

Combine AutoEvals evaluators with custom evaluation logic:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluation } from '@langfuse/client';

const hybridEvaluators = [
  // Use AutoEvals for complex evaluations
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),

  // Use custom evaluators for domain-specific logic
  async ({ output, expectedOutput, metadata }): Promise<Evaluation> => ({
    name: "business_rule_check",
    value: checkBusinessRules(output, metadata) ? 1 : 0,
    comment: "Domain-specific business rule validation"
  })
];

await langfuse.experiment.run({
  name: "Hybrid Evaluation",
  data: myData,
  task: myTask,
  evaluators: hybridEvaluators
});
```

### Pattern 4: Progressive Evaluation

Start with simple evaluators and add more complex ones:

```typescript
import { Levenshtein, Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Phase 1: Quick evaluation with simple metrics
const quickEvaluators = [
  createEvaluatorFromAutoevals(Levenshtein)
];

const quickResult = await langfuse.experiment.run({
  name: "Quick Evaluation - Phase 1",
  data: myData,
  task: myTask,
  evaluators: quickEvaluators
});

// Analyze quick results...
console.log(await quickResult.format());

// Phase 2: Deep evaluation with LLM-based metrics
const deepEvaluators = [
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o', useCoT: true })
];

const deepResult = await langfuse.experiment.run({
  name: "Deep Evaluation - Phase 2",
  data: myData,
  task: myTask,
  evaluators: deepEvaluators
});

console.log(await deepResult.format());
```

## Best Practices

### 1. Choose Appropriate Evaluators

Select AutoEvals evaluators that match your evaluation needs:

```typescript
// For factual accuracy - use Factuality
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })

// For text similarity - use Levenshtein
createEvaluatorFromAutoevals(Levenshtein)

// For closed-domain QA - use ClosedQA
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })

// For comparing two outputs - use Battle
createEvaluatorFromAutoevals(Battle, { model: 'gpt-4' })

// For code validation - use Sql, ValidJson, etc.
createEvaluatorFromAutoevals(ValidJson)
```

### 2. Configure Model Parameters

Always specify model parameters for LLM-based AutoEvals evaluators:

```typescript
// Good: explicit model configuration
const evaluator = createEvaluatorFromAutoevals(Factuality, {
  model: 'gpt-4o'
});

// Less ideal: relying on defaults (may vary)
const evaluatorWithDefaults = createEvaluatorFromAutoevals(Factuality);
```

### 3. Mix Evaluator Types

Combine different types of evaluators for comprehensive assessment:

```typescript
const evaluators = [
  // Fast, deterministic evaluators
  createEvaluatorFromAutoevals(Levenshtein),
  createEvaluatorFromAutoevals(ValidJson),

  // LLM-based evaluators
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' }),

  // Custom domain-specific evaluators
  customBusinessLogicEvaluator
];
```

### 4. Handle Evaluation Costs

Be mindful of API costs when using LLM-based AutoEvals evaluators:

```typescript
// For large datasets, start with cheaper evaluators
const result = await langfuse.experiment.run({
  name: "Cost-Conscious Evaluation",
  data: largeDataset,
  task: myTask,
  evaluators: [
    // Free/cheap evaluators
    createEvaluatorFromAutoevals(Levenshtein),

    // Use GPT-4 selectively, or use cheaper models
    createEvaluatorFromAutoevals(Factuality, {
      model: 'gpt-3.5-turbo' // Cheaper alternative
    })
  ],
  maxConcurrency: 5 // Control rate limiting
});
```

### 5. Understand Parameter Mapping

Remember that the adapter automatically maps parameters:

```typescript
// Your Langfuse data
const data = [
  {
    input: "What is 2+2?",
    expectedOutput: "4" // Note: expectedOutput (Langfuse format)
  }
];

// AutoEvals receives:
// {
//   input: "What is 2+2?",
//   output: <task result>,
//   expected: "4" // Automatically mapped from expectedOutput
// }

const evaluator = createEvaluatorFromAutoevals(Factuality);
```

### 6. Test Evaluators Individually

Test AutoEvals evaluators with sample data before full experiments:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Create evaluator
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Test with sample data
const testResult = await langfuse.experiment.run({
  name: "Evaluator Test",
  data: [
    { input: "Test input", expectedOutput: "Test output" }
  ],
  task: async () => "Test result",
  evaluators: [factualityEvaluator]
});

console.log(await testResult.format());
// Verify the evaluator works as expected before scaling up
```

### 7. Monitor Evaluation Results

Track evaluation scores across experiments:

```typescript
const result = await langfuse.experiment.run({
  name: "Production Evaluation",
  data: productionDataset,
  task: productionTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// Analyze scores
const factualityScores = result.itemResults
  .flatMap(r => r.evaluations)
  .filter(e => e.name === 'Factuality')
  .map(e => e.value);

const avgFactuality =
  factualityScores.reduce((a, b) => a + b, 0) / factualityScores.length;

console.log(`Average Factuality Score: ${avgFactuality}`);

// View detailed results in the Langfuse UI
if (result.datasetRunUrl) {
  console.log(`View results: ${result.datasetRunUrl}`);
}
```

## Type Safety

The adapter provides full TypeScript type safety through the `Params<E>` utility type:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Type-safe parameter inference
const evaluator = createEvaluatorFromAutoevals(
  Factuality,
  {
    model: 'gpt-4o', // ✓ Valid parameter
    temperature: 0.7, // ✓ Valid parameter (if supported by Factuality)
    // @ts-expect-error: input/output/expected are handled by the adapter
    input: "test", // ✗ Error: input is omitted from params
    output: "test", // ✗ Error: output is omitted from params
    expected: "test" // ✗ Error: expected is omitted from params
  }
);

// The Params<E> type automatically:
// 1. Extracts the parameter type from the evaluator function
// 2. Omits the 'input', 'output', and 'expected' fields
// 3. Leaves only the additional configuration parameters
```

## Error Handling

The adapter handles evaluation failures gracefully:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Error Handling Test",
  data: myData,
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// If one evaluator fails, the others continue;
// failed evaluations are omitted from the results
result.itemResults.forEach(item => {
  console.log(`Item evaluations: ${item.evaluations.length}`);
  // May have fewer evaluations if some failed
});
```

## Requirements

To use the AutoEvals integration, you need:

1. **Install AutoEvals**: `npm install autoevals`
2. **Install Langfuse Client**: `npm install @langfuse/client`
3. **API Keys**: Configure API keys for LLM-based evaluators (e.g., an OpenAI API key for Factuality, ClosedQA, etc.)

```typescript
// Set up environment variables for LLM-based evaluators
// export OPENAI_API_KEY=your_openai_api_key

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

// LLM-based evaluators will use OPENAI_API_KEY from the environment
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);
```

## Related Documentation

- [Experiment Execution](/docs/experiments.md) - Complete experiment system documentation
- [Evaluator Types](/docs/experiments.md#evaluators) - Understanding evaluator functions
- [Dataset Management](/docs/datasets.md) - Working with Langfuse datasets
- [AutoEvals Library](https://github.com/braintrustdata/autoevals) - Official AutoEvals documentation

## Summary

The AutoEvals adapter provides:

- **Automatic Parameter Mapping**: Transparently maps Langfuse parameters to the AutoEvals format
- **Result Transformation**: Converts AutoEvals results to the Langfuse evaluation format
- **Type Safety**: Full TypeScript support with the `Params<E>` utility type
- **Seamless Integration**: Works with both `langfuse.experiment.run()` and `dataset.runExperiment()`
- **Flexible Configuration**: Pass custom parameters to AutoEvals evaluators
- **Hybrid Evaluation**: Mix AutoEvals and custom evaluators in the same experiment

This adapter enables you to leverage the comprehensive suite of AutoEvals metrics without writing custom evaluation code, while maintaining full compatibility with Langfuse's experiment system.
