# AutoEvals Integration

The AutoEvals integration provides a seamless adapter for using evaluators from the [AutoEvals library](https://github.com/braintrustdata/autoevals) with Langfuse experiments. This adapter handles parameter mapping and result formatting automatically, allowing you to leverage battle-tested evaluation metrics without writing custom evaluation code.

## Capabilities

### createEvaluatorFromAutoevals

Convert AutoEvals evaluators to Langfuse-compatible evaluator functions with automatic parameter mapping.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter function bridges the gap between AutoEvals library evaluators
 * and Langfuse experiment evaluators, handling parameter mapping and result
 * formatting automatically.
 *
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 * This function handles the parameter name mapping transparently.
 *
 * The adapter also transforms the AutoEvals result format (with `name`, `score`,
 * and `metadata` fields) to the Langfuse evaluation format (with `name`, `value`,
 * and `metadata` fields).
 *
 * @template E - Type of the AutoEvals evaluator function
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;

/**
 * Utility type to extract parameter types from AutoEvals evaluator functions
 *
 * This type helper extracts the parameter type from an AutoEvals evaluator
 * and omits the standard parameters (input, output, expected) that are
 * handled by the adapter, leaving only the additional configuration parameters.
 *
 * @template E - The AutoEvals evaluator function type
 */
type Params<E> = Parameters<
  E extends (...args: any[]) => any ? E : never
>[0] extends infer P
  ? Omit<P, "input" | "output" | "expected">
  : never;
```

## Parameter Mapping

The adapter automatically handles the parameter name differences between AutoEvals and Langfuse:

| AutoEvals Parameter | Langfuse Parameter | Description |
|---------------------|--------------------|-------------|
| `input` | `input` | The input data passed to the task |
| `output` | `output` | The output produced by the task |
| `expected` | `expectedOutput` | The expected/ground-truth output |

Additional parameters specified in the `params` argument are passed through to the AutoEvals evaluator without modification.

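Conceptually, the mapping step amounts to renaming `expectedOutput` to `expected` and merging in any extra configuration. The sketch below illustrates this; the `mapParams` helper and its types are illustrative assumptions, not the library's actual internals:

```typescript
// Illustrative shapes only; the real library types may differ.
type LangfuseArgs = { input: unknown; output: unknown; expectedOutput?: unknown };

// Rename expectedOutput -> expected and merge extra configuration
// parameters (e.g. { model: 'gpt-4o' }) into the call.
function mapParams(
  args: LangfuseArgs,
  extra: Record<string, unknown> = {}
): Record<string, unknown> {
  return {
    input: args.input,
    output: args.output,
    expected: args.expectedOutput,
    ...extra,
  };
}

const mapped = mapParams(
  { input: "France", output: "Paris", expectedOutput: "Paris" },
  { model: "gpt-4o" }
);
// mapped.expected is "Paris"; mapped.model is "gpt-4o"
```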
## Result Transformation

The adapter transforms AutoEvals results to the Langfuse evaluation format:

```typescript
// AutoEvals result format
{
  name: string;
  score: number;
  metadata?: Record<string, any>;
}

// Transformed to Langfuse format
{
  name: string;
  value: number; // mapped from score, defaults to 0 if undefined
  metadata?: Record<string, any>;
}
```

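As a sketch, the transformation renames `score` to `value` and defaults a missing score to 0. The `toLangfuseEvaluation` helper below is illustrative only, not part of the library:

```typescript
// Illustrative types; the actual library definitions may differ.
type AutoevalsResult = { name: string; score?: number; metadata?: Record<string, unknown> };
type LangfuseEvaluation = { name: string; value: number; metadata?: Record<string, unknown> };

function toLangfuseEvaluation(result: AutoevalsResult): LangfuseEvaluation {
  return {
    name: result.name,
    value: result.score ?? 0, // score -> value, defaulting to 0 when undefined
    metadata: result.metadata,
  };
}

const evaluation = toLangfuseEvaluation({ name: "Factuality", score: 0.6 });
// evaluation.value is 0.6
```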
## Usage Examples

### Basic Usage

Use AutoEvals evaluators directly with Langfuse experiments:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Create wrapped evaluators
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Use in an experiment (assumes an initialized OpenAI client in scope)
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

console.log(await result.format());
```

### With Additional Parameters

Pass configuration parameters to AutoEvals evaluators:

```typescript
import { Factuality, ClosedQA, Battle } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure the Factuality evaluator with a custom model
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Configure ClosedQA with a model and chain-of-thought reasoning
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  {
    model: 'gpt-4-turbo',
    useCoT: true // Enable chain-of-thought reasoning
  }
);

// Configure the Battle evaluator for model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  {
    model: 'gpt-4',
    instructions: 'Compare which response is more accurate and helpful'
  }
);

await langfuse.experiment.run({
  name: "Configured Evaluators Test",
  data: qaDataset,
  task: myTask,
  evaluators: [
    factualityEvaluator,
    closedQAEvaluator,
    battleEvaluator
  ]
});
```

### Common AutoEvals Evaluators

Examples using popular AutoEvals evaluators:

```typescript
import {
  Factuality,
  Levenshtein,
  ClosedQA,
  Battle,
  Humor,
  Security,
  Sql,
  ValidJson,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Text similarity and accuracy
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Factuality checking (requires OpenAI)
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Closed-domain QA evaluation
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  { model: 'gpt-4o' }
);

// Model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  { model: 'gpt-4' }
);

// Humor detection
const humorEvaluator = createEvaluatorFromAutoevals(
  Humor,
  { model: 'gpt-4o' }
);

// Security checking
const securityEvaluator = createEvaluatorFromAutoevals(
  Security,
  { model: 'gpt-4o' }
);

// SQL validation
const sqlEvaluator = createEvaluatorFromAutoevals(Sql);

// JSON validation
const jsonEvaluator = createEvaluatorFromAutoevals(ValidJson);

// Answer relevancy
const relevancyEvaluator = createEvaluatorFromAutoevals(
  AnswerRelevancy,
  { model: 'gpt-4o' }
);

// Use multiple evaluators for a comprehensive assessment
await langfuse.experiment.run({
  name: "Comprehensive QA Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: [
    levenshteinEvaluator,
    factualityEvaluator,
    closedQAEvaluator,
    relevancyEvaluator
  ]
});
```

### With Langfuse Datasets

Use AutoEvals evaluators when running experiments on Langfuse datasets:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch the dataset from Langfuse
const dataset = await langfuse.dataset.get("qa-evaluation-dataset");

// Run an experiment with AutoEvals evaluators
// (assumes an initialized OpenAI client in scope)
const result = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Evaluating GPT-4 performance on QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

console.log(`Dataset Run URL: ${result.datasetRunUrl}`);
console.log(await result.format());
```

### Combining AutoEvals and Custom Evaluators

Mix AutoEvals evaluators with your own custom evaluation logic:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

// Custom evaluator
const exactMatchEvaluator: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0,
  comment: output === expectedOutput ? "Perfect match" : "No match"
});

// Custom evaluator with metadata
const lengthEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const outputLen = output?.length || 0;
  const expectedLen = expectedOutput?.length || 0;
  const lengthDiff = Math.abs(outputLen - expectedLen);

  return {
    name: "length_similarity",
    value: 1 - (lengthDiff / Math.max(outputLen, expectedLen, 1)),
    metadata: {
      outputLength: outputLen,
      expectedLength: expectedLen,
      difference: lengthDiff
    }
  };
};

// Custom evaluator returning multiple evaluations
const comprehensiveCustomEvaluator: Evaluator = async ({
  input,
  output,
  expectedOutput
}) => {
  return [
    {
      name: "contains_expected",
      value: output.includes(expectedOutput) ? 1 : 0
    },
    {
      name: "case_sensitive_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "case_insensitive_match",
      value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0
    }
  ];
};

// Combine everything
await langfuse.experiment.run({
  name: "Mixed Evaluators Experiment",
  data: dataset,
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluators
    exactMatchEvaluator,
    lengthEvaluator,
    comprehensiveCustomEvaluator
  ]
});
```

### Advanced: Domain-Specific Evaluations

Configure AutoEvals evaluators for specific domains:

```typescript
import {
  Factuality,
  ClosedQA,
  Security,
  Sql,
  ValidJson,
  Humor,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Medical QA evaluation
const medicalQAEvaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o'
    // Additional context can be provided through metadata
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Medical QA Evaluation",
  description: "Evaluating medical question answering accuracy",
  data: medicalQADataset,
  task: medicalQATask,
  evaluators: medicalQAEvaluators
});

// Code generation evaluation
const codeGenerationEvaluators = [
  createEvaluatorFromAutoevals(Security, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ValidJson), // If generating JSON
  createEvaluatorFromAutoevals(Sql) // If generating SQL
];

await langfuse.experiment.run({
  name: "Code Generation Quality",
  description: "Evaluating generated code for security and validity",
  data: codeGenDataset,
  task: codeGenTask,
  evaluators: codeGenerationEvaluators
});

// Creative writing evaluation
const creativeWritingEvaluators = [
  createEvaluatorFromAutoevals(Humor, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(AnswerRelevancy, { model: 'gpt-4o' })
];

await langfuse.experiment.run({
  name: "Creative Writing Assessment",
  description: "Evaluating creative writing quality",
  data: writingPromptsDataset,
  task: writingTask,
  evaluators: creativeWritingEvaluators
});
```

### Parallel Evaluation with Concurrency Control

Run experiments with AutoEvals evaluators and concurrency limits:

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  data: largeDataset, // 1000+ items
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
  ],
  maxConcurrency: 10 // Limit concurrent task executions
});

// Evaluators run in parallel for each item,
// but only 10 items are processed concurrently
console.log(`Processed ${result.itemResults.length} items`);
console.log(await result.format());
```

## Integration Patterns

### Pattern 1: Standard AutoEvals Integration

The most common pattern for using AutoEvals evaluators:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Step 1: Wrap AutoEvals evaluators
const evaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein)
];

// Step 2: Run the experiment
const result = await langfuse.experiment.run({
  name: "My Experiment",
  data: myData,
  task: myTask,
  evaluators
});

// Step 3: Review the results
console.log(await result.format());
```

### Pattern 2: Configured AutoEvals Integration

Use when you need to pass custom parameters to AutoEvals:

```typescript
import { Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure evaluators with custom parameters
const evaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o' // passed through to the AutoEvals Factuality evaluator
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Configured Evaluation",
  data: myData,
  task: myTask,
  evaluators
});
```

### Pattern 3: Hybrid Evaluation Strategy

Combine AutoEvals evaluators with custom evaluation logic:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

const hybridEvaluators: Evaluator[] = [
  // Use AutoEvals for complex evaluations
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),

  // Use custom evaluators for domain-specific logic
  // (checkBusinessRules is your own domain helper)
  async ({ output, metadata }) => ({
    name: "business_rule_check",
    value: checkBusinessRules(output, metadata) ? 1 : 0,
    comment: "Domain-specific business rule validation"
  })
];

await langfuse.experiment.run({
  name: "Hybrid Evaluation",
  data: myData,
  task: myTask,
  evaluators: hybridEvaluators
});
```

### Pattern 4: Progressive Evaluation

Start with simple evaluators and add more complex ones:

```typescript
import { Levenshtein, Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Phase 1: Quick evaluation with simple metrics
const quickEvaluators = [
  createEvaluatorFromAutoevals(Levenshtein)
];

const quickResult = await langfuse.experiment.run({
  name: "Quick Evaluation - Phase 1",
  data: myData,
  task: myTask,
  evaluators: quickEvaluators
});

// Analyze quick results...
console.log(await quickResult.format());

// Phase 2: Deep evaluation with LLM-based metrics
const deepEvaluators = [
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o', useCoT: true })
];

const deepResult = await langfuse.experiment.run({
  name: "Deep Evaluation - Phase 2",
  data: myData,
  task: myTask,
  evaluators: deepEvaluators
});

console.log(await deepResult.format());
```

## Best Practices

### 1. Choose Appropriate Evaluators

Select AutoEvals evaluators that match your evaluation needs:

```typescript
// For factual accuracy - use Factuality
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })

// For text similarity - use Levenshtein
createEvaluatorFromAutoevals(Levenshtein)

// For closed-domain QA - use ClosedQA
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })

// For comparing two outputs - use Battle
createEvaluatorFromAutoevals(Battle, { model: 'gpt-4' })

// For code validation - use Sql, ValidJson, etc.
createEvaluatorFromAutoevals(ValidJson)
```

### 2. Configure Model Parameters

Always specify model parameters for LLM-based AutoEvals evaluators:

```typescript
// Good: explicit model configuration
const configuredEvaluator = createEvaluatorFromAutoevals(Factuality, {
  model: 'gpt-4o'
});

// Less ideal: relying on defaults (may vary between versions)
const defaultEvaluator = createEvaluatorFromAutoevals(Factuality);
```

### 3. Mix Evaluator Types

Combine different types of evaluators for comprehensive assessment:

```typescript
const evaluators = [
  // Fast, deterministic evaluators
  createEvaluatorFromAutoevals(Levenshtein),
  createEvaluatorFromAutoevals(ValidJson),

  // LLM-based evaluators
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' }),

  // Custom domain-specific evaluators
  customBusinessLogicEvaluator
];
```

### 4. Handle Evaluation Costs

Be mindful of API costs when using LLM-based AutoEvals evaluators:

```typescript
// For large datasets, start with cheaper evaluators
const result = await langfuse.experiment.run({
  name: "Cost-Conscious Evaluation",
  data: largeDataset,
  task: myTask,
  evaluators: [
    // Free/cheap evaluators
    createEvaluatorFromAutoevals(Levenshtein),

    // Use GPT-4 selectively or use cheaper models
    createEvaluatorFromAutoevals(Factuality, {
      model: 'gpt-3.5-turbo' // Cheaper alternative
    })
  ],
  maxConcurrency: 5 // Control rate limiting
});
```

### 5. Understand Parameter Mapping

Remember that the adapter automatically maps parameters:

```typescript
// Your Langfuse data
const data = [
  {
    input: "What is 2+2?",
    expectedOutput: "4" // Note: expectedOutput (Langfuse format)
  }
];

// AutoEvals receives:
// {
//   input: "What is 2+2?",
//   output: <task result>,
//   expected: "4" // Automatically mapped from expectedOutput
// }

const evaluator = createEvaluatorFromAutoevals(Factuality);
```

### 6. Test Evaluators Individually

Test AutoEvals evaluators with sample data before full experiments:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Create the evaluator
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Test with sample data
const testResult = await langfuse.experiment.run({
  name: "Evaluator Test",
  data: [
    { input: "Test input", expectedOutput: "Test output" }
  ],
  task: async () => "Test result",
  evaluators: [factualityEvaluator]
});

console.log(await testResult.format());
// Verify the evaluator works as expected before scaling up
```

### 7. Monitor Evaluation Results

Track evaluation scores across experiments:

```typescript
const result = await langfuse.experiment.run({
  name: "Production Evaluation",
  data: productionDataset,
  task: productionTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// Analyze scores
const factualityScores = result.itemResults
  .flatMap(r => r.evaluations)
  .filter(e => e.name === 'Factuality')
  .map(e => e.value);

const avgFactuality =
  factualityScores.reduce((a, b) => a + b, 0) / factualityScores.length;

console.log(`Average Factuality Score: ${avgFactuality}`);

// View detailed results in the Langfuse UI
if (result.datasetRunUrl) {
  console.log(`View results: ${result.datasetRunUrl}`);
}
```

## Type Safety

The adapter provides full TypeScript type safety through the `Params<E>` utility type:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Type-safe parameter inference
const evaluator = createEvaluatorFromAutoevals(
  Factuality,
  {
    model: 'gpt-4o'      // ✓ valid: passed through to Factuality
    // input: "test",    // ✗ type error: input is omitted from Params<E>
    // output: "test",   // ✗ type error: output is omitted from Params<E>
    // expected: "test"  // ✗ type error: expected is omitted from Params<E>
  }
);

// The Params<E> type automatically:
// 1. Extracts the parameter type from the evaluator function
// 2. Omits the 'input', 'output', and 'expected' fields
// 3. Leaves only the additional configuration parameters
```

## Error Handling

The adapter handles evaluation failures gracefully:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Error Handling Test",
  data: myData,
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// If one evaluator fails, the others continue;
// failed evaluations are omitted from the results
result.itemResults.forEach(item => {
  console.log(`Item evaluations: ${item.evaluations.length}`);
  // May contain fewer evaluations if some failed
});
```

## Requirements

To use the AutoEvals integration, you need:

1. **Install AutoEvals**: `npm install autoevals`
2. **Install Langfuse Client**: `npm install @langfuse/client`
3. **API Keys**: Configure API keys for LLM-based evaluators (e.g., an OpenAI API key for Factuality, ClosedQA, etc.)

```typescript
// Set up environment variables for LLM-based evaluators:
// export OPENAI_API_KEY=your_openai_api_key

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

// LLM-based evaluators will use OPENAI_API_KEY from the environment
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);
```

## Related Documentation

- [Experiment Execution](/docs/experiments.md) - Complete experiment system documentation
- [Evaluator Types](/docs/experiments.md#evaluators) - Understanding evaluator functions
- [Dataset Management](/docs/datasets.md) - Working with Langfuse datasets
- [AutoEvals Library](https://github.com/braintrustdata/autoevals) - Official AutoEvals documentation

## Summary

The AutoEvals adapter provides:

- **Automatic Parameter Mapping**: Transparently maps Langfuse parameters to the AutoEvals format
- **Result Transformation**: Converts AutoEvals results to the Langfuse evaluation format
- **Type Safety**: Full TypeScript support with the `Params<E>` utility type
- **Seamless Integration**: Works with both `langfuse.experiment.run()` and `dataset.runExperiment()`
- **Flexible Configuration**: Pass custom parameters to AutoEvals evaluators
- **Hybrid Evaluation**: Mix AutoEvals and custom evaluators in the same experiment

This adapter lets you leverage the comprehensive suite of AutoEvals metrics without writing custom evaluation code, while maintaining full compatibility with Langfuse's experiment system.