Langfuse API client for universal JavaScript environments providing observability, prompt management, datasets, experiments, and scoring capabilities
The AutoEvals Integration provides a seamless adapter for using evaluators from the AutoEvals library with Langfuse experiments. This adapter handles parameter mapping and result formatting automatically, allowing you to leverage battle-tested evaluation metrics without writing custom evaluation code.
Convert AutoEvals evaluators to Langfuse-compatible evaluator functions with automatic parameter mapping.
/**
* Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
*
* This adapter function bridges the gap between AutoEvals library evaluators
* and Langfuse experiment evaluators, handling parameter mapping and result
* formatting automatically.
*
* AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
* while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
* This function handles the parameter name mapping transparently.
*
* The adapter also transforms AutoEvals result format (with `name`, `score`,
* and `metadata` fields) to Langfuse evaluation format (with `name`, `value`,
* and `metadata` fields).
*
* @template E - Type of the AutoEvals evaluator function
* @param autoevalEvaluator - The AutoEvals evaluator function to convert
* @param params - Optional additional parameters to pass to the AutoEvals evaluator
* @returns A Langfuse-compatible evaluator function
*/
function createEvaluatorFromAutoevals<E extends CallableFunction>(
autoevalEvaluator: E,
params?: Params<E>
): Evaluator;
/**
* Utility type to extract parameter types from AutoEvals evaluator functions
*
* This type helper extracts the parameter type from an AutoEvals evaluator
* and omits the standard parameters (input, output, expected) that are
* handled by the adapter, leaving only the additional configuration parameters.
*
* @template E - The AutoEvals evaluator function type
*/
type Params<E> = Parameters<
E extends (...args: any[]) => any ? E : never
>[0] extends infer P
? Omit<P, "input" | "output" | "expected">
  : never;

The adapter automatically handles the parameter name differences between AutoEvals and Langfuse:
| AutoEvals Parameter | Langfuse Parameter | Description |
|---|---|---|
| input | input | The input data passed to the task |
| output | output | The output produced by the task |
| expected | expectedOutput | The expected/ground truth output |
Additional parameters specified in the params argument are passed through to the AutoEvals evaluator without modification.
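For intuition, the following is a minimal conceptual sketch of what the adapter does. It is not the actual implementation in @langfuse/client; the AutoevalsLike type and sketchAdapter name are illustrative only:

import { Evaluator } from '@langfuse/client';

// Shape of an AutoEvals-style evaluator, for illustration purposes.
type AutoevalsLike = (args: Record<string, any>) => Promise<{
  name: string;
  score?: number | null;
  metadata?: Record<string, any>;
}>;

// Conceptual sketch: map Langfuse arguments to AutoEvals arguments, call the
// evaluator, then map the AutoEvals score back to a Langfuse evaluation.
const sketchAdapter =
  (autoevalEvaluator: AutoevalsLike, params?: Record<string, any>): Evaluator =>
  async ({ input, output, expectedOutput }) => {
    const result = await autoevalEvaluator({
      input,
      output,
      expected: expectedOutput, // expectedOutput -> expected
      ...params                 // extra configuration is passed through unchanged
    });
    return {
      name: result.name,
      value: result.score ?? 0, // score -> value, defaulting to 0 if undefined
      metadata: result.metadata
    };
  };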
The adapter transforms AutoEvals results to Langfuse evaluation format:
// AutoEvals result format
{
name: string;
score: number;
metadata?: Record<string, any>;
}
// Transformed to Langfuse format
{
name: string;
value: number; // mapped from score, defaults to 0 if undefined
metadata?: Record<string, any>;
}

Use AutoEvals evaluators directly with Langfuse experiments:
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment
// Create wrapped evaluators
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);
// Use in experiment
const result = await langfuse.experiment.run({
name: "Capital Cities Test",
data: [
{ input: "France", expectedOutput: "Paris" },
{ input: "Germany", expectedOutput: "Berlin" },
{ input: "Japan", expectedOutput: "Tokyo" }
],
task: async ({ input }) => {
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [{
role: "user",
content: `What is the capital of ${input}?`
}]
});
return response.choices[0].message.content;
},
evaluators: [factualityEvaluator, levenshteinEvaluator]
});
console.log(await result.format());

Pass configuration parameters to AutoEvals evaluators:
import { Factuality, ClosedQA, Battle } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Configure Factuality evaluator with custom model
const factualityEvaluator = createEvaluatorFromAutoevals(
Factuality,
{ model: 'gpt-4o' }
);
// Configure ClosedQA with model and chain-of-thought
const closedQAEvaluator = createEvaluatorFromAutoevals(
ClosedQA,
{
model: 'gpt-4-turbo',
useCoT: true // Enable chain of thought reasoning
}
);
// Configure Battle evaluator for model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
Battle,
{
model: 'gpt-4',
instructions: 'Compare which response is more accurate and helpful'
}
);
await langfuse.experiment.run({
name: "Configured Evaluators Test",
data: qaDataset,
task: myTask,
evaluators: [
factualityEvaluator,
closedQAEvaluator,
battleEvaluator
]
});

Examples using popular AutoEvals evaluators:
import {
Factuality,
Levenshtein,
ClosedQA,
Battle,
Humor,
Security,
Sql,
ValidJson,
AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Text similarity and accuracy
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);
// Factuality checking (requires OpenAI)
const factualityEvaluator = createEvaluatorFromAutoevals(
Factuality,
{ model: 'gpt-4o' }
);
// Closed-domain QA evaluation
const closedQAEvaluator = createEvaluatorFromAutoevals(
ClosedQA,
{ model: 'gpt-4o' }
);
// Model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
Battle,
{ model: 'gpt-4' }
);
// Humor detection
const humorEvaluator = createEvaluatorFromAutoevals(
Humor,
{ model: 'gpt-4o' }
);
// Security checking
const securityEvaluator = createEvaluatorFromAutoevals(
Security,
{ model: 'gpt-4o' }
);
// SQL validation
const sqlEvaluator = createEvaluatorFromAutoevals(Sql);
// JSON validation
const jsonEvaluator = createEvaluatorFromAutoevals(ValidJson);
// Answer relevancy
const relevancyEvaluator = createEvaluatorFromAutoevals(
AnswerRelevancy,
{ model: 'gpt-4o' }
);
// Use multiple evaluators for comprehensive assessment
await langfuse.experiment.run({
name: "Comprehensive QA Evaluation",
data: qaDataset,
task: qaTask,
evaluators: [
levenshteinEvaluator,
factualityEvaluator,
closedQAEvaluator,
relevancyEvaluator
]
});

Use AutoEvals evaluators when running experiments on Langfuse datasets:
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
const langfuse = new LangfuseClient();
// Fetch dataset from Langfuse
const dataset = await langfuse.dataset.get("qa-evaluation-dataset");
// Run experiment with AutoEvals evaluators
const result = await dataset.runExperiment({
name: "GPT-4 QA Evaluation",
description: "Evaluating GPT-4 performance on QA dataset",
task: async ({ input }) => {
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [{ role: "user", content: input }]
});
return response.choices[0].message.content;
},
evaluators: [
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(Levenshtein)
]
});
console.log(`Dataset Run URL: ${result.datasetRunUrl}`);
console.log(await result.format());

Mix AutoEvals evaluators with your own custom evaluation logic:
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';
// Custom evaluator
const exactMatchEvaluator: Evaluator = async ({ output, expectedOutput }) => ({
name: "exact_match",
value: output === expectedOutput ? 1 : 0,
comment: output === expectedOutput ? "Perfect match" : "No match"
});
// Custom evaluator with metadata
const lengthEvaluator: Evaluator = async ({ output, expectedOutput }) => {
const outputLen = output?.length || 0;
const expectedLen = expectedOutput?.length || 0;
const lengthDiff = Math.abs(outputLen - expectedLen);
return {
name: "length_similarity",
value: 1 - (lengthDiff / Math.max(outputLen, expectedLen, 1)),
metadata: {
outputLength: outputLen,
expectedLength: expectedLen,
difference: lengthDiff
}
};
};
// Custom multi-evaluation evaluator
const comprehensiveCustomEvaluator: Evaluator = async ({
input,
output,
expectedOutput
}) => {
return [
{
name: "contains_expected",
value: output.includes(expectedOutput) ? 1 : 0
},
{
name: "case_sensitive_match",
value: output === expectedOutput ? 1 : 0
},
{
name: "case_insensitive_match",
value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0
}
];
};
// Combine everything
await langfuse.experiment.run({
name: "Mixed Evaluators Experiment",
data: dataset,
task: myTask,
evaluators: [
// AutoEvals evaluators
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(Levenshtein),
// Custom evaluators
exactMatchEvaluator,
lengthEvaluator,
comprehensiveCustomEvaluator
]
});

Configure AutoEvals evaluators for specific domains:
import { Factuality, ClosedQA, Security, ValidJson, Sql, Humor, AnswerRelevancy } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Medical QA evaluation
const medicalQAEvaluators = [
createEvaluatorFromAutoevals(Factuality, {
model: 'gpt-4o',
// Additional context can be provided through metadata
}),
createEvaluatorFromAutoevals(ClosedQA, {
model: 'gpt-4-turbo',
useCoT: true
})
];
await langfuse.experiment.run({
name: "Medical QA Evaluation",
description: "Evaluating medical question answering accuracy",
data: medicalQADataset,
task: medicalQATask,
evaluators: medicalQAEvaluators
});
// Code generation evaluation
const codeGenerationEvaluators = [
createEvaluatorFromAutoevals(Security, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(ValidJson), // If generating JSON
createEvaluatorFromAutoevals(Sql) // If generating SQL
];
await langfuse.experiment.run({
name: "Code Generation Quality",
description: "Evaluating generated code for security and validity",
data: codeGenDataset,
task: codeGenTask,
evaluators: codeGenerationEvaluators
});
// Creative writing evaluation
const creativeWritingEvaluators = [
createEvaluatorFromAutoevals(Humor, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(AnswerRelevancy, { model: 'gpt-4o' })
];
await langfuse.experiment.run({
name: "Creative Writing Assessment",
description: "Evaluating creative writing quality",
data: writingPromptsDataset,
task: writingTask,
evaluators: creativeWritingEvaluators
});

Run experiments with AutoEvals evaluators and concurrency limits:
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
const result = await langfuse.experiment.run({
name: "Large Scale Evaluation",
data: largeDataset, // 1000+ items
task: myTask,
evaluators: [
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(Levenshtein),
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
],
maxConcurrency: 10 // Limit concurrent task executions
});
// Evaluators run in parallel for each item
// But only 10 items are processed concurrently
console.log(`Processed ${result.itemResults.length} items`);
console.log(await result.format());

The most common pattern for using AutoEvals evaluators:
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
const langfuse = new LangfuseClient();
// Step 1: Wrap AutoEvals evaluators
const evaluators = [
createEvaluatorFromAutoevals(Factuality),
createEvaluatorFromAutoevals(Levenshtein)
];
// Step 2: Run experiment
const result = await langfuse.experiment.run({
name: "My Experiment",
data: myData,
task: myTask,
evaluators
});
// Step 3: Review results
console.log(await result.format());

Use when you need to pass custom parameters to AutoEvals:
import { Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Configure evaluators with custom parameters
const evaluators = [
createEvaluatorFromAutoevals(Factuality, {
model: 'gpt-4o',
// model will be passed to AutoEvals Factuality evaluator
}),
createEvaluatorFromAutoevals(ClosedQA, {
model: 'gpt-4-turbo',
useCoT: true
})
];
await langfuse.experiment.run({
name: "Configured Evaluation",
data: myData,
task: myTask,
evaluators
});

Combine AutoEvals evaluators with custom evaluation logic:
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator, Evaluation } from '@langfuse/client';
const hybridEvaluators = [
// Use AutoEvals for complex evaluations
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
// Use custom evaluators for domain-specific logic
async ({ output, expectedOutput, metadata }): Promise<Evaluation> => ({
name: "business_rule_check",
value: checkBusinessRules(output, metadata) ? 1 : 0,
comment: "Domain-specific business rule validation"
})
];
await langfuse.experiment.run({
name: "Hybrid Evaluation",
data: myData,
task: myTask,
evaluators: hybridEvaluators
});

Start with simple evaluators and add more complex ones:
import { Levenshtein, Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Phase 1: Quick evaluation with simple metrics
const quickEvaluators = [
createEvaluatorFromAutoevals(Levenshtein)
];
const quickResult = await langfuse.experiment.run({
name: "Quick Evaluation - Phase 1",
data: myData,
task: myTask,
evaluators: quickEvaluators
});
// Analyze quick results...
console.log(await quickResult.format());
// Phase 2: Deep evaluation with LLM-based metrics
const deepEvaluators = [
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o', useCoT: true })
];
const deepResult = await langfuse.experiment.run({
name: "Deep Evaluation - Phase 2",
data: myData,
task: myTask,
evaluators: deepEvaluators
});
console.log(await deepResult.format());

Select AutoEvals evaluators that match your evaluation needs:
// For factual accuracy - use Factuality
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })
// For text similarity - use Levenshtein
createEvaluatorFromAutoevals(Levenshtein)
// For closed-domain QA - use ClosedQA
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
// For comparing two outputs - use Battle
createEvaluatorFromAutoevals(Battle, { model: 'gpt-4' })
// For code validation - use Sql, ValidJson, etc.
createEvaluatorFromAutoevals(ValidJson)

Always specify model parameters for LLM-based AutoEvals evaluators:
// Good: Explicit model configuration
const evaluator = createEvaluatorFromAutoevals(Factuality, {
model: 'gpt-4o'
});
// Less ideal: Relying on defaults (may vary)
const evaluator = createEvaluatorFromAutoevals(Factuality);

Combine different types of evaluators for comprehensive assessment:
const evaluators = [
// Fast, deterministic evaluators
createEvaluatorFromAutoevals(Levenshtein),
createEvaluatorFromAutoevals(ValidJson),
// LLM-based evaluators
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' }),
// Custom domain-specific evaluators
customBusinessLogicEvaluator
];

Be mindful of API costs when using LLM-based AutoEvals evaluators:
// For large datasets, start with cheaper evaluators
const result = await langfuse.experiment.run({
name: "Cost-Conscious Evaluation",
data: largeDataset,
task: myTask,
evaluators: [
// Free/cheap evaluators
createEvaluatorFromAutoevals(Levenshtein),
// Use GPT-4 selectively or use cheaper models
createEvaluatorFromAutoevals(Factuality, {
model: 'gpt-3.5-turbo' // Cheaper alternative
})
],
maxConcurrency: 5 // Control rate limiting
});

Remember that the adapter automatically maps parameters:
// Your Langfuse data
const data = [
{
input: "What is 2+2?",
expectedOutput: "4" // Note: expectedOutput (Langfuse format)
}
];
// AutoEvals receives:
// {
// input: "What is 2+2?",
// output: <task result>,
// expected: "4" // Automatically mapped from expectedOutput
// }
const evaluator = createEvaluatorFromAutoevals(Factuality);

Test AutoEvals evaluators with sample data before full experiments:
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Create evaluator
const factualityEvaluator = createEvaluatorFromAutoevals(
Factuality,
{ model: 'gpt-4o' }
);
// Test with sample data
const testResult = await langfuse.experiment.run({
name: "Evaluator Test",
data: [
{ input: "Test input", expectedOutput: "Test output" }
],
task: async () => "Test result",
evaluators: [factualityEvaluator]
});
console.log(await testResult.format());
// Verify evaluator works as expected before scaling up
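Because the wrapped evaluator is an ordinary async function, you can also invoke it directly on a single hand-written example before running any experiment. A quick sketch, assuming the evaluator argument shape shown throughout this page:

// Call the wrapped evaluator directly with a single example (no experiment run).
const singleEvaluation = await factualityEvaluator({
  input: "What is the capital of France?",
  output: "Paris",
  expectedOutput: "Paris"
});
console.log(singleEvaluation); // e.g. { name: "Factuality", value: 1, metadata: { ... } }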
Track evaluation scores across experiments:

const result = await langfuse.experiment.run({
name: "Production Evaluation",
data: productionDataset,
task: productionTask,
evaluators: [
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(Levenshtein)
]
});
// Analyze scores
const factualityScores = result.itemResults
.flatMap(r => r.evaluations)
.filter(e => e.name === 'Factuality')
.map(e => e.value);
const avgFactuality = factualityScores.reduce((a, b) => a + b, 0)
/ factualityScores.length;
console.log(`Average Factuality Score: ${avgFactuality}`);
// View detailed results in Langfuse UI
if (result.datasetRunUrl) {
console.log(`View results: ${result.datasetRunUrl}`);
}
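To compare runs, you can compute a per-evaluator average for each experiment result and diff them. A rough sketch — baselineResult and candidateResult are illustrative names for two results returned by separate langfuse.experiment.run() calls:

// Average a named evaluation score across all items of an experiment result.
const averageScore = (experimentResult: typeof result, scoreName: string): number => {
  const values = experimentResult.itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === scoreName)
    .map(e => Number(e.value));
  return values.reduce((a, b) => a + b, 0) / Math.max(values.length, 1);
};

// Compare two runs (baselineResult and candidateResult are assumed to exist).
console.log(`Baseline Factuality:  ${averageScore(baselineResult, 'Factuality')}`);
console.log(`Candidate Factuality: ${averageScore(candidateResult, 'Factuality')}`);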
The adapter provides full TypeScript type safety through the Params<E> utility type:

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
// Type-safe parameter inference
const evaluator = createEvaluatorFromAutoevals(
Factuality,
{
model: 'gpt-4o', // ✓ Valid parameter
temperature: 0.7, // ✓ Valid parameter (if supported by Factuality)
// @ts-expect-error: input/output/expected are handled by adapter
input: "test", // ✗ Error: input is omitted from params
output: "test", // ✗ Error: output is omitted from params
expected: "test" // ✗ Error: expected is omitted from params
}
);
// The Params<E> type automatically:
// 1. Extracts parameter type from the evaluator function
// 2. Omits 'input', 'output', and 'expected' fields
// 3. Leaves only additional configuration parameters

The adapter handles evaluation failures gracefully:
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
const result = await langfuse.experiment.run({
name: "Error Handling Test",
data: myData,
task: myTask,
evaluators: [
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
createEvaluatorFromAutoevals(Levenshtein)
]
});
// If one evaluator fails, others continue
// Failed evaluations are omitted from results
result.itemResults.forEach(item => {
console.log(`Item evaluations: ${item.evaluations.length}`);
// May have fewer evaluations if some failed
});
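If you want a failed evaluation to be recorded rather than omitted, you can wrap the converted evaluator yourself. A hedged sketch — withFallback is an illustrative helper, not part of @langfuse/client:

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

// Illustrative helper: catch evaluator errors and return a fallback evaluation
// (score 0 with an explanatory comment) instead of dropping the result.
const withFallback = (evaluator: Evaluator, name: string): Evaluator =>
  async (args) => {
    try {
      return await evaluator(args);
    } catch (error) {
      return {
        name,
        value: 0,
        comment: `Evaluation failed: ${error instanceof Error ? error.message : String(error)}`
      };
    }
  };

const safeFactuality = withFallback(
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  'Factuality'
);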
To use the AutoEvals integration, you need:

npm install autoevals
npm install @langfuse/client

// Set up environment variables for LLM-based evaluators
// export OPENAI_API_KEY=your_openai_api_key
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
// LLM-based evaluators will use OPENAI_API_KEY from environment
const factualityEvaluator = createEvaluatorFromAutoevals(
Factuality,
{ model: 'gpt-4o' }
);

The AutoEvals adapter provides:
- Automatic parameter mapping between AutoEvals and Langfuse evaluator arguments
- Result transformation from the AutoEvals score format to the Langfuse evaluation format
- Type-safe configuration via the Params<E> utility type
- Compatibility with both langfuse.experiment.run() and dataset.runExperiment()

This adapter enables you to leverage the comprehensive suite of AutoEvals metrics without writing custom evaluation code, while maintaining full compatibility with Langfuse's experiment system.
Install with Tessl CLI
npx tessl i tessl/npm-langfuse--client@4.2.1