AutoEvals Integration

The AutoEvals Integration provides a seamless adapter for using evaluators from the AutoEvals library with Langfuse experiments. This adapter handles parameter mapping and result formatting automatically, allowing you to leverage battle-tested evaluation metrics without writing custom evaluation code.

Capabilities

createEvaluatorFromAutoevals

Convert AutoEvals evaluators to Langfuse-compatible evaluator functions with automatic parameter mapping.

/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter function bridges the gap between AutoEvals library evaluators
 * and Langfuse experiment evaluators, handling parameter mapping and result
 * formatting automatically.
 *
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 * This function handles the parameter name mapping transparently.
 *
 * The adapter also transforms AutoEvals result format (with `name`, `score`,
 * and `metadata` fields) to Langfuse evaluation format (with `name`, `value`,
 * and `metadata` fields).
 *
 * @template E - Type of the AutoEvals evaluator function
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;

/**
 * Utility type to extract parameter types from AutoEvals evaluator functions
 *
 * This type helper extracts the parameter type from an AutoEvals evaluator
 * and omits the standard parameters (input, output, expected) that are
 * handled by the adapter, leaving only the additional configuration parameters.
 *
 * @template E - The AutoEvals evaluator function type
 */
type Params<E> = Parameters<
  E extends (...args: any[]) => any ? E : never
>[0] extends infer P
  ? Omit<P, "input" | "output" | "expected">
  : never;
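
As an illustration of how this resolves (assuming, for the sake of example, that Factuality's options object is roughly `{ input, output, expected, model }` — the actual AutoEvals signature may differ):

// Illustrative only: if Factuality's argument type were roughly
// { input?: string; output: string; expected?: string; model?: string },
// the utility type would strip the three standard fields:
//
// type FactualityParams = Params<typeof Factuality>;
// // => { model?: string }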

Parameter Mapping

The adapter automatically handles the parameter name differences between AutoEvals and Langfuse:

| AutoEvals Parameter | Langfuse Parameter | Description                       |
|---------------------|--------------------|-----------------------------------|
| input               | input              | The input data passed to the task |
| output              | output             | The output produced by the task   |
| expected            | expectedOutput     | The expected/ground truth output  |

Additional parameters specified in the params argument are passed through to the AutoEvals evaluator without modification.
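
For example, assuming Factuality accepts a model option, the extra key is forwarded untouched alongside the mapped fields:

const evaluator = createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' });

// At evaluation time the adapter effectively calls:
// Factuality({ input, output, expected: expectedOutput, model: 'gpt-4o' })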

Result Transformation

The adapter transforms AutoEvals results to Langfuse evaluation format:

// AutoEvals result format
{
  name: string;
  score: number;
  metadata?: Record<string, any>;
}

// Transformed to Langfuse format
{
  name: string;
  value: number;  // mapped from score, defaults to 0 if undefined
  metadata?: Record<string, any>;
}
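
Conceptually, the adapter behaves like the following sketch. This is illustrative only, based on the mapping and transformation rules described above — not the actual @langfuse/client implementation (adaptSketch and the loose parameter types are assumptions for demonstration):

import type { Evaluator } from '@langfuse/client';

// Sketch of the adapter's documented behavior; the real implementation
// in @langfuse/client may differ in detail.
function adaptSketch(
  autoevalEvaluator: (args: Record<string, any>) => Promise<{
    name: string;
    score?: number;
    metadata?: Record<string, any>;
  }>,
  params?: Record<string, any>
): Evaluator {
  return async ({ input, output, expectedOutput }) => {
    // Map Langfuse's `expectedOutput` to AutoEvals' `expected`,
    // merging any additional configuration parameters.
    const result = await autoevalEvaluator({
      input,
      output,
      expected: expectedOutput,
      ...params
    });

    // Map AutoEvals' `score` to Langfuse's `value`, defaulting to 0.
    return {
      name: result.name,
      value: result.score ?? 0,
      metadata: result.metadata
    };
  };
}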

Usage Examples

Basic Usage

Use AutoEvals evaluators directly with Langfuse experiments:

import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Create wrapped evaluators
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Use in experiment
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

console.log(await result.format());

With Additional Parameters

Pass configuration parameters to AutoEvals evaluators:

import { Factuality, ClosedQA, Battle } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure Factuality evaluator with custom model
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Configure ClosedQA with model and chain-of-thought
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  {
    model: 'gpt-4-turbo',
    useCoT: true  // Enable chain of thought reasoning
  }
);

// Configure Battle evaluator for model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  {
    model: 'gpt-4',
    instructions: 'Compare which response is more accurate and helpful'
  }
);

await langfuse.experiment.run({
  name: "Configured Evaluators Test",
  data: qaDataset,
  task: myTask,
  evaluators: [
    factualityEvaluator,
    closedQAEvaluator,
    battleEvaluator
  ]
});

Common AutoEvals Evaluators

Examples using popular AutoEvals evaluators:

import {
  Factuality,
  Levenshtein,
  ClosedQA,
  Battle,
  Humor,
  Security,
  Sql,
  ValidJson,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Text similarity and accuracy
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Factuality checking (requires OpenAI)
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Closed-domain QA evaluation
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  { model: 'gpt-4o' }
);

// Model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  { model: 'gpt-4' }
);

// Humor detection
const humorEvaluator = createEvaluatorFromAutoevals(
  Humor,
  { model: 'gpt-4o' }
);

// Security checking
const securityEvaluator = createEvaluatorFromAutoevals(
  Security,
  { model: 'gpt-4o' }
);

// SQL validation
const sqlEvaluator = createEvaluatorFromAutoevals(Sql);

// JSON validation
const jsonEvaluator = createEvaluatorFromAutoevals(ValidJson);

// Answer relevancy
const relevancyEvaluator = createEvaluatorFromAutoevals(
  AnswerRelevancy,
  { model: 'gpt-4o' }
);

// Use multiple evaluators for comprehensive assessment
await langfuse.experiment.run({
  name: "Comprehensive QA Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: [
    levenshteinEvaluator,
    factualityEvaluator,
    closedQAEvaluator,
    relevancyEvaluator
  ]
});

With Langfuse Datasets

Use AutoEvals evaluators when running experiments on Langfuse datasets:

import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Fetch dataset from Langfuse
const dataset = await langfuse.dataset.get("qa-evaluation-dataset");

// Run experiment with AutoEvals evaluators
const result = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Evaluating GPT-4 performance on QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

console.log(`Dataset Run URL: ${result.datasetRunUrl}`);
console.log(await result.format());

Combining AutoEvals and Custom Evaluators

Mix AutoEvals evaluators with your own custom evaluation logic:

import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

// Custom evaluator
const exactMatchEvaluator: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0,
  comment: output === expectedOutput ? "Perfect match" : "No match"
});

// Custom evaluator with metadata
const lengthEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const outputLen = output?.length || 0;
  const expectedLen = expectedOutput?.length || 0;
  const lengthDiff = Math.abs(outputLen - expectedLen);

  return {
    name: "length_similarity",
    value: 1 - (lengthDiff / Math.max(outputLen, expectedLen, 1)),
    metadata: {
      outputLength: outputLen,
      expectedLength: expectedLen,
      difference: lengthDiff
    }
  };
};

// Custom multi-evaluation evaluator
const comprehensiveCustomEvaluator: Evaluator = async ({
  input,
  output,
  expectedOutput
}) => {
  return [
    {
      name: "contains_expected",
      value: output.includes(expectedOutput) ? 1 : 0
    },
    {
      name: "case_sensitive_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "case_insensitive_match",
      value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0
    }
  ];
};

// Combine everything
await langfuse.experiment.run({
  name: "Mixed Evaluators Experiment",
  data: dataset,
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluators
    exactMatchEvaluator,
    lengthEvaluator,
    comprehensiveCustomEvaluator
  ]
});

Advanced: Domain-Specific Evaluations

Configure AutoEvals evaluators for specific domains:

import { Factuality, ClosedQA, Security, Sql, ValidJson, Humor, AnswerRelevancy } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Medical QA evaluation
const medicalQAEvaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o',
    // Additional context can be provided through metadata
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Medical QA Evaluation",
  description: "Evaluating medical question answering accuracy",
  data: medicalQADataset,
  task: medicalQATask,
  evaluators: medicalQAEvaluators
});

// Code generation evaluation
const codeGenerationEvaluators = [
  createEvaluatorFromAutoevals(Security, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ValidJson), // If generating JSON
  createEvaluatorFromAutoevals(Sql) // If generating SQL
];

await langfuse.experiment.run({
  name: "Code Generation Quality",
  description: "Evaluating generated code for security and validity",
  data: codeGenDataset,
  task: codeGenTask,
  evaluators: codeGenerationEvaluators
});

// Creative writing evaluation
const creativeWritingEvaluators = [
  createEvaluatorFromAutoevals(Humor, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(AnswerRelevancy, { model: 'gpt-4o' })
];

await langfuse.experiment.run({
  name: "Creative Writing Assessment",
  description: "Evaluating creative writing quality",
  data: writingPromptsDataset,
  task: writingTask,
  evaluators: creativeWritingEvaluators
});

Parallel Evaluation with Concurrency Control

Run experiments with AutoEvals evaluators and concurrency limits:

import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  data: largeDataset, // 1000+ items
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
  ],
  maxConcurrency: 10 // Limit concurrent task executions
});

// Evaluators run in parallel for each item
// But only 10 items are processed concurrently
console.log(`Processed ${result.itemResults.length} items`);
console.log(await result.format());

Integration Patterns

Pattern 1: Standard AutoEvals Integration

The most common pattern for using AutoEvals evaluators:

import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Step 1: Wrap AutoEvals evaluators
const evaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein)
];

// Step 2: Run experiment
const result = await langfuse.experiment.run({
  name: "My Experiment",
  data: myData,
  task: myTask,
  evaluators
});

// Step 3: Review results
console.log(await result.format());

Pattern 2: Configured AutoEvals Integration

Use when you need to pass custom parameters to AutoEvals:

import { Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure evaluators with custom parameters
const evaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o',
    // model will be passed to AutoEvals Factuality evaluator
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Configured Evaluation",
  data: myData,
  task: myTask,
  evaluators
});

Pattern 3: Hybrid Evaluation Strategy

Combine AutoEvals evaluators with custom evaluation logic:

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluation, Evaluator } from '@langfuse/client';

const hybridEvaluators: Evaluator[] = [
  // Use AutoEvals for complex evaluations
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),

  // Use custom evaluators for domain-specific logic
  // (checkBusinessRules below is a placeholder for your own validation)
  async ({ output, expectedOutput, metadata }): Promise<Evaluation> => ({
    name: "business_rule_check",
    value: checkBusinessRules(output, metadata) ? 1 : 0,
    comment: "Domain-specific business rule validation"
  })
];

await langfuse.experiment.run({
  name: "Hybrid Evaluation",
  data: myData,
  task: myTask,
  evaluators: hybridEvaluators
});

Pattern 4: Progressive Evaluation

Start with simple evaluators and add more complex ones:

import { Levenshtein, Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Phase 1: Quick evaluation with simple metrics
const quickEvaluators = [
  createEvaluatorFromAutoevals(Levenshtein)
];

const quickResult = await langfuse.experiment.run({
  name: "Quick Evaluation - Phase 1",
  data: myData,
  task: myTask,
  evaluators: quickEvaluators
});

// Analyze quick results...
console.log(await quickResult.format());

// Phase 2: Deep evaluation with LLM-based metrics
const deepEvaluators = [
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o', useCoT: true })
];

const deepResult = await langfuse.experiment.run({
  name: "Deep Evaluation - Phase 2",
  data: myData,
  task: myTask,
  evaluators: deepEvaluators
});

console.log(await deepResult.format());

Best Practices

1. Choose Appropriate Evaluators

Select AutoEvals evaluators that match your evaluation needs:

// For factual accuracy - use Factuality
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })

// For text similarity - use Levenshtein
createEvaluatorFromAutoevals(Levenshtein)

// For closed-domain QA - use ClosedQA
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })

// For comparing two outputs - use Battle
createEvaluatorFromAutoevals(Battle, { model: 'gpt-4' })

// For code validation - use Sql, ValidJson, etc.
createEvaluatorFromAutoevals(ValidJson)

2. Configure Model Parameters

Always specify model parameters for LLM-based AutoEvals evaluators:

// Good: Explicit model configuration
const configuredEvaluator = createEvaluatorFromAutoevals(Factuality, {
  model: 'gpt-4o'
});

// Less ideal: Relying on defaults (may vary between AutoEvals versions)
const defaultEvaluator = createEvaluatorFromAutoevals(Factuality);

3. Mix Evaluator Types

Combine different types of evaluators for comprehensive assessment:

const evaluators = [
  // Fast, deterministic evaluators
  createEvaluatorFromAutoevals(Levenshtein),
  createEvaluatorFromAutoevals(ValidJson),

  // LLM-based evaluators
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' }),

  // Custom domain-specific evaluators
  customBusinessLogicEvaluator
];

4. Handle Evaluation Costs

Be mindful of API costs when using LLM-based AutoEvals evaluators:

// For large datasets, start with cheaper evaluators
const result = await langfuse.experiment.run({
  name: "Cost-Conscious Evaluation",
  data: largeDataset,
  task: myTask,
  evaluators: [
    // Free/cheap evaluators
    createEvaluatorFromAutoevals(Levenshtein),

    // Use GPT-4 selectively or use cheaper models
    createEvaluatorFromAutoevals(Factuality, {
      model: 'gpt-3.5-turbo'  // Cheaper alternative
    })
  ],
  maxConcurrency: 5 // Control rate limiting
});

5. Understand Parameter Mapping

Remember that the adapter automatically maps parameters:

// Your Langfuse data
const data = [
  {
    input: "What is 2+2?",
    expectedOutput: "4"  // Note: expectedOutput (Langfuse format)
  }
];

// AutoEvals receives:
// {
//   input: "What is 2+2?",
//   output: <task result>,
//   expected: "4"  // Automatically mapped from expectedOutput
// }

const evaluator = createEvaluatorFromAutoevals(Factuality);

6. Test Evaluators Individually

Test AutoEvals evaluators with sample data before full experiments:

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Create evaluator
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Test with sample data
const testResult = await langfuse.experiment.run({
  name: "Evaluator Test",
  data: [
    { input: "Test input", expectedOutput: "Test output" }
  ],
  task: async () => "Test result",
  evaluators: [factualityEvaluator]
});

console.log(await testResult.format());
// Verify evaluator works as expected before scaling up

7. Monitor Evaluation Results

Track evaluation scores across experiments:

const result = await langfuse.experiment.run({
  name: "Production Evaluation",
  data: productionDataset,
  task: productionTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// Analyze scores
const factualityScores = result.itemResults
  .flatMap(r => r.evaluations)
  .filter(e => e.name === 'Factuality')
  .map(e => e.value);

const avgFactuality = factualityScores.reduce((a, b) => a + b, 0)
  / factualityScores.length;

console.log(`Average Factuality Score: ${avgFactuality}`);

// View detailed results in Langfuse UI
if (result.datasetRunUrl) {
  console.log(`View results: ${result.datasetRunUrl}`);
}

Type Safety

The adapter provides full TypeScript type safety through the Params<E> utility type:

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Type-safe parameter inference
const evaluator = createEvaluatorFromAutoevals(
  Factuality,
  {
    model: 'gpt-4o'  // ✓ Valid: additional configuration parameters are allowed
  }
);

// These would be compile-time errors, because the adapter supplies them:
// createEvaluatorFromAutoevals(Factuality, { input: "test" });    // ✗ 'input' is omitted from params
// createEvaluatorFromAutoevals(Factuality, { output: "test" });   // ✗ 'output' is omitted from params
// createEvaluatorFromAutoevals(Factuality, { expected: "test" }); // ✗ 'expected' is omitted from params

// The Params<E> type automatically:
// 1. Extracts parameter type from the evaluator function
// 2. Omits 'input', 'output', and 'expected' fields
// 3. Leaves only additional configuration parameters

Error Handling

The adapter handles evaluation failures gracefully:

import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Error Handling Test",
  data: myData,
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// If one evaluator fails, others continue
// Failed evaluations are omitted from results
result.itemResults.forEach(item => {
  console.log(`Item evaluations: ${item.evaluations.length}`);
  // May have fewer evaluations if some failed
});
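
If you would rather see failures as explicit zero scores than as missing evaluations, you can add a wrapper of your own around the adapted evaluator. A minimal sketch, reusing the imports above (withFallback is a hypothetical helper, not part of @langfuse/client):

import type { Evaluator } from '@langfuse/client';

// Hypothetical helper: convert a thrown error into a zero-score
// evaluation with an explanatory comment instead of dropping it.
function withFallback(name: string, evaluator: Evaluator): Evaluator {
  return async (args) => {
    try {
      return await evaluator(args);
    } catch (error) {
      return {
        name,
        value: 0,
        comment: `Evaluation failed: ${error instanceof Error ? error.message : String(error)}`
      };
    }
  };
}

const safeFactuality = withFallback(
  "Factuality",
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })
);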

Requirements

To use the AutoEvals integration, you need:

  1. Install AutoEvals: npm install autoevals
  2. Install Langfuse Client: npm install @langfuse/client
  3. API Keys: Configure API keys for LLM-based evaluators (e.g., an OpenAI API key for Factuality, ClosedQA, etc.)

// Set up environment variables for LLM-based evaluators
// export OPENAI_API_KEY=your_openai_api_key

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

// LLM-based evaluators will use OPENAI_API_KEY from environment
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

Related Documentation

  • Experiment Execution - Complete experiment system documentation
  • Evaluator Types - Understanding evaluator functions
  • Dataset Management - Working with Langfuse datasets
  • AutoEvals Library - Official AutoEvals documentation

Summary

The AutoEvals adapter provides:

  • Automatic Parameter Mapping: Transparently maps Langfuse parameters to AutoEvals format
  • Result Transformation: Converts AutoEvals results to Langfuse evaluation format
  • Type Safety: Full TypeScript support with the Params<E> utility type
  • Seamless Integration: Works with both langfuse.experiment.run() and dataset.runExperiment()
  • Flexible Configuration: Pass custom parameters to AutoEvals evaluators
  • Hybrid Evaluation: Mix AutoEvals and custom evaluators in the same experiment

This adapter enables you to leverage the comprehensive suite of AutoEvals metrics without writing custom evaluation code, while maintaining full compatibility with Langfuse's experiment system.
