
tessl/npm-langfuse--client

Langfuse API client for universal JavaScript environments providing observability, prompt management, datasets, experiments, and scoring capabilities


docs/experiments.md

Experiment Execution

The Experiment Execution system provides a framework for running experiments that test models or tasks against datasets, with support for automatic evaluation, scoring, tracing, and result analysis. It enables systematic testing and comparison of AI models and prompts.

Capabilities

Run Experiment

Execute an experiment by running a task on each data item and evaluating the results with full tracing integration.

/**
 * Executes an experiment by running a task on each data item and evaluating the results
 *
 * This method orchestrates the complete experiment lifecycle:
 * 1. Executes the task function on each data item with proper tracing
 * 2. Runs item-level evaluators on each task output
 * 3. Executes run-level evaluators on the complete result set
 * 4. Links results to dataset runs (for Langfuse datasets)
 * 5. Stores all scores and traces in Langfuse
 *
 * @param config - The experiment configuration
 * @returns Promise that resolves to experiment results including itemResults, runEvaluations, and format function
 */
run<Input = any, ExpectedOutput = any, Metadata extends Record<string, any> = Record<string, any>>(
  config: ExperimentParams<Input, ExpectedOutput, Metadata>
): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>;

Usage Examples:

import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Basic experiment with custom data
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  description: "Testing model knowledge of world capitals",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});

console.log(await result.format());

// Experiment on Langfuse dataset
const dataset = await langfuse.dataset.get("qa-dataset");

const datasetResult = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Testing GPT-4 on our QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      const exact = output === expectedOutput;
      const caseInsensitive = output.toLowerCase() === expectedOutput.toLowerCase();
      return {
        name: "accuracy",
        value: caseInsensitive ? 1 : 0,
        comment: exact
          ? "Exact match"
          : caseInsensitive
            ? "Case-insensitive match"
            : "No match"
      };
    }
  ]
});

// Multiple evaluators
const multiEvalResult = await langfuse.experiment.run({
  name: "Translation Quality Test",
  data: [
    { input: "Hello world", expectedOutput: "Hola mundo" },
    { input: "Good morning", expectedOutput: "Buenos días" }
  ],
  task: async ({ input }) => translateText(input, 'es'),
  evaluators: [
    // Evaluator 1: Exact match
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    }),
    // Evaluator 2: BLEU score
    async ({ output, expectedOutput }) => ({
      name: "bleu_score",
      value: calculateBleuScore(output, expectedOutput),
      comment: "Translation quality metric"
    }),
    // Evaluator 3: Length similarity
    async ({ output, expectedOutput }) => ({
      name: "length_similarity",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    })
  ]
});

// Run-level evaluators for aggregate analysis
const aggregateResult = await langfuse.experiment.run({
  name: "Sentiment Classification",
  data: sentimentDataset,
  task: classifySentiment,
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    })
  ],
  runEvaluators: [
    // Average accuracy across all items
    async ({ itemResults }) => {
      const accuracyScores = itemResults
        .flatMap(r => r.evaluations)
        .filter(e => e.name === "accuracy")
        .map(e => e.value as number);

      const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;

      return {
        name: "average_accuracy",
        value: average,
        comment: `Overall accuracy: ${(average * 100).toFixed(1)}%`
      };
    },
    // Precision calculation
    async ({ itemResults }) => {
      let truePositives = 0;
      let falsePositives = 0;

      for (const result of itemResults) {
        if (result.output === "positive") {
          if (result.expectedOutput === "positive") {
            truePositives++;
          } else {
            falsePositives++;
          }
        }
      }

      const precision = truePositives / (truePositives + falsePositives);

      return {
        name: "precision",
        value: precision,
        comment: `Precision for positive class: ${(precision * 100).toFixed(1)}%`
      };
    }
  ]
});

// Concurrency control with maxConcurrency
const largeScaleResult = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  description: "Processing 1000 items with rate limiting",
  data: largeDataset,
  task: expensiveModelCall,
  maxConcurrency: 5, // Process max 5 items simultaneously
  evaluators: [accuracyEvaluator]
});
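`maxConcurrency` caps how many task invocations are in flight at once. As an illustration of those semantics only (a hypothetical helper, not the SDK's internal implementation), a worker-pool limiter can be sketched as:

```typescript
// Runs `fn` over `items` with at most `limit` calls in flight at a time,
// preserving result order by index. Each worker repeatedly claims the next
// unprocessed index until none remain.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

Lower limits trade throughput for predictable load on rate-limited APIs, which is the same trade-off `maxConcurrency: 5` makes above.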

// Custom run name
const customRunResult = await langfuse.experiment.run({
  name: "Model Comparison",
  runName: "gpt-4-turbo-2024-01-15",
  description: "Testing latest GPT-4 Turbo model",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// With metadata
const metadataResult = await langfuse.experiment.run({
  name: "Parameter Sweep",
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    max_tokens: 1000,
    experiment_version: "v2.1"
  },
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Formatting results
const formattedResult = await langfuse.experiment.run({
  name: "Test Run",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Format summary only (default)
console.log(await formattedResult.format());

// Format with detailed item results
console.log(await formattedResult.format({ includeItemResults: true }));

// Access raw results
console.log(`Processed ${formattedResult.itemResults.length} items`);
console.log(`Run evaluations:`, formattedResult.runEvaluations);
console.log(`Dataset run URL:`, formattedResult.datasetRunUrl);

OpenTelemetry Integration:

The experiment system automatically integrates with OpenTelemetry for distributed tracing:

import { LangfuseClient } from '@langfuse/client';

// Ensure OpenTelemetry is configured
const langfuse = new LangfuseClient();

// Experiments automatically create traces for each task execution
const result = await langfuse.experiment.run({
  name: "Traced Experiment",
  data: testData,
  task: async ({ input }) => {
    // This task execution is automatically wrapped in a trace
    // with name "experiment-item-run"
    const output = await processInput(input);
    return output;
  },
  evaluators: [myEvaluator]
});

// Each item result includes trace information
for (const itemResult of result.itemResults) {
  console.log(`Trace ID: ${itemResult.traceId}`);
  const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
  console.log(`View trace: ${traceUrl}`);
}

// Warning if OpenTelemetry is not set up
// The system will log:
// "OpenTelemetry has not been set up. Traces will not be sent to Langfuse."

Error Handling:

// Task errors are caught and logged
const resilientResult = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: async ({ input }) => {
    try {
      return await riskyOperation(input);
    } catch (error) {
      // Task errors are caught, logged, and item is skipped
      throw error;
    }
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      try {
        return {
          name: "score",
          value: calculateScore(output, expectedOutput)
        };
      } catch (error) {
        // Evaluator errors are caught and logged
        // Other evaluators continue to run
        throw error;
      }
    }
  ]
});

// Result contains only successfully processed items
console.log(`Successfully processed: ${resilientResult.itemResults.length} items`);

// Run evaluators also handle errors gracefully
const robustResult = await langfuse.experiment.run({
  name: "Robust Experiment",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      try {
        return {
          name: "aggregate_metric",
          value: calculateAggregate(itemResults)
        };
      } catch (error) {
        // Run evaluator errors are caught and logged
        // Other run evaluators continue to run
        throw error;
      }
    }
  ]
});

Type Definitions

ExperimentParams

Configuration parameters for experiment execution.

type ExperimentParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Human-readable name for the experiment.
   *
   * This name will appear in Langfuse UI and experiment results.
   * Choose a descriptive name that identifies the experiment's purpose.
   */
  name: string;

  /**
   * Optional exact name for the experiment run.
   *
   * If provided, this will be used as the exact dataset run name if the data
   * contains Langfuse dataset items. If not provided, this will default to
   * the experiment name appended with an ISO timestamp.
   */
  runName?: string;

  /**
   * Optional description explaining the experiment's purpose.
   *
   * Provide context about what you're testing, methodology, or goals.
   * This helps with experiment tracking and result interpretation.
   */
  description?: string;

  /**
   * Optional metadata to attach to the experiment run.
   *
   * Store additional context like model versions, hyperparameters,
   * or any other relevant information for analysis and comparison.
   */
  metadata?: Record<string, any>;

  /**
   * Array of data items to process.
   *
   * Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse.
   * Each item should contain input data and optionally expected output.
   */
  data: ExperimentItem<Input, ExpectedOutput, Metadata>[];

  /**
   * The task function to execute on each data item.
   *
   * This function receives input data and produces output that will be evaluated.
   * It should encapsulate the model or system being tested.
   */
  task: ExperimentTask<Input, ExpectedOutput, Metadata>;

  /**
   * Optional array of evaluator functions to assess each item's output.
   *
   * Each evaluator receives input, output, and expected output (if available)
   * and returns evaluation results. Multiple evaluators enable comprehensive assessment.
   */
  evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Optional array of run-level evaluators to assess the entire experiment.
   *
   * These evaluators receive all item results and can perform aggregate analysis
   * like calculating averages, detecting patterns, or statistical analysis.
   */
  runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Maximum number of concurrent task executions (default: Infinity).
   *
   * Controls parallelism to manage resource usage and API rate limits.
   * Set lower values for expensive operations or rate-limited services.
   */
  maxConcurrency?: number;
};

Usage Examples:

import type { ExperimentParams } from '@langfuse/client';

// Type-safe experiment configuration
const config: ExperimentParams<string, string> = {
  name: "Capital Cities",
  description: "Testing geography knowledge",
  metadata: {
    model: "gpt-4",
    version: "v1.0"
  },
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" }
  ],
  task: async ({ input }) => {
    return await getCapital(input);
  },
  evaluators: [exactMatchEvaluator],
  runEvaluators: [averageScoreEvaluator],
  maxConcurrency: 3
};

await langfuse.experiment.run(config);

// Generic types for complex data
interface CustomInput {
  question: string;
  context: string[];
}

interface CustomOutput {
  answer: string;
  confidence: number;
}

interface CustomMetadata {
  category: string;
  difficulty: "easy" | "medium" | "hard";
}

const typedConfig: ExperimentParams<CustomInput, CustomOutput, CustomMetadata> = {
  name: "QA with Context",
  data: [
    {
      input: {
        question: "What is AI?",
        context: ["AI stands for Artificial Intelligence"]
      },
      expectedOutput: {
        answer: "Artificial Intelligence",
        confidence: 0.95
      },
      metadata: {
        category: "technology",
        difficulty: "easy"
      }
    }
  ],
  task: async ({ input }) => {
    // Type-safe input and output
    return await qaModel(input.question, input.context);
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      return {
        name: "accuracy",
        value: output.answer === expectedOutput?.answer ? 1 : 0
      };
    }
  ]
};

ExperimentTask

Function type for experiment tasks that process input data and return output.

/**
 * Function type for experiment tasks that process input data and return output
 *
 * The task function is the core component being tested in an experiment.
 * It receives either an ExperimentItem or DatasetItem and produces output
 * that will be evaluated.
 *
 * @param params - Either an ExperimentItem or DatasetItem containing input and metadata
 * @returns Promise resolving to the task's output (any type)
 */
type ExperimentTask<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: ExperimentTaskParams<Input, ExpectedOutput, Metadata>
) => Promise<any>;

type ExperimentTaskParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = ExperimentItem<Input, ExpectedOutput, Metadata>;

Usage Examples:

import type { ExperimentTask } from '@langfuse/client';

// Simple task function
const simpleTask: ExperimentTask = async ({ input }) => {
  return await processInput(input);
};

// Task with type safety
const typedTask: ExperimentTask<string, string> = async ({ input, metadata }) => {
  // input is typed as string
  // metadata is typed as Record<string, any>
  return await processString(input);
};

// Task accessing expected output (for reference)
const referenceTask: ExperimentTask = async ({ input, expectedOutput }) => {
  // Can access expectedOutput for context (but shouldn't use it for cheating!)
  console.log(`Processing input, expecting: ${expectedOutput}`);
  return await myModel(input);
};

// Task with custom types
interface QuestionInput {
  question: string;
  context: string;
}

const qaTask: ExperimentTask<QuestionInput, string> = async ({ input, metadata }) => {
  const { question, context } = input;
  return await answerQuestion(question, context);
};

// Task handling both ExperimentItem and DatasetItem
const universalTask: ExperimentTask = async (item) => {
  // Works with both types
  const input = item.input;
  const meta = item.metadata || {};

  // Check if it's a DatasetItem (has id and datasetId)
  if ('id' in item && 'datasetId' in item) {
    console.log(`Processing dataset item: ${item.id}`);
  }

  return await process(input, meta);
};

// Task with error handling
const robustTask: ExperimentTask = async ({ input }) => {
  try {
    return await riskyOperation(input);
  } catch (error) {
    console.error(`Task failed for input:`, input, error);
    throw error; // Re-throw to skip this item
  }
};

// Task with nested tracing
const tracedTask: ExperimentTask = async ({ input }) => {
  // Nested operations are automatically traced
  const step1 = await preprocessInput(input);
  const step2 = await modelInference(step1);
  const step3 = await postprocess(step2);
  return step3;
};

ExperimentItem

Data item type for experiment inputs, supporting both custom items and Langfuse dataset items.

/**
 * Experiment data item or dataset item
 *
 * Can be either a custom item with input/expectedOutput/metadata
 * or a DatasetItem from Langfuse
 */
type ExperimentItem<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> =
  | {
      /**
       * The input data to pass to the task function.
       *
       * Can be any type - string, object, array, etc. This data will be passed
       * to your task function as the `input` parameter.
       */
      input?: Input;

      /**
       * The expected output for evaluation purposes.
       *
       * Optional ground truth or reference output for this input.
       * Used by evaluators to assess task performance.
       */
      expectedOutput?: ExpectedOutput;

      /**
       * Optional metadata to attach to the experiment item.
       *
       * Store additional context, tags, or custom data related to this specific item.
       * This metadata will be available in traces and evaluators.
       */
      metadata?: Metadata;
    }
  | DatasetItem;

Usage Examples:

import type { ExperimentItem } from '@langfuse/client';

// Simple string items
const stringItems: ExperimentItem<string, string>[] = [
  { input: "Hello", expectedOutput: "Hola" },
  { input: "Goodbye", expectedOutput: "Adiós" }
];

// Complex structured items
interface QAInput {
  question: string;
  context: string;
}

const qaItems: ExperimentItem<QAInput, string>[] = [
  {
    input: {
      question: "What is AI?",
      context: "AI stands for Artificial Intelligence..."
    },
    expectedOutput: "Artificial Intelligence"
  }
];

// Items with metadata
const itemsWithMetadata: ExperimentItem<string, string, { category: string }>[] = [
  {
    input: "Test input",
    expectedOutput: "Expected output",
    metadata: {
      category: "technology"
    }
  }
];

// Items without expected output (evaluators receive no ground truth)
const noExpectedOutput: ExperimentItem<string, never>[] = [
  { input: "Generate creative text" }
  // No expectedOutput - evaluators won't have ground truth
];

// Mixed with Langfuse dataset items
const dataset = await langfuse.dataset.get("my-dataset");
const mixedItems: ExperimentItem[] = [
  // Custom items
  { input: "Custom input", expectedOutput: "Custom output" },
  // Dataset items
  ...dataset.items
];

// Accessing item properties in task
const task: ExperimentTask = async (item) => {
  if ('id' in item && 'datasetId' in item) {
    // It's a DatasetItem
    console.log(`Dataset item ID: ${item.id}`);
    console.log(`Dataset ID: ${item.datasetId}`);
  }

  return await process(item.input);
};

Evaluator

Function type for item-level evaluators that assess individual task outputs.

/**
 * Evaluator function for item-level evaluation
 *
 * Receives input, output, expected output, and metadata,
 * and returns evaluation results as Evaluation object(s).
 *
 * @param params - Parameters including input, output, expectedOutput, and metadata
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type Evaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: EvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type EvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original input data passed to the task.
   *
   * Use this for context-aware evaluations or input-output relationship analysis.
   */
  input: Input;

  /**
   * The output produced by the task.
   *
   * This is the actual result returned by your task function.
   */
  output: any;

  /**
   * The expected output for comparison (optional).
   *
   * This is the ground truth or expected result for the given input.
   */
  expectedOutput?: ExpectedOutput;

  /**
   * Optional metadata attached to the experiment item.
   *
   * Available when the data item carried a metadata object; useful for
   * category-aware or context-dependent evaluations.
   */
  metadata?: Metadata;
};

Usage Examples:

import type { Evaluator, Evaluation } from '@langfuse/client';

// Simple exact match evaluator
const exactMatch: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Case-insensitive match with comment
const caseInsensitiveMatch: Evaluator = async ({ output, expectedOutput }) => {
  const match = output.toLowerCase() === expectedOutput?.toLowerCase();
  return {
    name: "case_insensitive_match",
    value: match ? 1 : 0,
    comment: match ? "Perfect match" : "No match"
  };
};

// Evaluator returning multiple scores
const comprehensiveEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  return [
    {
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "length_match",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    },
    {
      name: "similarity",
      value: calculateSimilarity(output, expectedOutput),
      comment: "Cosine similarity score"
    }
  ];
};

// Type-safe evaluator
const typedEvaluator: Evaluator<string, string> = async ({ input, output, expectedOutput }) => {
  // All parameters are typed
  return {
    name: "accuracy",
    value: output === expectedOutput ? 1 : 0,
    metadata: { input_length: input.length }
  };
};

// Evaluator using input context
const contextAwareEvaluator: Evaluator = async ({ input, output }) => {
  const isValid = validateOutput(output, input);
  return {
    name: "validity",
    value: isValid ? 1 : 0,
    comment: isValid ? "Output valid for input" : "Output invalid"
  };
};

// Evaluator with metadata
const categoryEvaluator: Evaluator<any, any, { category: string }> = async ({
  output,
  expectedOutput,
  metadata
}) => {
  const score = calculateScore(output, expectedOutput);
  return {
    name: "category_score",
    value: score,
    metadata: {
      category: metadata?.category,
      timestamp: new Date().toISOString()
    }
  };
};

// Evaluator with different data types
const numericEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const error = Math.abs(output - expectedOutput);
  return {
    name: "absolute_error",
    value: error,
    dataType: "numeric",
    comment: `Error: ${error.toFixed(2)}`
  };
};

// Boolean evaluator
const booleanEvaluator: Evaluator = async ({ output }) => {
  return {
    name: "is_valid",
    value: validateFormat(output),
    dataType: "boolean"
  };
};

// Evaluator with error handling
const robustEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    const score = complexCalculation(output, expectedOutput);
    return {
      name: "complex_score",
      value: score
    };
  } catch (error) {
    console.error("Evaluator failed:", error);
    throw error; // Will be caught and logged by experiment system
  }
};

// LLM-as-judge evaluator
const llmJudgeEvaluator: Evaluator = async ({ input, output, expectedOutput }) => {
  const judgmentPrompt = `
    Input: ${input}
    Expected: ${expectedOutput}
    Actual: ${output}

    Rate the quality from 0 to 1:
  `;

  const judgment = await llm.evaluate(judgmentPrompt);

  return {
    name: "llm_judgment",
    value: parseFloat(judgment),
    comment: `LLM evaluation of output quality`
  };
};

RunEvaluator

Function type for run-level evaluators that assess the entire experiment.

/**
 * Evaluator function for run-level evaluation
 *
 * Receives all item results and performs aggregate analysis
 * across the entire experiment run.
 *
 * @param params - Parameters including all itemResults
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type RunEvaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: RunEvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type RunEvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Results from all processed experiment items.
   *
   * Each item contains the input, output, evaluations, and metadata from
   * processing a single data item. Use this for aggregate analysis,
   * statistical calculations, and cross-item comparisons.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];
};

Usage Examples:

import type { RunEvaluator } from '@langfuse/client';

// Average score evaluator
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((a, b) => a + b, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy: ${(average * 100).toFixed(1)}%`
  };
};

// Multiple run-level metrics
const comprehensiveRunEvaluator: RunEvaluator = async ({ itemResults }) => {
  const accuracyScores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;
  const min = Math.min(...accuracyScores);
  const max = Math.max(...accuracyScores);
  const stdDev = calculateStdDev(accuracyScores);

  return [
    {
      name: "average_accuracy",
      value: average
    },
    {
      name: "min_accuracy",
      value: min
    },
    {
      name: "max_accuracy",
      value: max
    },
    {
      name: "std_dev_accuracy",
      value: stdDev,
      comment: "Standard deviation of accuracy scores"
    }
  ];
};

// Precision and recall
const precisionRecallEvaluator: RunEvaluator = async ({ itemResults }) => {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const result of itemResults) {
    if (result.output === "positive") {
      if (result.expectedOutput === "positive") {
        truePositives++;
      } else {
        falsePositives++;
      }
    } else if (result.expectedOutput === "positive") {
      falseNegatives++;
    }
  }

  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = 2 * (precision * recall) / (precision + recall);

  return [
    {
      name: "precision",
      value: precision,
      comment: `Precision: ${(precision * 100).toFixed(1)}%`
    },
    {
      name: "recall",
      value: recall,
      comment: `Recall: ${(recall * 100).toFixed(1)}%`
    },
    {
      name: "f1_score",
      value: f1,
      comment: `F1 Score: ${(f1 * 100).toFixed(1)}%`
    }
  ];
};
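As a sanity check on the precision/recall arithmetic above, the same counting logic can be run on a toy result set of plain objects (no Langfuse types involved):

```typescript
// Toy item results: predicted label in `output`, ground truth in `expectedOutput`.
const toy = [
  { output: "positive", expectedOutput: "positive" }, // true positive
  { output: "positive", expectedOutput: "negative" }, // false positive
  { output: "negative", expectedOutput: "positive" }, // false negative
  { output: "negative", expectedOutput: "negative" }, // true negative
];

let tp = 0;
let fp = 0;
let fn = 0;
for (const r of toy) {
  if (r.output === "positive") {
    if (r.expectedOutput === "positive") tp++;
    else fp++;
  } else if (r.expectedOutput === "positive") {
    fn++;
  }
}

const precision = tp / (tp + fp); // 1 / 2 = 0.5
const recall = tp / (tp + fn);    // 1 / 2 = 0.5
const f1 = (2 * precision * recall) / (precision + recall); // 0.5
```

Note that when the task never predicts "positive", `tp + fp` is 0 and precision becomes `NaN`; a production run evaluator should guard those denominators.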

// Category-based analysis
const categoryAnalysisEvaluator: RunEvaluator<any, any, { category: string }> = async ({
  itemResults
}) => {
  const categories = new Map<string, number[]>();

  for (const result of itemResults) {
    const category = result.item.metadata?.category || "unknown";
    const accuracy = result.evaluations.find(e => e.name === "accuracy")?.value as number;

    if (!categories.has(category)) {
      categories.set(category, []);
    }
    categories.get(category)!.push(accuracy);
  }

  const evaluations: Evaluation[] = [];

  for (const [category, scores] of categories) {
    const average = scores.reduce((a, b) => a + b, 0) / scores.length;
    evaluations.push({
      name: `accuracy_${category}`,
      value: average,
      comment: `Average accuracy for ${category}: ${(average * 100).toFixed(1)}%`
    });
  }

  return evaluations;
};

// Percentile analysis
const percentileEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "score")
    .map(e => e.value as number)
    .sort((a, b) => a - b);

  const p50 = scores[Math.floor(scores.length * 0.5)];
  const p90 = scores[Math.floor(scores.length * 0.9)];
  const p95 = scores[Math.floor(scores.length * 0.95)];

  return [
    { name: "p50_score", value: p50, comment: "Median score" },
    { name: "p90_score", value: p90, comment: "90th percentile" },
    { name: "p95_score", value: p95, comment: "95th percentile" }
  ];
};

// Failure analysis
const failureAnalysisEvaluator: RunEvaluator = async ({ itemResults }) => {
  const failures = itemResults.filter(r => {
    const accuracy = r.evaluations.find(e => e.name === "accuracy")?.value;
    return accuracy === 0;
  });

  const failureRate = failures.length / itemResults.length;

  return {
    name: "failure_rate",
    value: failureRate,
    comment: `${failures.length} of ${itemResults.length} items failed (${(failureRate * 100).toFixed(1)}%)`
  };
};

// Cross-item consistency
const consistencyEvaluator: RunEvaluator = async ({ itemResults }) => {
  // Check if similar inputs produce similar outputs
  const consistency = analyzeConsistency(itemResults);

  return {
    name: "consistency_score",
    value: consistency,
    comment: "Consistency across similar inputs"
  };
};

Evaluation

Result type for evaluations returned by evaluator functions.

/**
 * Evaluation result from an evaluator
 *
 * Contains the score name, value, and optional metadata/comment
 */
type Evaluation = Pick<
  ScoreBody,
  "name" | "value" | "comment" | "metadata" | "dataType"
>;

Expanded, the picked fields have the following shape:

interface Evaluation {
  /**
   * Name of the evaluation metric
   *
   * Should be descriptive and unique within the evaluator set.
   */
  name: string;

  /**
   * Numeric or boolean value of the evaluation
   *
   * Typically 0-1 for accuracy/similarity scores, but can be any numeric value.
   */
  value: number | boolean;

  /**
   * Optional human-readable comment about the evaluation
   *
   * Useful for explaining the score or providing context.
   */
  comment?: string;

  /**
   * Optional metadata about the evaluation
   *
   * Store additional context or debugging information.
   */
  metadata?: Record<string, any>;

  /**
   * Optional data type specification
   *
   * Specifies how the value should be interpreted.
   */
  dataType?: "numeric" | "boolean" | "categorical";
}

Usage Examples:

import type { Evaluation } from '@langfuse/client';

// Simple numeric evaluation
const simpleEval: Evaluation = {
  name: "accuracy",
  value: 0.85
};

// Boolean evaluation
const booleanEval: Evaluation = {
  name: "passed",
  value: true,
  dataType: "boolean"
};

// Evaluation with comment
const commentedEval: Evaluation = {
  name: "similarity",
  value: 0.92,
  comment: "High similarity between output and expected"
};

// Evaluation with metadata
const metadataEval: Evaluation = {
  name: "response_quality",
  value: 0.88,
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    tokens: 150
  },
  comment: "Quality assessment using LLM judge"
};

// Multiple evaluation types
const multiEval: Evaluation[] = [
  {
    name: "exact_match",
    value: true,
    dataType: "boolean"
  },
  {
    name: "similarity",
    value: 0.95,
    dataType: "numeric",
    comment: "Cosine similarity"
  },
  {
    name: "category",
    value: 0,
    dataType: "categorical",
    metadata: { predicted: "A", actual: "B" }
  }
];

ExperimentResult

Complete result structure returned by the run() method.

/**
 * Complete result of an experiment execution
 *
 * Contains all results from processing the experiment data,
 * including individual item results, run-level evaluations,
 * and utilities for result visualization.
 */
type ExperimentResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The experiment run name.
   *
   * Either the provided runName parameter or generated name (experiment name + timestamp).
   */
  runName: string;

  /**
   * ID of the dataset run in Langfuse (only for experiments on Langfuse datasets).
   *
   * Use this ID to access the dataset run via the Langfuse API or UI.
   */
  datasetRunId?: string;

  /**
   * URL to the dataset run in the Langfuse UI (only for experiments on Langfuse datasets).
   *
   * Direct link to view the complete dataset run in the Langfuse web interface.
   */
  datasetRunUrl?: string;

  /**
   * Results from processing each individual data item.
   *
   * Contains the complete results for every item in your experiment data.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];

  /**
   * Results from run-level evaluators that assessed the entire experiment.
   *
   * Contains aggregate evaluations that analyze the complete experiment.
   */
  runEvaluations: Evaluation[];

  /**
   * Function to format experiment results in a human-readable format.
   *
   * @param options - Formatting options
   * @param options.includeItemResults - Whether to include individual item details (default: false)
   * @returns Promise resolving to formatted string representation
   */
  format: (options?: { includeItemResults?: boolean }) => Promise<string>;
};

Usage Examples:

import type { ExperimentResult } from '@langfuse/client';

// Run experiment and access results
const result: ExperimentResult = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageEvaluator]
});

// Access run name
console.log(`Run name: ${result.runName}`);
// "Test Experiment - 2024-01-15T10:30:00.000Z"

// Access individual item results
console.log(`Processed ${result.itemResults.length} items`);
for (const itemResult of result.itemResults) {
  console.log(`Input: ${itemResult.input}`);
  console.log(`Output: ${itemResult.output}`);
  console.log(`Evaluations:`, itemResult.evaluations);
}

// Access run-level evaluations
console.log(`Run evaluations:`, result.runEvaluations);
const avgAccuracy = result.runEvaluations.find(e => e.name === "average_accuracy");
console.log(`Average accuracy: ${avgAccuracy?.value}`);

// Format results (summary only)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (10 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
10 items
Evaluations:
  • accuracy

Average Scores:
  • accuracy: 0.850

Run Evaluations:
  • average_accuracy: 0.850
    💭 Average accuracy: 85.0%
*/

// Format with detailed results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input:    What is AI?
   Expected: Artificial Intelligence
   Actual:   Artificial Intelligence
   Scores:
     • accuracy: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
   ...

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
...
*/

// Access dataset run information (if applicable)
if (result.datasetRunId) {
  console.log(`Dataset run ID: ${result.datasetRunId}`);
  console.log(`View in UI: ${result.datasetRunUrl}`);
}

// Calculate custom metrics from results
const successRate = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 1)
).length / result.itemResults.length;
console.log(`Success rate: ${(successRate * 100).toFixed(1)}%`);

// Export results for further analysis
const exportData = result.itemResults.map(r => ({
  input: r.input,
  output: r.output,
  expectedOutput: r.expectedOutput,
  scores: Object.fromEntries(
    r.evaluations.map(e => [e.name, e.value])
  )
}));
await fs.writeFile('results.json', JSON.stringify(exportData, null, 2));

ExperimentItemResult

Result structure for individual item processing within an experiment.

/**
 * Result from processing one experiment item
 *
 * Contains the input, output, evaluations, and trace information
 * for a single data item.
 */
type ExperimentItemResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original experiment or dataset item that was processed.
   *
   * Contains the complete original item data.
   */
  item: ExperimentItem<Input, ExpectedOutput, Metadata>;

  /**
   * The input data (extracted from item for convenience)
   */
  input?: Input;

  /**
   * The expected output (extracted from item for convenience)
   */
  expectedOutput?: ExpectedOutput;

  /**
   * The actual output produced by the task.
   *
   * This is the result returned by your task function for this specific input.
   */
  output: any;

  /**
   * Results from all evaluators that ran on this item.
   *
   * Contains evaluation scores, comments, and metadata from each evaluator.
   */
  evaluations: Evaluation[];

  /**
   * Langfuse trace ID for this item's execution.
   *
   * Use this ID to view detailed execution traces in the Langfuse UI.
   */
  traceId?: string;

  /**
   * Dataset run ID if this item was part of a Langfuse dataset.
   *
   * Links this item result to a specific dataset run.
   */
  datasetRunId?: string;
};

Usage Examples:

import type { ExperimentItemResult } from '@langfuse/client';

// Process experiment results
const result = await langfuse.experiment.run(config);

// Each entry is an ExperimentItemResult
for (const itemResult of result.itemResults) {
  // Access item data
  console.log(`Processing item:`, itemResult.item);
  console.log(`Input:`, itemResult.input);
  console.log(`Expected:`, itemResult.expectedOutput);
  console.log(`Actual:`, itemResult.output);

  // Access evaluations
  for (const evaluation of itemResult.evaluations) {
    console.log(`${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`  Comment: ${evaluation.comment}`);
    }
  }

  // Access trace information
  if (itemResult.traceId) {
    const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
    console.log(`View trace: ${traceUrl}`);
  }

  // Access dataset run information
  if (itemResult.datasetRunId) {
    console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  }
}

// Filter failed items
const failedItems = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 0)
);
console.log(`Failed items: ${failedItems.length}`);

// Group by score
const highScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) >= 0.8)
);
const lowScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) < 0.5)
);

// Analyze patterns
const errorPatterns = failedItems.map(r => ({
  input: r.input,
  output: r.output,
  expected: r.expectedOutput
}));
console.log("Error patterns:", errorPatterns);
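The "Average Scores" that format() prints can also be computed by hand from itemResults. A minimal helper for this (not part of the SDK; EvaluationLike and ItemResultLike are local stand-ins for the real types) might look like:

```typescript
// Local stand-ins for the SDK's Evaluation / ExperimentItemResult shapes
interface EvaluationLike {
  name: string;
  value: number | boolean;
}

interface ItemResultLike {
  evaluations: EvaluationLike[];
}

// Average each metric across all item results.
// Boolean values are coerced to 0/1 so pass/fail metrics average cleanly.
function averageScores(itemResults: ItemResultLike[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of itemResults) {
    for (const e of r.evaluations) {
      const v = typeof e.value === "boolean" ? (e.value ? 1 : 0) : e.value;
      const s = (sums[e.name] ??= { total: 0, count: 0 });
      s.total += v;
      s.count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, s]) => [name, s.total / s.count])
  );
}
```

This is handy when exporting results or enforcing thresholds in CI without re-reading the formatted summary.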

Integration with AutoEvals

Create Langfuse-compatible evaluators from AutoEvals library evaluators.

/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter handles parameter mapping and result formatting automatically.
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 *
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;

Usage Examples:

import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Basic AutoEvals integration
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

await langfuse.experiment.run({
  name: "AutoEvals Integration Test",
  data: myDataset,
  task: myTask,
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

// With additional parameters
const customFactualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' } // Additional params for AutoEvals
);

await langfuse.experiment.run({
  name: "Factuality Test",
  data: testData,
  task: myTask,
  evaluators: [customFactualityEvaluator]
});

// Multiple AutoEvals evaluators
const closedQAEvaluator = createEvaluatorFromAutoevals(ClosedQA, {
  model: 'gpt-4',
  useCoT: true
});

const comprehensiveEvaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein),
  closedQAEvaluator
];

await langfuse.experiment.run({
  name: "Comprehensive Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: comprehensiveEvaluators
});

// Mixing AutoEvals and custom evaluators
await langfuse.experiment.run({
  name: "Mixed Evaluators",
  data: dataset,
  task: task,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluator
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});

Advanced Usage

Type Safety with Generics

Use TypeScript generics for full type safety across the experiment pipeline.

// Define your types
interface QuestionInput {
  question: string;
  context: string[];
}

interface AnswerOutput {
  answer: string;
  confidence: number;
  sources: string[];
}

interface ItemMetadata {
  category: "science" | "history" | "literature";
  difficulty: number;
  tags: string[];
}

// Type-safe experiment configuration
const result = await langfuse.experiment.run<
  QuestionInput,
  AnswerOutput,
  ItemMetadata
>({
  name: "Typed QA Experiment",
  data: [
    {
      input: {
        question: "What is photosynthesis?",
        context: ["Photosynthesis is the process..."]
      },
      expectedOutput: {
        answer: "A process where plants convert light to energy",
        confidence: 0.9,
        sources: ["biology textbook"]
      },
      metadata: {
        category: "science",
        difficulty: 5,
        tags: ["biology", "plants"]
      }
    }
  ],
  task: async ({ input, metadata }) => {
    // input is typed as QuestionInput
    // metadata is typed as ItemMetadata
    const { question, context } = input;
    const difficulty = metadata?.difficulty ?? 5;

    return await qaModel(question, context, difficulty);
    // Return type should match AnswerOutput
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      // input: QuestionInput
      // output: any (task output)
      // expectedOutput: AnswerOutput | undefined

      return {
        name: "answer_quality",
        value: output.confidence
      };
    }
  ]
});

// Result is typed as ExperimentResult<QuestionInput, AnswerOutput, ItemMetadata>
for (const itemResult of result.itemResults) {
  // itemResult.input is QuestionInput
  // itemResult.output is any
  // itemResult.expectedOutput is AnswerOutput | undefined
  console.log(itemResult.input.question);
  console.log(itemResult.expectedOutput?.confidence);
}

Parallel vs Sequential Execution

Control experiment execution parallelism with maxConcurrency.

// Fully parallel (default)
const parallelResult = await langfuse.experiment.run({
  name: "Parallel Execution",
  data: largeDataset,
  task: fastTask,
  evaluators: [evaluator]
  // maxConcurrency: Infinity (default)
});

// Sequential execution
const sequentialResult = await langfuse.experiment.run({
  name: "Sequential Execution",
  data: dataset,
  task: task,
  maxConcurrency: 1 // Process one item at a time
});

// Controlled parallelism
const controlledResult = await langfuse.experiment.run({
  name: "Rate Limited Execution",
  data: dataset,
  task: expensiveAPICall,
  maxConcurrency: 5 // Max 5 concurrent API calls
});

// Batched processing
const batchSize = 10;
const batchedResult = await langfuse.experiment.run({
  name: "Batched Processing",
  data: veryLargeDataset,
  task: task,
  maxConcurrency: batchSize // Process in batches of 10
});

Dataset Integration

Run experiments directly on Langfuse datasets with automatic linking.

// Get dataset
const dataset = await langfuse.dataset.get("my-dataset");

// Run experiment on dataset (automatic data parameter)
const result = await dataset.runExperiment({
  name: "GPT-4 Evaluation",
  task: async ({ input }) => {
    // Process dataset item
    return await model(input);
  },
  evaluators: [evaluator],
  runEvaluators: [averageEvaluator]
});

// Results are automatically linked to dataset run
console.log(`Dataset run ID: ${result.datasetRunId}`);
console.log(`View in UI: ${result.datasetRunUrl}`);

// Each item result is linked
for (const itemResult of result.itemResults) {
  console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  console.log(`Trace ID: ${itemResult.traceId}`);
}

// Compare multiple runs on same dataset
const run1 = await dataset.runExperiment({
  name: "Model A",
  runName: "model-a-run-1",
  task: modelA,
  evaluators: [evaluator]
});

const run2 = await dataset.runExperiment({
  name: "Model B",
  runName: "model-b-run-1",
  task: modelB,
  evaluators: [evaluator]
});

// Compare results
console.log("Model A avg:", run1.runEvaluations[0].value);
console.log("Model B avg:", run2.runEvaluations[0].value);

Result Formatting

Use the format() function to generate human-readable result summaries.

const result = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: task,
  evaluators: [evaluator],
  runEvaluators: [runEvaluator]
});

// Format summary (default)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (50 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
50 items
Evaluations:
  • accuracy
  • f1_score

Average Scores:
  • accuracy: 0.850
  • f1_score: 0.823

Run Evaluations:
  • average_accuracy: 0.850
    💭 Average accuracy: 85.0%
  • precision: 0.875
    💭 Precision: 87.5%

🔗 Dataset Run:
   https://cloud.langfuse.com/project/xxx/datasets/yyy/runs/zzz
*/

// Format with detailed item results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input:    What is the capital of France?
   Expected: Paris
   Actual:   Paris
   Scores:
     • exact_match: 1.000
     • similarity: 1.000

   Dataset Item:
   https://cloud.langfuse.com/project/xxx/datasets/yyy/items/123

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
   Input:    What is 2+2?
   Expected: 4
   Actual:   4
   Scores:
     • exact_match: 1.000
     • similarity: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/def456

... (50 items total)

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
... (summary as above)
*/

// Save formatted results to file
const formatted = await result.format({ includeItemResults: true });
await fs.writeFile('experiment-results.txt', formatted);

// Use in CI/CD
const ciSummary = await result.format();
console.log(ciSummary);
if (result.runEvaluations.some(e => e.name === "average_accuracy" && (e.value as number) < 0.8)) {
  throw new Error("Experiment failed: accuracy below threshold");
}

Error Handling Strategies

Implement robust error handling for production experiments.

// Task with retry logic
const resilientTask: ExperimentTask = async ({ input }) => {
  let lastError;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await apiCall(input);
    } catch (error) {
      lastError = error;
      await new Promise(resolve => setTimeout(resolve, 1000 * (attempt + 1)));
    }
  }
  throw lastError;
};

// Task with fallback
const fallbackTask: ExperimentTask = async ({ input }) => {
  try {
    return await primaryModel(input);
  } catch (error) {
    console.warn("Primary model failed, using fallback");
    return await fallbackModel(input);
  }
};

// Task with timeout
const timeoutTask: ExperimentTask = async ({ input }) => {
  return await Promise.race([
    modelCall(input),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), 30000)
    )
  ]);
};

// Evaluator with validation
const validatingEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    if (typeof output !== 'string' || typeof expectedOutput !== 'string') {
      throw new Error("Invalid output types");
    }

    return {
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    };
  } catch (error) {
    console.error("Evaluator validation failed:", error);
    return {
      name: "accuracy",
      value: 0,
      comment: `Validation error: ${(error as Error).message}`
    };
  }
};

// Run experiment with error tracking
const result = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: resilientTask,
  evaluators: [validatingEvaluator]
});

// Check for failures
const successCount = result.itemResults.length;
const totalCount = testData.length;
const failureCount = totalCount - successCount;

if (failureCount > 0) {
  console.warn(`${failureCount} items failed during experiment`);
}

Best Practices

Experiment Organization

// ✅ Good: Descriptive naming
await langfuse.experiment.run({
  name: "GPT-4 vs GPT-3.5 on QA Dataset",
  runName: "gpt-4-2024-01-15-temp-0.7",
  description: "Comparing model performance with temperature 0.7",
  metadata: {
    model_version: "gpt-4-0125-preview",
    temperature: 0.7,
    dataset_version: "v2.1"
  }
});

// ❌ Bad: Generic naming
await langfuse.experiment.run({
  name: "Test",
  data: data,
  task: task
});

Evaluator Design

// ✅ Good: Multiple focused evaluators
const evaluators = [
  // Simple binary check
  async ({ output, expectedOutput }) => ({
    name: "exact_match",
    value: output === expectedOutput ? 1 : 0
  }),
  // Similarity score
  async ({ output, expectedOutput }) => ({
    name: "cosine_similarity",
    value: calculateCosineSimilarity(output, expectedOutput)
  }),
  // Format validation
  async ({ output }) => ({
    name: "format_valid",
    value: validateFormat(output) ? 1 : 0
  })
];

// ❌ Bad: One complex evaluator doing everything
const badEvaluator = async ({ output, expectedOutput }) => ({
  name: "score",
  value: complexCalculation(output, expectedOutput)
  // Unclear what this represents
});

Concurrency Management

// ✅ Good: Appropriate concurrency limits
await langfuse.experiment.run({
  name: "Rate-Limited API Experiment",
  data: largeDataset,
  task: expensiveAPICall,
  maxConcurrency: 5, // Respect API rate limits
  evaluators: [evaluator]
});

// ✅ Good: High concurrency for local operations
await langfuse.experiment.run({
  name: "Local Model Experiment",
  data: dataset,
  task: localModelInference,
  maxConcurrency: 50, // Local model can handle high concurrency
  evaluators: [evaluator]
});

// ❌ Bad: No concurrency control for rate-limited API
await langfuse.experiment.run({
  name: "Uncontrolled Experiment",
  data: largeDataset,
  task: rateLimitedAPI
  // Will likely hit rate limits
});

Type Safety

// ✅ Good: Explicit types
interface Input {
  question: string;
  context: string;
}

interface Output {
  answer: string;
  confidence: number;
}

const result = await langfuse.experiment.run<Input, Output>({
  name: "Typed Experiment",
  data: [
    {
      input: { question: "...", context: "..." },
      expectedOutput: { answer: "...", confidence: 0.9 }
    }
  ],
  task: async ({ input }) => {
    // input is typed as Input
    return await processTyped(input);
  }
});

// ❌ Bad: Implicit any types
const result = await langfuse.experiment.run({
  name: "Untyped Experiment",
  data: [{ input: someData }],
  task: async ({ input }) => {
    // input is any
    return await process(input);
  }
});

Result Analysis

// ✅ Good: Use run evaluators for aggregates
await langfuse.experiment.run({
  name: "Analysis Experiment",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      // Calculate aggregate metrics
      const avg = calculateAverage(itemResults);
      const stdDev = calculateStdDev(itemResults);

      return [
        { name: "average", value: avg },
        { name: "std_dev", value: stdDev }
      ];
    }
  ]
});

// ❌ Bad: Manual aggregation after experiment
const result = await langfuse.experiment.run({
  name: "Manual Analysis",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator]
});

// Manually calculating aggregates (should use run evaluators)
const scores = result.itemResults.map(r => r.evaluations[0].value);
const avg = scores.reduce((a, b) => a + b) / scores.length;

Performance Considerations

Batching and Concurrency

  • Use maxConcurrency to control parallelism and avoid overwhelming external APIs
  • Default maxConcurrency: Infinity is suitable for local operations
  • Set maxConcurrency: 1 for sequential processing when order matters
  • Typical values: 3-10 for API calls, 20-100 for local operations
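The effect of maxConcurrency can be sketched with a small worker-pool helper: a fixed number of workers pull items from a shared index. This illustrates the pattern only; it is not the SDK's internal implementation, and mapWithConcurrency is a hypothetical name:

```typescript
// Process items with at most `limit` tasks in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded between awaits
  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers = Array.from({ length: workerCount }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

With limit = 1 this degenerates to sequential processing, matching the maxConcurrency: 1 behavior described above.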

Memory Management

  • Large datasets are processed in batches based on maxConcurrency
  • Each batch is processed completely before moving to the next
  • Failed items are logged and skipped, not stored in memory
  • Consider breaking very large experiments into multiple smaller runs
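Breaking a very large experiment into smaller runs can be as simple as chunking the data and launching one run per chunk. A sketch (the chunk helper and the runName suffix are illustrative, not part of the SDK):

```typescript
// Split data into fixed-size chunks so each experiment run stays small.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Then run one experiment per chunk, e.g.:
// for (const [i, part] of chunk(veryLargeDataset, 500).entries()) {
//   await langfuse.experiment.run({
//     name: "Large Experiment",
//     runName: `large-experiment-part-${i + 1}`, // illustrative naming scheme
//     data: part,
//     task,
//     evaluators,
//   });
// }
```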

Tracing Overhead

  • OpenTelemetry tracing adds minimal overhead (~1-5ms per item)
  • Traces are sent asynchronously and don't block experiment execution
  • Disable tracing for maximum performance (though not recommended)
  • Use flush() to ensure all traces are sent before shutdown

Evaluator Performance

  • Item-level evaluators run in parallel with task execution
  • Failed evaluators don't block other evaluators
  • LLM-as-judge evaluators can be slow; use maxConcurrency to control them
  • Run-level evaluators execute sequentially after all items complete
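The "failed evaluators don't block other evaluators" behavior can be approximated with Promise.allSettled: each evaluator runs independently and only fulfilled results are kept. Again a sketch of the pattern under local stand-in types, not the SDK internals:

```typescript
type EvaluationLike = { name: string; value: number | boolean };

// Run independent evaluators and keep only the ones that succeeded,
// so a single throwing evaluator cannot sink the whole item.
async function collectEvaluations(
  evaluators: Array<() => Promise<EvaluationLike>>
): Promise<EvaluationLike[]> {
  const settled = await Promise.allSettled(evaluators.map((run) => run()));
  return settled
    .filter(
      (s): s is PromiseFulfilledResult<EvaluationLike> =>
        s.status === "fulfilled"
    )
    .map((s) => s.value);
}
```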
