LangSmith Evaluation

LangSmith Evaluation provides a comprehensive framework for evaluating AI applications and LLM outputs. It offers both simple dataset-based evaluation and advanced comparative evaluation across multiple experiments, with support for custom evaluators and various evaluation patterns.

Package Information

  • Package Name: langsmith
  • Package Type: npm
  • Language: TypeScript
  • Installation: npm install langsmith
  • Module: langsmith/evaluation

Core Imports

import {
  evaluate,
  evaluateComparative,
  StringEvaluator,
  DynamicRunEvaluator
} from "langsmith/evaluation";

For CommonJS:

const {
  evaluate,
  evaluateComparative,
  StringEvaluator,
  DynamicRunEvaluator
} = require("langsmith/evaluation");

Basic Usage

import { evaluate } from "langsmith/evaluation";

// Define your target function
async function myLLMApp(input: { question: string }) {
  // Your LLM logic here
  return { answer: "The capital is Paris" };
}

// Create a custom evaluator
const correctnessEvaluator = ({ run, example }) => {
  const correct = run.outputs?.answer === example?.outputs?.answer;
  return {
    key: "correctness",
    score: correct ? 1 : 0
  };
};

// Run evaluation
const results = await evaluate(myLLMApp, {
  data: "my-dataset-name",
  evaluators: [correctnessEvaluator],
  experiment_name: "my-experiment"
});

Architecture

LangSmith Evaluation is built around several key components:

  • Evaluation Functions: Core functions (evaluate, evaluateComparative) for running evaluations against datasets and experiments
  • Evaluator Types: Flexible evaluator system supporting functions, classes, and custom implementations
  • Run-Based Evaluation: Evaluators that operate on complete run traces with full context
  • Summary Evaluation: Aggregate evaluators that analyze results across all examples
  • Comparative Analysis: Tools for comparing multiple experiments side-by-side
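
The sketch below shows how these pieces fit together in one pass; it reuses the evaluate API documented on this page, and answerQuestion, the dataset name, and the scoring logic are illustrative placeholders.

import { evaluate } from "langsmith/evaluation";

// Target function: maps example inputs to outputs (stand-in for real LLM logic)
async function answerQuestion(inputs: { question: string }) {
  return { answer: `Echo: ${inputs.question}` };
}

// Run-based evaluator: scores a single run against its reference example
const exactMatch = ({ run, example }) => ({
  key: "exact_match",
  score: run.outputs?.answer === example?.outputs?.answer ? 1 : 0
});

// Summary evaluator: aggregates per-run scores into one experiment-level metric
const passRate = (results) => ({
  key: "pass_rate",
  score: results.filter(r => r.score === 1).length / results.length
});

// Evaluation function: ties target, dataset, and evaluators into an experiment
await evaluate(answerQuestion, {
  data: "qa-smoke-test",
  evaluators: [exactMatch],
  summary_evaluators: [passRate],
  experiment_prefix: "architecture-demo"
});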

Capabilities

Dataset Evaluation

Run a target function against a dataset of examples. The evaluation automatically creates an experiment and tracks all results.

function evaluate<Inputs, Output>(
  target: TargetT<Inputs, Output>,
  options: EvaluateOptions
): Promise<EvaluationResults>;

Comparative Evaluation

Compare multiple experiments using specialized comparative evaluators to determine which performs better.

function evaluateComparative(
  experiments: string[],
  options: EvaluateComparativeOptions
): Promise<ComparisonEvaluationResults>;

Custom Evaluators

Create custom evaluators using functions or classes for run-based evaluation.

class StringEvaluator {
  evaluateStrings(params: StringEvaluatorParams): Promise<EvaluationResult>;
}

class DynamicRunEvaluator {
  evaluateRun(run: Run, example?: Example): Promise<EvaluationResult>;
}

Evaluation Interfaces

Type-safe interfaces for evaluation options, results, and evaluator implementations.

interface EvaluateOptions {
  data?: DataT;
  evaluators?: EvaluatorT[];
  summary_evaluators?: SummaryEvaluatorT[];
  metadata?: KVMap;
  experiment_prefix?: string;
  experiment_name?: string;
  description?: string;
  max_concurrency?: number;
  client?: Client;
  num_repetitions?: number;
  blocking?: boolean;
}

interface EvaluationResult {
  key?: string;
  score?: number | boolean;
  value?: string | number | boolean | object;
  comment?: string;
  correction?: object;
  evaluatorInfo?: object;
  sourceRunId?: string;
}

interface EvaluationResults {
  results: ExperimentResultRow[];
  summaryResults?: object;
}

Dataset Evaluation

Run a target function against a dataset of examples. The evaluation system automatically creates an experiment, runs your target function on each example, and applies evaluators to score the results.

Evaluate Function

Creates and runs a complete evaluation experiment against a dataset.

/**
 * Run evaluation on a target function against a dataset
 * @param target - The function to evaluate (receives example inputs, returns outputs)
 * @param options - Evaluation configuration options
 * @returns Promise resolving to evaluation results with scores and summaries
 */
function evaluate<Inputs, Output>(
  target: TargetT<Inputs, Output>,
  options: EvaluateOptions
): Promise<EvaluationResults>;

type TargetT<Inputs, Output> = (inputs: Inputs) => Output | Promise<Output>;

interface EvaluateOptions {
  /** Dataset name, ID, or array of examples to evaluate against */
  data?: DataT;
  /** List of evaluators to score each run */
  evaluators?: EvaluatorT[];
  /** Evaluators that run on aggregate results across all examples */
  summary_evaluators?: SummaryEvaluatorT[];
  /** Metadata to attach to the experiment */
  metadata?: KVMap;
  /** Prefix for auto-generated experiment names */
  experiment_prefix?: string;
  /** Explicit experiment name (overrides prefix) */
  experiment_name?: string;
  /** Description of the experiment */
  description?: string;
  /** Maximum number of concurrent evaluations (default: 10) */
  max_concurrency?: number;
  /** LangSmith client instance (uses default if not provided) */
  client?: Client;
  /** Number of times to repeat evaluation on each example */
  num_repetitions?: number;
  /** Whether to block until evaluation completes (default: true) */
  blocking?: boolean;
}

type DataT = string | AsyncIterable<Example> | Example[];

/**
 * Note: Common errors when using evaluate():
 * - "Data not provided in this experiment" - occurs when the `data` parameter
 *   is missing or invalid. Ensure you provide a dataset name, dataset ID, or
 *   array of examples.
 * - Required parameters: Both `target` function and `data` (dataset name or
 *   examples array) must be provided.
 * - The `evaluators` array is required to score the evaluation results.
 */
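
A minimal defensive sketch of these requirements, reusing myLLMApp and correctnessEvaluator from Basic Usage above; the dataset name is a placeholder and the error handling is illustrative only.

import { evaluate } from "langsmith/evaluation";

const datasetName = "qa-regression-set"; // must name an existing dataset

try {
  await evaluate(myLLMApp, {
    data: datasetName,                  // omitting `data` triggers "Data not provided in this experiment"
    evaluators: [correctnessEvaluator]  // include at least one evaluator so results are scored
  });
} catch (err) {
  // Configuration problems (missing dataset, invalid credentials) surface here
  console.error("Evaluation failed:", err);
  throw err;
}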

Usage Examples:

import { evaluate } from "langsmith/evaluation";
import { Client } from "langsmith";

// Basic evaluation with dataset name
async function summarize(input: { text: string }) {
  // Your LLM summarization logic
  return { summary: "Generated summary..." };
}

const results = await evaluate(summarize, {
  data: "summarization-test-set",
  evaluators: [lengthEvaluator, coherenceEvaluator],
  experiment_name: "summarization-v1"
});

// Evaluation with custom dataset
const examples = [
  { inputs: { question: "What is 2+2?" }, outputs: { answer: "4" } },
  { inputs: { question: "What is 3+3?" }, outputs: { answer: "6" } }
];

await evaluate(myMathBot, {
  data: examples,
  evaluators: [correctnessEvaluator],
  max_concurrency: 5
});

// Evaluation with summary evaluators
await evaluate(myClassifier, {
  data: "classification-dataset",
  evaluators: [accuracyEvaluator],
  summary_evaluators: [
    (results) => ({
      key: "overall_accuracy",
      score: results.filter(r => r.score === 1).length / results.length
    })
  ],
  metadata: { model: "gpt-4", temperature: 0.7 }
});

// Evaluation with repetitions for variance analysis
await evaluate(myStochasticModel, {
  data: "test-dataset",
  evaluators: [consistencyEvaluator],
  num_repetitions: 3,
  description: "Testing model consistency across runs"
});

EvaluateOptions Interface

Configuration options for the evaluate function.

/**
 * Configuration options for dataset evaluation
 */
interface EvaluateOptions {
  /**
   * Dataset source: dataset name (string), dataset ID (string),
   * array of examples, or async iterable of examples
   */
  data?: DataT;

  /**
   * List of evaluators to apply to each run
   * Can be functions, RunEvaluator instances, or StringEvaluator instances
   */
  evaluators?: EvaluatorT[];

  /**
   * Evaluators that run once on all results to compute aggregate metrics
   */
  summary_evaluators?: SummaryEvaluatorT[];

  /**
   * Metadata key-value pairs to attach to the experiment
   */
  metadata?: KVMap;

  /**
   * Prefix for auto-generated experiment name (e.g., "gpt-4-" generates "gpt-4-20240115")
   */
  experiment_prefix?: string;

  /**
   * Explicit experiment name (overrides auto-generation and prefix)
   */
  experiment_name?: string;

  /**
   * Human-readable description of what this evaluation tests
   */
  description?: string;

  /**
   * Maximum number of examples to evaluate concurrently (default: 10)
   */
  max_concurrency?: number;

  /**
   * LangSmith client instance (creates default client if not provided)
   */
  client?: Client;

  /**
   * Number of times to run target function on each example (for variance analysis)
   */
  num_repetitions?: number;

  /**
   * Whether to wait for evaluation to complete before returning (default: true)
   * Set to false for async/background evaluation
   */
  blocking?: boolean;
}
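
Most of these options appear in examples elsewhere on this page; the sketch below focuses on client and blocking, which are not otherwise demonstrated. It reuses myLLMApp and correctnessEvaluator from Basic Usage and assumes LangSmith credentials are configured in the environment.

import { evaluate } from "langsmith/evaluation";
import { Client } from "langsmith";

// Explicit client instead of the default (reads credentials from the environment)
const client = new Client();

// blocking: false returns without waiting for every example to finish,
// so a long evaluation does not hold up the calling process
await evaluate(myLLMApp, {
  data: "regression-dataset",
  evaluators: [correctnessEvaluator],
  client,
  max_concurrency: 4,
  blocking: false
});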

EvaluationResults Interface

Results returned from an evaluation run.

/**
 * Results from a complete evaluation run
 */
interface EvaluationResults {
  /** Array of results for each example-run pair */
  results: ExperimentResultRow[];

  /** Summary metrics computed by summary_evaluators */
  summaryResults?: object;
}

/**
 * Single row in evaluation results, containing the run, example, and scores
 */
interface ExperimentResultRow {
  /** The trace/run created by executing target on this example */
  run: Run;

  /** The dataset example that was evaluated */
  example: Example;

  /** Array of evaluation results from all evaluators */
  evaluation_results: EvaluationResult[];
}

Usage Examples:

// Access individual results
const results = await evaluate(myFunction, options);

for (const row of results.results) {
  console.log(`Example ID: ${row.example.id}`);
  console.log(`Run ID: ${row.run.id}`);
  console.log(`Input: ${JSON.stringify(row.example.inputs)}`);
  console.log(`Output: ${JSON.stringify(row.run.outputs)}`);

  for (const evalResult of row.evaluation_results) {
    console.log(`  ${evalResult.key}: ${evalResult.score}`);
  }
}

// Access summary metrics
if (results.summaryResults) {
  console.log("Summary:", results.summaryResults);
}

// Calculate custom aggregations
const avgScore = results.results
  .flatMap(r => r.evaluation_results)
  .filter(e => e.key === "accuracy")
  .reduce((sum, e) => sum + (e.score || 0), 0) / results.results.length;

Comparative Evaluation

Run comparative evaluation across multiple experiments to determine which performs better. Specialized comparative evaluators receive every experiment's run for a given example and analyze them side-by-side.

EvaluateComparative Function

Compares multiple experiments using comparative evaluators.

/**
 * Run comparative evaluation across multiple experiments
 * @param experiments - Array of experiment names or IDs to compare
 * @param options - Comparative evaluation configuration
 * @returns Promise resolving to comparison results
 */
function evaluateComparative(
  experiments: string[],
  options: EvaluateComparativeOptions
): Promise<ComparisonEvaluationResults>;

interface EvaluateComparativeOptions {
  /** Array of comparative evaluators that compare runs side-by-side */
  comparativeEvaluators: ComparativeEvaluator[];

  /** LangSmith client instance */
  client?: Client;

  /** Cache of evaluation results to avoid recomputing */
  evaluationResults?: KVMap;

  /** Prefix for the comparative experiment name */
  experimentPrefix?: string;

  /** Maximum number of concurrent evaluations */
  maxConcurrency?: number;

  /** Description of the comparison */
  description?: string;

  /** Metadata to attach to the comparative experiment */
  metadata?: KVMap;

  /** Whether to load existing results from experiments */
  load?: boolean;
}

type ComparativeEvaluator = (
  runs: Run[],
  example: Example
) => Promise<ComparisonEvaluationResult> | ComparisonEvaluationResult;

interface ComparisonEvaluationResult {
  /** Evaluation key/name */
  key: string;

  /** Per-run scores (parallel to runs array) */
  scores?: (number | boolean)[];

  /** Overall comparison result */
  value?: any;

  /** Comment explaining the comparison */
  comment?: string;
}

interface ComparisonEvaluationResults {
  /** Results for each example across all experiments */
  results: ComparisonResultRow[];

  /** Summary statistics for the comparison */
  summaryResults?: object;
}

Usage Examples:

import { evaluateComparative } from "langsmith/evaluation";

// Compare two experiments
const comparison = await evaluateComparative(
  ["experiment-gpt-4", "experiment-gpt-3.5"],
  {
    comparativeEvaluators: [
      async (runs, example) => {
        // Compare response quality
        const scores = runs.map(run => {
          const output = run.outputs?.answer || "";
          const expected = example.outputs?.answer || "";
          return output === expected ? 1 : 0;
        });

        return {
          key: "correctness",
          scores,
          value: scores[0] > scores[1] ? "A" : "B",
          comment: `Experiment A: ${scores[0]}, Experiment B: ${scores[1]}`
        };
      }
    ],
    description: "Compare GPT-4 vs GPT-3.5 accuracy"
  }
);

// Compare with multiple evaluators
const preferenceEvaluator: ComparativeEvaluator = async (runs, example) => {
  // Use an LLM judge to determine preference
  const prompt = `Which response is better?\nA: ${runs[0].outputs?.answer}\nB: ${runs[1].outputs?.answer}`;
  const judgment = await llmJudge(prompt);

  return {
    key: "preference",
    value: judgment.preference,
    comment: judgment.reasoning
  };
};

const latencyEvaluator: ComparativeEvaluator = (runs, example) => {
  const latencies = runs.map(run =>
    (run.end_time || 0) - (run.start_time || 0)
  );

  return {
    key: "latency_ms",
    scores: latencies,
    value: latencies[0] < latencies[1] ? "A" : "B",
    comment: `A: ${latencies[0]}ms, B: ${latencies[1]}ms`
  };
};

await evaluateComparative(
  ["exp-a", "exp-b", "exp-c"],
  {
    comparativeEvaluators: [preferenceEvaluator, latencyEvaluator],
    maxConcurrency: 10,
    metadata: { comparison_type: "model_selection" }
  }
);

// Load existing results for comparison
await evaluateComparative(
  ["historical-experiment-1", "new-experiment"],
  {
    comparativeEvaluators: [regressionDetector],
    load: true,
    description: "Regression testing against baseline"
  }
);

EvaluateComparativeOptions Interface

Configuration for comparative evaluation.

/**
 * Configuration options for comparative evaluation
 */
interface EvaluateComparativeOptions {
  /**
   * Array of evaluator functions that compare runs across experiments
   * Each evaluator receives all runs for a given example and returns comparison results
   */
  comparativeEvaluators: ComparativeEvaluator[];

  /**
   * LangSmith client instance for API access
   */
  client?: Client;

  /**
   * Optional cache of evaluation results keyed by experiment name
   * Used to avoid recomputing results that already exist
   */
  evaluationResults?: KVMap;

  /**
   * Prefix for the comparative experiment name
   */
  experimentPrefix?: string;

  /**
   * Maximum number of examples to evaluate concurrently
   */
  maxConcurrency?: number;

  /**
   * Description of what this comparison is testing
   */
  description?: string;

  /**
   * Metadata to attach to the comparative experiment
   */
  metadata?: KVMap;

  /**
   * Whether to load existing results from the experiments
   * If true, uses cached results; if false, may recompute
   */
  load?: boolean;
}
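
A short sketch combining these options, assuming the two experiment names below already exist; lengthPreference is a hypothetical comparative evaluator.

import { evaluateComparative } from "langsmith/evaluation";

// Hypothetical comparative evaluator: awards a point to the shortest answer(s)
const lengthPreference = (runs, example) => {
  const lengths = runs.map(run => String(run.outputs?.answer ?? "").length);
  return {
    key: "brevity",
    scores: lengths.map(len => (len === Math.min(...lengths) ? 1 : 0)),
    comment: `Answer lengths: ${lengths.join(", ")}`
  };
};

await evaluateComparative(["prompt-v1-experiment", "prompt-v2-experiment"], {
  comparativeEvaluators: [lengthPreference],
  experimentPrefix: "brevity-check",      // names the comparative experiment
  load: true,                             // reuse results already stored for these experiments
  metadata: { comparison_type: "prompt_tuning" }
});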

Custom Evaluators

Create custom evaluators using functions or classes. Evaluators score runs and return structured feedback that gets logged to LangSmith.

Evaluator Functions

Evaluators can be simple functions that take run and example parameters and return evaluation results.

Usage Examples:

import { evaluate } from "langsmith/evaluation";

// Simple correctness evaluator
const correctnessEvaluator = ({ run, example }) => {
  const predicted = run.outputs?.answer;
  const expected = example?.outputs?.answer;

  return {
    key: "correctness",
    score: predicted === expected ? 1 : 0
  };
};

// Evaluator with error handling
const safetyEvaluator = async ({ run, example }) => {
  const output = run.outputs?.text || "";

  // Check for unsafe content
  const hasUnsafeContent = await checkSafety(output);

  return {
    key: "safety",
    score: hasUnsafeContent ? 0 : 1,
    comment: hasUnsafeContent ? "Contains unsafe content" : "Safe"
  };
};

// Evaluator using run metadata
const latencyEvaluator = ({ run }) => {
  const latency = (run.end_time || 0) - (run.start_time || 0);
  const isAcceptable = latency < 1000; // 1 second threshold

  return {
    key: "latency",
    score: isAcceptable ? 1 : 0,
    value: latency,
    comment: `${latency}ms`
  };
};

// Evaluator with structured feedback
const qualityEvaluator = async ({ run, example }) => {
  const output = run.outputs?.answer || "";
  const input = run.inputs?.question || "";

  // Use LLM to judge quality
  const judgment = await llmJudge({
    input,
    output,
    criteria: "accuracy, completeness, clarity"
  });

  return {
    key: "quality",
    score: judgment.score,
    value: judgment.ratings,
    comment: judgment.reasoning,
    evaluatorInfo: {
      model: "gpt-4",
      prompt_version: "v2"
    }
  };
};

// Use evaluators
await evaluate(myApp, {
  data: "test-dataset",
  evaluators: [
    correctnessEvaluator,
    safetyEvaluator,
    latencyEvaluator,
    qualityEvaluator
  ]
});

StringEvaluator Class

Base class for evaluators that compare strings (predictions vs references).

/**
 * String-based evaluator for text comparison
 * Useful for evaluating text generation, summarization, etc.
 */
class StringEvaluator {
  /**
   * Evaluate string outputs against references
   * @param params - String evaluator parameters
   * @returns Evaluation result with score
   */
  evaluateStrings(params: StringEvaluatorParams): Promise<EvaluationResult>;
}

/**
 * Parameters for string evaluation
 */
interface StringEvaluatorParams {
  /** The predicted/generated string to evaluate */
  prediction: string;

  /** Optional reference/expected string to compare against */
  reference?: string;

  /** Optional input string that generated the prediction */
  input?: string;
}

Usage Examples:

import { StringEvaluator } from "langsmith/evaluation";

// Custom string evaluator implementation
class ExactMatchEvaluator extends StringEvaluator {
  async evaluateStrings(params: StringEvaluatorParams): Promise<EvaluationResult> {
    const matches = params.prediction.trim() === params.reference?.trim();

    return {
      key: "exact_match",
      score: matches ? 1 : 0
    };
  }
}

// String length evaluator
class LengthEvaluator extends StringEvaluator {
  constructor(private minLength: number, private maxLength: number) {
    super();
  }

  async evaluateStrings(params: StringEvaluatorParams): Promise<EvaluationResult> {
    const length = params.prediction.length;
    const inRange = length >= this.minLength && length <= this.maxLength;

    return {
      key: "length",
      score: inRange ? 1 : 0,
      value: length,
      comment: `Length: ${length} (expected ${this.minLength}-${this.maxLength})`
    };
  }
}

// Semantic similarity evaluator
class SemanticSimilarityEvaluator extends StringEvaluator {
  async evaluateStrings(params: StringEvaluatorParams): Promise<EvaluationResult> {
    // Use embeddings to compute similarity
    const similarity = await computeSimilarity(
      params.prediction,
      params.reference || ""
    );

    return {
      key: "semantic_similarity",
      score: similarity,
      comment: `${(similarity * 100).toFixed(1)}% similar`
    };
  }
}

// Use string evaluators
const exactMatch = new ExactMatchEvaluator();
const lengthCheck = new LengthEvaluator(50, 200);
const semanticCheck = new SemanticSimilarityEvaluator();

await evaluate(summarizer, {
  data: "summaries-dataset",
  evaluators: [exactMatch, lengthCheck, semanticCheck]
});

DynamicRunEvaluator Class

Dynamic evaluator wrapper for run-based evaluation with configurable behavior.

/**
 * Dynamic evaluator wrapper that adapts evaluation logic at runtime
 * Useful for creating evaluators with configurable behavior
 */
class DynamicRunEvaluator {
  /**
   * Evaluate a run with dynamic logic
   * @param run - The run to evaluate
   * @param example - Optional reference example
   * @returns Evaluation result
   */
  evaluateRun(run: Run, example?: Example): Promise<EvaluationResult>;
}

Usage Examples:

import { DynamicRunEvaluator } from "langsmith/evaluation";

// Create evaluator with configurable thresholds
class ConfigurableThresholdEvaluator extends DynamicRunEvaluator {
  constructor(
    private metric: string,
    private threshold: number,
    private comparison: "gt" | "lt" | "eq"
  ) {
    super();
  }

  async evaluateRun(run: Run, example?: Example): Promise<EvaluationResult> {
    const value = run.outputs?.[this.metric];

    let passed = false;
    if (this.comparison === "gt") passed = value > this.threshold;
    else if (this.comparison === "lt") passed = value < this.threshold;
    else passed = value === this.threshold;

    return {
      key: this.metric,
      score: passed ? 1 : 0,
      value,
      comment: `${value} ${this.comparison} ${this.threshold}`
    };
  }
}

// Evaluator that adapts based on input type
class AdaptiveEvaluator extends DynamicRunEvaluator {
  async evaluateRun(run: Run, example?: Example): Promise<EvaluationResult> {
    const inputType = run.inputs?.type;

    // Different evaluation logic based on input type
    if (inputType === "question") {
      return this.evaluateQuestion(run, example);
    } else if (inputType === "summary") {
      return this.evaluateSummary(run, example);
    } else {
      return this.evaluateGeneral(run, example);
    }
  }

  private async evaluateQuestion(run: Run, example?: Example) {
    // Question-specific evaluation
    return { key: "question_quality", score: 1 };
  }

  private async evaluateSummary(run: Run, example?: Example) {
    // Summary-specific evaluation
    return { key: "summary_quality", score: 1 };
  }

  private async evaluateGeneral(run: Run, example?: Example) {
    // General evaluation
    return { key: "general_quality", score: 1 };
  }
}

// Use dynamic evaluators
await evaluate(myApp, {
  data: "dataset",
  evaluators: [
    new ConfigurableThresholdEvaluator("confidence", 0.8, "gt"),
    new ConfigurableThresholdEvaluator("latency", 1000, "lt"),
    new AdaptiveEvaluator()
  ]
});

Evaluation Interfaces

Type-safe interfaces for evaluation options, results, and evaluator implementations.

EvaluationResult Interface

Structure returned by evaluators describing the evaluation outcome.

/**
 * Result from a single evaluator on a single run
 * Contains score, metadata, and optional corrections
 */
interface EvaluationResult {
  /**
   * Evaluation key/name (e.g., "accuracy", "hallucination", "toxicity")
   * Used to identify this evaluation in results
   */
  key?: string;

  /**
   * Numeric or boolean score
   * - Number: typically 0-1 range, or any numeric metric
   * - Boolean: true for pass, false for fail
   */
  score?: number | boolean;

  /**
   * Additional value of any type
   * Can be used for detailed breakdowns, classifications, etc.
   */
  value?: string | number | boolean | object;

  /**
   * Human-readable comment explaining the evaluation
   */
  comment?: string;

  /**
   * Optional correction data
   * Can contain the corrected output or suggestions
   */
  correction?: object;

  /**
   * Metadata about the evaluator itself
   * E.g., model version, prompt version, configuration
   */
  evaluatorInfo?: object;

  /**
   * Source run ID if this evaluation came from another traced function
   */
  sourceRunId?: string;
}

Usage Examples:

// Simple pass/fail evaluation
const result1: EvaluationResult = {
  key: "correctness",
  score: 1,
  comment: "Answer matches expected output"
};

// Numeric score with details
const result2: EvaluationResult = {
  key: "relevance",
  score: 0.85,
  value: { precision: 0.9, recall: 0.8 },
  comment: "High relevance with good balance"
};

// Evaluation with correction
const result3: EvaluationResult = {
  key: "grammar",
  score: 0,
  comment: "Multiple grammar errors found",
  correction: {
    correctedText: "The corrected version...",
    errors: ["subject-verb agreement", "missing comma"]
  }
};

// Evaluation with evaluator metadata
const result4: EvaluationResult = {
  key: "quality",
  score: 0.92,
  evaluatorInfo: {
    model: "gpt-4",
    promptVersion: "v3.2",
    temperature: 0.1
  },
  sourceRunId: "run-abc123"
};

// Boolean evaluation
const result5: EvaluationResult = {
  key: "has_citations",
  score: true,
  value: ["source1", "source2", "source3"],
  comment: "Found 3 citations"
};

GradingFunctionParams Interface

Parameters passed to grading functions for simple evaluations.

/**
 * Parameters for grading functions
 * Simplified interface for evaluators that don't need full run context
 */
interface GradingFunctionParams {
  /** Input data sent to the function being evaluated */
  input?: any;

  /** Output/prediction from the function being evaluated */
  prediction?: any;

  /** Expected answer from the dataset example */
  answer?: any;

  /** Reference data from the dataset example (alias for answer) */
  reference?: any;
}

Usage Examples:

// Simple grading function
function accuracyGrader(params: GradingFunctionParams): EvaluationResult {
  const correct = params.prediction === params.answer;
  return { key: "accuracy", score: correct ? 1 : 0 };
}

// Grading function with partial credit
function partialMatchGrader(params: GradingFunctionParams): EvaluationResult {
  const pred = String(params.prediction).toLowerCase();
  const ans = String(params.answer).toLowerCase();

  if (pred === ans) {
    return { key: "match", score: 1, comment: "Exact match" };
  } else if (pred.includes(ans) || ans.includes(pred)) {
    return { key: "match", score: 0.5, comment: "Partial match" };
  } else {
    return { key: "match", score: 0, comment: "No match" };
  }
}

// Grading function using input context
function contextAwareGrader(params: GradingFunctionParams): EvaluationResult {
  const inputType = params.input?.type;
  const prediction = params.prediction;

  // Different grading logic based on input type
  let score = 0;
  if (inputType === "classification") {
    score = prediction === params.answer ? 1 : 0;
  } else if (inputType === "generation") {
    score = computeSimilarity(prediction, params.answer);
  }

  return { key: "score", score };
}

Evaluator Types

Type definitions for different evaluator patterns.

/**
 * Evaluator type - can be a function or RunEvaluator instance
 */
type EvaluatorT = RunEvaluatorLike | RunEvaluator;

/**
 * Run evaluator function type
 * Takes a run and optional example, returns evaluation result
 */
type RunEvaluatorLike = (
  run: Run,
  example?: Example
) => EvaluationResult | Promise<EvaluationResult>;

/**
 * Summary evaluator type
 * Runs once on all results to compute aggregate metrics
 */
type SummaryEvaluatorT = (
  results: EvaluationResult[]
) => EvaluationResult | Promise<EvaluationResult>;

/**
 * Comparative evaluator type
 * Compares multiple runs for the same example across experiments
 */
type ComparativeEvaluator = (
  runs: Run[],
  example: Example
) => Promise<ComparisonEvaluationResult> | ComparisonEvaluationResult;

Usage Examples:

// RunEvaluatorLike - simple function
const simpleEvaluator: RunEvaluatorLike = (run, example) => ({
  key: "simple",
  score: 1
});

// RunEvaluatorLike - async function
const asyncEvaluator: RunEvaluatorLike = async (run, example) => {
  const result = await externalValidation(run.outputs);
  return { key: "validation", score: result.passed ? 1 : 0 };
};

// SummaryEvaluatorT - aggregate metrics
const averageScoreEvaluator: SummaryEvaluatorT = (results) => {
  const scores = results
    .filter(r => typeof r.score === "number")
    .map(r => r.score as number);

  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;

  return {
    key: "average_score",
    score: avg,
    comment: `Average across ${scores.length} results`
  };
};

// ComparativeEvaluator - compare experiments
const winRateEvaluator: ComparativeEvaluator = (runs, example) => {
  const scores = runs.map(run => computeQuality(run.outputs));
  const maxScore = Math.max(...scores);
  const winnerIdx = scores.indexOf(maxScore);

  return {
    key: "winner",
    scores,
    value: String.fromCharCode(65 + winnerIdx), // A, B, C, etc.
    comment: `Experiment ${winnerIdx + 1} performed best`
  };
};

// Use all evaluator types
await evaluate(myApp, {
  data: "dataset",
  evaluators: [simpleEvaluator, asyncEvaluator],
  summary_evaluators: [averageScoreEvaluator]
});

await evaluateComparative(
  ["exp-a", "exp-b"],
  { comparativeEvaluators: [winRateEvaluator] }
);

Category Class

Categorical classification helper for evaluation results.

/**
 * Category class for categorical classifications in evaluation
 * Used to represent classification results with confidence scores
 */
class Category {
  /** The category label */
  readonly category: string;

  /** Confidence score for this category (0-1) */
  readonly confidence?: number;

  constructor(category: string, confidence?: number);
}

Usage Examples:

import { Category, evaluate } from "langsmith/evaluation";

// Classification evaluator using Category
const sentimentEvaluator = ({ run, example }) => {
  const text = run.outputs?.text || "";

  // Classify sentiment
  const sentiment = classifySentiment(text);

  return {
    key: "sentiment",
    value: new Category(sentiment.label, sentiment.confidence),
    score: sentiment.label === example?.outputs?.sentiment ? 1 : 0,
    comment: `Predicted: ${sentiment.label} (${sentiment.confidence.toFixed(2)})`
  };
};

// Multi-class classification
const topicEvaluator = async ({ run, example }) => {
  const text = run.outputs?.text || "";
  const predictions = await classifyTopics(text);

  // Return top prediction as Category
  const topPrediction = predictions[0];
  const category = new Category(topPrediction.topic, topPrediction.score);

  return {
    key: "topic",
    value: category,
    score: topPrediction.topic === example?.outputs?.topic ? 1 : 0,
    comment: `Top: ${topPrediction.topic} (${topPrediction.score.toFixed(2)})`
  };
};

// Use in evaluation
await evaluate(classifier, {
  data: "classification-dataset",
  evaluators: [sentimentEvaluator, topicEvaluator]
});

LangChain Evaluators (Deprecated)

GetLangchainStringEvaluator (Deprecated)

Note: This function is deprecated. Use the evaluate() function with custom evaluators instead.

Load a LangChain string evaluator for use in LangSmith evaluations.

/**
 * Load a LangChain string evaluator (DEPRECATED)
 * @deprecated Use evaluate() with custom evaluators instead
 * @param type - Evaluator type ("criteria" or "labeled_criteria")
 * @param options - LangChain evaluator options
 * @returns Promise resolving to evaluator function
 */
function getLangchainStringEvaluator(
  type: "criteria" | "labeled_criteria",
  options?: {
    criteria?: string | Record<string, string>;
    llm?: any;
    formatEvaluatorInputs?: (run: Run, example?: Example) => any;
  }
): Promise<(run: Run, example?: Example) => Promise<EvaluationResult>>;

Usage Example (Deprecated):

import { getLangchainStringEvaluator } from "langsmith/evaluation/langchain";

// Load criteria-based evaluator
const evaluator = await getLangchainStringEvaluator("criteria", {
  criteria: "helpfulness",
  formatEvaluatorInputs: (run, example) => ({
    input: run.inputs.question,
    prediction: run.outputs.answer,
  }),
});

// Use in evaluation (consider using evaluate() directly instead)
const result = await evaluator(run, example);
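
As a migration path, the same criteria-style check can be written as a plain custom evaluator and passed to evaluate(). This is a sketch only: llmJudge stands in for whatever LLM call you use for judging, and the dataset name is a placeholder.

import { evaluate } from "langsmith/evaluation";

// Replacement for the deprecated criteria evaluator: a plain function evaluator
const helpfulnessEvaluator = async ({ run, example }) => {
  const judgment = await llmJudge({
    input: run.inputs?.question,
    prediction: run.outputs?.answer,
    criteria: "helpfulness"
  });

  return {
    key: "helpfulness",
    score: judgment.score,
    comment: judgment.reasoning
  };
};

await evaluate(myLLMApp, {
  data: "helpfulness-dataset",
  evaluators: [helpfulnessEvaluator]
});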

Advanced Patterns

Custom Dataset Iteration

Evaluate using custom dataset iteration for large datasets or streaming scenarios.

import { evaluate } from "langsmith/evaluation";

// Async generator for large datasets
async function* generateExamples() {
  for (let i = 0; i < 10000; i++) {
    yield {
      inputs: { id: i, question: `Question ${i}` },
      outputs: { answer: `Answer ${i}` }
    };
  }
}

await evaluate(myApp, {
  data: generateExamples(),
  evaluators: [accuracyEvaluator],
  max_concurrency: 20,
  experiment_name: "large-scale-eval"
});

Chained Evaluators

Create evaluators that depend on other evaluators' results.

// First evaluator checks correctness
const correctnessEvaluator = ({ run, example }) => {
  const correct = run.outputs?.answer === example?.outputs?.answer;
  return {
    key: "correctness",
    score: correct ? 1 : 0
  };
};

// Second evaluator only runs if first passed
const detailedEvaluator = async ({ run, example }) => {
  // Check if it was correct first
  const correctResult = correctnessEvaluator({ run, example });

  if (correctResult.score === 0) {
    return {
      key: "detailed_analysis",
      score: 0,
      comment: "Skipped - incorrect answer"
    };
  }

  // Do detailed analysis only on correct answers
  const quality = await analyzeQuality(run.outputs?.answer);
  return {
    key: "detailed_analysis",
    score: quality,
    comment: "Passed correctness, analyzed quality"
  };
};

LLM-as-Judge Evaluators

Use LLMs to evaluate outputs based on complex criteria.

const llmJudgeEvaluator = async ({ run, example }) => {
  const output = run.outputs?.answer || "";
  const input = run.inputs?.question || "";
  const expected = example?.outputs?.answer || "";

  const judgmentPrompt = `
You are an expert evaluator. Assess this response:

Question: ${input}
Expected Answer: ${expected}
Actual Answer: ${output}

Rate the response on:
1. Accuracy (0-1)
2. Completeness (0-1)
3. Clarity (0-1)

Return JSON: { "accuracy": 0.0-1.0, "completeness": 0.0-1.0, "clarity": 0.0-1.0, "reasoning": "..." }
`;

  const judgment = await callLLM(judgmentPrompt);
  const parsed = JSON.parse(judgment);

  const overallScore = (
    parsed.accuracy +
    parsed.completeness +
    parsed.clarity
  ) / 3;

  return {
    key: "llm_judge",
    score: overallScore,
    value: parsed,
    comment: parsed.reasoning,
    evaluatorInfo: {
      model: "gpt-4",
      type: "llm-as-judge"
    }
  };
};

Cost-Aware Evaluation

Track costs during evaluation for budget monitoring.

const costTrackingEvaluator = ({ run }) => {
  // Extract token usage from run metadata
  const inputTokens = run.extra?.usage?.input_tokens || 0;
  const outputTokens = run.extra?.usage?.output_tokens || 0;

  // Calculate cost (example rates)
  const inputCost = inputTokens * 0.00003;  // $0.03 per 1K tokens
  const outputCost = outputTokens * 0.00006; // $0.06 per 1K tokens
  const totalCost = inputCost + outputCost;

  return {
    key: "cost",
    score: totalCost < 0.10 ? 1 : 0, // Flag runs over $0.10
    value: {
      totalCost,
      inputTokens,
      outputTokens
    },
    comment: `$${totalCost.toFixed(4)} (${inputTokens}+${outputTokens} tokens)`
  };
};

// Summary evaluator for total cost
const totalCostEvaluator = (results) => {
  const totalCost = results
    .filter(r => r.key === "cost")
    .reduce((sum, r) => sum + ((r.value as any)?.totalCost || 0), 0);

  return {
    key: "total_cost",
    value: totalCost,
    comment: `Total evaluation cost: $${totalCost.toFixed(2)}`
  };
};

await evaluate(myApp, {
  data: "dataset",
  evaluators: [costTrackingEvaluator],
  summary_evaluators: [totalCostEvaluator]
});

Multi-Turn Conversation Evaluation

Evaluate conversational AI with context across multiple turns.

const conversationEvaluator = async ({ run, example }) => {
  const turns = run.outputs?.conversation || [];

  // Evaluate coherence across turns
  let coherenceScore = 1.0;
  for (let i = 1; i < turns.length; i++) {
    const similarity = await computeCoherence(turns[i-1], turns[i]);
    coherenceScore = Math.min(coherenceScore, similarity);
  }

  // Evaluate goal completion
  const goalAchieved = await checkGoalCompletion(
    turns,
    example?.outputs?.goal
  );

  return {
    key: "conversation_quality",
    score: (coherenceScore + (goalAchieved ? 1 : 0)) / 2,
    value: {
      coherence: coherenceScore,
      goalAchieved,
      numTurns: turns.length
    },
    comment: `${turns.length} turns, coherence: ${coherenceScore.toFixed(2)}`
  };
};
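
Like the other evaluators in this section, this one plugs straight into evaluate(); myChatbot and the dataset name below are placeholders.

await evaluate(myChatbot, {
  data: "multi-turn-conversations",
  evaluators: [conversationEvaluator],
  experiment_name: "conversation-quality-v1"
});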