
tessl/npm-langfuse--client

Langfuse API client for universal JavaScript environments providing observability, prompt management, datasets, experiments, and scoring capabilities

docs/datasets.md

Dataset Operations

The Dataset Operations system provides comprehensive capabilities for working with evaluation datasets, linking them to traces and observations, and running experiments. Datasets are collections of input-output pairs used for systematic evaluation of LLM applications.

Capabilities

Get Dataset

Retrieve a dataset by name with all its items, link functions, and experiment functionality.

/**
 * Retrieves a dataset by name with all its items and experiment functionality
 *
 * Fetches a dataset and all its associated items with automatic pagination handling.
 * The returned dataset includes enhanced functionality for linking items to traces
 * and running experiments directly on the dataset.
 *
 * @param name - The name of the dataset to retrieve
 * @param options - Optional configuration for data fetching
 * @returns Promise resolving to enhanced dataset with items and experiment capabilities
 */
async get(
  name: string,
  options?: {
    /** Number of items to fetch per page (default: 50) */
    fetchItemsPageSize?: number;
  }
): Promise<FetchedDataset>;

Usage Examples:

import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Basic dataset retrieval
const dataset = await langfuse.dataset.get("my-evaluation-dataset");

console.log(`Dataset: ${dataset.name}`);
console.log(`Description: ${dataset.description}`);
console.log(`Items: ${dataset.items.length}`);
console.log(`Metadata:`, dataset.metadata);

// Access dataset items
for (const item of dataset.items) {
  console.log('Input:', item.input);
  console.log('Expected Output:', item.expectedOutput);
  console.log('Metadata:', item.metadata);
}

Handling Large Datasets:

// For large datasets, tune the page size to balance the number of API requests against per-request payload size
const largeDataset = await langfuse.dataset.get(
  "large-benchmark-dataset",
  { fetchItemsPageSize: 100 }
);

console.log(`Loaded ${largeDataset.items.length} items`);

// Process items in batches
const batchSize = 10;
for (let i = 0; i < largeDataset.items.length; i += batchSize) {
  const batch = largeDataset.items.slice(i, i + batchSize);
  // Process batch...
}
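The batch loop above can be factored into a small generic helper; `chunk` here is a hypothetical utility for illustration, not part of the Langfuse API:

```typescript
// Hypothetical helper: split an array into fixed-size batches,
// matching the slicing loop shown above.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. chunk(largeDataset.items, 10) yields arrays of at most 10 items each
```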

Accessing Dataset Properties:

const dataset = await langfuse.dataset.get("qa-dataset");

// Dataset metadata
console.log(dataset.id);          // Dataset ID
console.log(dataset.name);        // Dataset name
console.log(dataset.description); // Description
console.log(dataset.metadata);    // Custom metadata
console.log(dataset.projectId);   // Project ID
console.log(dataset.createdAt);   // Creation timestamp
console.log(dataset.updatedAt);   // Last update timestamp

// Item properties
const item = dataset.items[0];
console.log(item.id);                  // Item ID
console.log(item.datasetId);           // Parent dataset ID
console.log(item.input);               // Input data
console.log(item.expectedOutput);      // Expected output
console.log(item.metadata);            // Item metadata
console.log(item.sourceTraceId);       // Source trace (if any)
console.log(item.sourceObservationId); // Source observation (if any)
console.log(item.status);              // Status (ACTIVE or ARCHIVED)

Types

FetchedDataset

Enhanced dataset object with additional methods for linking and experiments.

/**
 * Enhanced dataset with linking and experiment functionality
 *
 * Extends the base Dataset type with:
 * - Array of items with link functions for connecting to traces
 * - runExperiment method for executing experiments directly on the dataset
 *
 * @public
 */
type FetchedDataset = Dataset & {
  /** Dataset items with link functionality for connecting to traces */
  items: (DatasetItem & { link: LinkDatasetItemFunction })[];

  /** Function to run experiments directly on this dataset */
  runExperiment: RunExperimentOnDataset;
};

Properties from Dataset:

interface Dataset {
  /** Unique identifier for the dataset */
  id: string;

  /** Human-readable name for the dataset */
  name: string;

  /** Optional description explaining the dataset's purpose */
  description?: string | null;

  /** Custom metadata attached to the dataset */
  metadata?: Record<string, any> | null;

  /** Project ID this dataset belongs to */
  projectId: string;

  /** Timestamp when the dataset was created */
  createdAt: string;

  /** Timestamp when the dataset was last updated */
  updatedAt: string;
}

DatasetItem

Individual item within a dataset containing input, expected output, and metadata.

/**
 * Dataset item with input/output pair for evaluation
 *
 * Represents a single test case within a dataset. Each item can contain
 * any type of input and expected output, along with optional metadata
 * and linkage to source traces/observations.
 *
 * @public
 */
interface DatasetItem {
  /** Unique identifier for the dataset item */
  id: string;

  /** ID of the parent dataset */
  datasetId: string;

  /** Name of the parent dataset */
  datasetName: string;

  /** Input data (can be any type: string, object, array, etc.) */
  input?: any;

  /** Expected output for evaluation (can be any type) */
  expectedOutput?: any;

  /** Custom metadata for this item */
  metadata?: Record<string, any> | null;

  /** ID of the trace this item was created from (if applicable) */
  sourceTraceId?: string | null;

  /** ID of the observation this item was created from (if applicable) */
  sourceObservationId?: string | null;

  /** Status of the item (ACTIVE or ARCHIVED) */
  status: "ACTIVE" | "ARCHIVED";

  /** Timestamp when the item was created */
  createdAt: string;

  /** Timestamp when the item was last updated */
  updatedAt: string;
}

LinkDatasetItemFunction

Function type for linking dataset items to OpenTelemetry spans for tracking experiments.

/**
 * Links dataset items to OpenTelemetry spans
 *
 * Creates a connection between a dataset item and a trace/observation,
 * enabling tracking of which dataset items were used in which experiments.
 * This is essential for creating dataset runs and tracking experiment lineage.
 *
 * @param obj - Object containing the OpenTelemetry span
 * @param obj.otelSpan - The OpenTelemetry span from a Langfuse observation
 * @param runName - Name of the experiment run for grouping related items
 * @param runArgs - Optional configuration for the dataset run
 * @returns Promise resolving to the created dataset run item
 *
 * @public
 */
type LinkDatasetItemFunction = (
  obj: { otelSpan: Span },
  runName: string,
  runArgs?: {
    /** Description of the dataset run */
    description?: string;

    /** Additional metadata for the dataset run */
    metadata?: any;
  }
) => Promise<DatasetRunItem>;

DatasetRunItem

Result of linking a dataset item to a trace execution.

/**
 * Linked dataset run item
 *
 * Represents the connection between a dataset item and a specific
 * trace execution within a dataset run. Used for tracking experiment results.
 *
 * @public
 */
interface DatasetRunItem {
  /** Unique identifier for the run item */
  id: string;

  /** ID of the dataset run this item belongs to */
  datasetRunId: string;

  /** Name of the dataset run this item belongs to */
  datasetRunName: string;

  /** ID of the dataset item */
  datasetItemId: string;

  /** ID of the trace this run item is linked to */
  traceId: string;

  /** Optional ID of the observation this run item is linked to */
  observationId?: string;

  /** Timestamp when the run item was created */
  createdAt: string;

  /** Timestamp when the run item was last updated */
  updatedAt: string;
}

RunExperimentOnDataset

Function type for running experiments directly on fetched datasets.

/**
 * Runs experiments on Langfuse datasets
 *
 * This function type is attached to fetched datasets to enable convenient
 * experiment execution. The data parameter is automatically provided from
 * the dataset items.
 *
 * @param params - Experiment parameters (excluding data)
 * @returns Promise resolving to experiment results
 *
 * @public
 */
type RunExperimentOnDataset = (
  params: Omit<ExperimentParams<any, any, Record<string, any>>, "data">
) => Promise<ExperimentResult<any, any, Record<string, any>>>;

Usage Patterns

Basic Dataset Retrieval and Exploration

import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Fetch dataset
const dataset = await langfuse.dataset.get("customer-support-qa");

console.log(`Dataset: ${dataset.name}`);
console.log(`Total items: ${dataset.items.length}`);

// Explore items
dataset.items.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('  Input:', item.input);
  console.log('  Expected:', item.expectedOutput);

  if (item.metadata) {
    console.log('  Metadata:', item.metadata);
  }
});

Linking Dataset Items to Traces

Link dataset items to trace executions to create dataset runs and track experiment results.

import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("qa-benchmark");
const runName = "gpt-4-evaluation-v1";

// Process each item and link to traces
for (const item of dataset.items) {
  // Create a trace for this execution
  const span = startObservation("qa-task", {
    input: item.input,
    metadata: { datasetItemId: item.id }
  });

  try {
    // Execute your task
    const output = await runYourTask(item.input);

    // Update trace with output
    span.update({ output });

    // Link dataset item to this trace
    await item.link({ otelSpan: span.otelSpan }, runName);

  } catch (error) {
    // Handle errors
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    // Still link the item (to track failures)
    await item.link({ otelSpan: span.otelSpan }, runName);
  } finally {
    span.end();
  }
}

console.log(`Completed dataset run: ${runName}`);

Linking with Run Metadata

Add descriptions and metadata to dataset runs for better organization.

const dataset = await langfuse.dataset.get("model-comparison");
const runName = "claude-3-opus-eval";

for (const item of dataset.items) {
  const span = startObservation("evaluation-task", {
    input: item.input
  });

  const output = await evaluateWithClaude(item.input);
  span.update({ output });
  span.end();

  // Link with descriptive metadata
  await item.link({ otelSpan: span.otelSpan }, runName, {
    description: "Claude 3 Opus evaluation on reasoning tasks",
    metadata: {
      modelVersion: "claude-3-opus-20240229",
      temperature: 0.7,
      maxTokens: 1000,
      timestamp: new Date().toISOString(),
      experimentGroup: "reasoning-tasks"
    }
  });
}

Linking Nested Observations

Link dataset items to specific observations within a trace hierarchy.

const dataset = await langfuse.dataset.get("translation-dataset");
const runName = "translation-pipeline-v2";

for (const item of dataset.items) {
  // Create parent trace
  const trace = startObservation("translation-pipeline", {
    input: item.input
  });

  // Create preprocessing observation
  const preprocessor = trace.startObservation("preprocessing", {
    input: item.input
  });
  const preprocessed = await preprocess(item.input);
  preprocessor.update({ output: preprocessed });
  preprocessor.end();

  // Create translation observation (the main task)
  const translator = trace.startObservation("translation", {
    input: preprocessed,
    model: "gpt-4"
  }, { asType: "generation" });

  const translated = await translate(preprocessed);
  translator.update({ output: translated });
  translator.end();

  // Create postprocessing observation
  const postprocessor = trace.startObservation("postprocessing", {
    input: translated
  });
  const final = await postprocess(translated);
  postprocessor.update({ output: final });
  postprocessor.end();

  trace.update({ output: final });
  trace.end();

  // Link to the specific translation observation
  await item.link({ otelSpan: translator.otelSpan }, runName, {
    description: "Translation quality evaluation",
    metadata: { pipeline: "v2", stage: "translation" }
  });
}

Running Experiments on Datasets

Execute experiments directly on datasets with automatic tracing and evaluation.

import { LangfuseClient } from '@langfuse/client';
import { observeOpenAI } from '@langfuse/openai';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();

// Fetch dataset
const dataset = await langfuse.dataset.get("capital-cities");

// Define task
const task = async ({ input }: { input: string }) => {
  const client = observeOpenAI(new OpenAI());

  const response = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "user", content: `What is the capital of ${input}?` }
    ]
  });

  return response.choices[0].message.content;
};

// Define evaluator
const exactMatchEvaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Run experiment
const result = await dataset.runExperiment({
  name: "Capital Cities Evaluation",
  runName: "gpt-4-baseline",
  description: "Baseline evaluation with GPT-4",
  task,
  evaluators: [exactMatchEvaluator],
  maxConcurrency: 5
});

// View results
console.log(await result.format());
console.log(`Dataset run URL: ${result.datasetRunUrl}`);

Advanced Experiment with Multiple Evaluators

import { LangfuseClient, Evaluator, createEvaluatorFromAutoevals } from '@langfuse/client';
import { Levenshtein, Factuality } from 'autoevals';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("qa-dataset");

// Custom evaluator using OpenAI
const semanticSimilarityEvaluator: Evaluator = async ({
  output,
  expectedOutput
}) => {
  const openai = new OpenAI();

  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Rate the semantic similarity between these two answers on a scale of 0 to 1:

Answer 1: ${output}
Answer 2: ${expectedOutput}

Respond with just a number between 0 and 1.`
      }
    ]
  });

  const score = parseFloat(response.choices[0].message.content || "0");

  return {
    name: "semantic_similarity",
    value: score,
    comment: `Comparison between output and expected output`
  };
};

// Run experiment with multiple evaluators
const result = await dataset.runExperiment({
  name: "Multi-Evaluator Experiment",
  runName: "comprehensive-eval-v1",
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(Factuality),

    // Custom evaluator
    semanticSimilarityEvaluator
  ]
});

// Analyze results
console.log(await result.format({ includeItemResults: true }));

// Access individual scores
result.itemResults.forEach((item, index) => {
  console.log(`\nItem ${index + 1}:`);
  console.log('Input:', item.input);
  console.log('Output:', item.output);
  console.log('Expected:', item.expectedOutput);
  console.log('Evaluations:');

  item.evaluations.forEach(evaluation => {
    console.log(`  ${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`    Comment: ${evaluation.comment}`);
    }
  });
});

Experiment with Run-Level Evaluators

Use run-level evaluators to compute aggregate statistics across all items.

import { LangfuseClient, RunEvaluator } from '@langfuse/client';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("benchmark-dataset");

// Define a run-level evaluator for computing averages
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(result => result.evaluations)
    .filter(e => e.name === "accuracy")
    .map(e => e.value as number);

  const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;

  return {
    name: "average_accuracy",
    value: average,
    comment: `Average accuracy across ${scores.length} items`
  };
};

// Item-level evaluator producing the "accuracy" scores aggregated above
const accuracyEvaluator = async ({ output, expectedOutput }) => ({
  name: "accuracy",
  value: output === expectedOutput ? 1 : 0
});

// Run experiment
const result = await dataset.runExperiment({
  name: "Accuracy Benchmark",
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageScoreEvaluator]
});

// Check aggregate results
console.log('Run-level evaluations:');
result.runEvaluations.forEach(evaluation => {
  console.log(`${evaluation.name}: ${evaluation.value}`);
  if (evaluation.comment) {
    console.log(`  ${evaluation.comment}`);
  }
});

Comparing Multiple Models

Run experiments on the same dataset with different models for comparison.

import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("reasoning-tasks");

const openai = new OpenAI();

// Define models to compare
const models = [
  "gpt-4",
  "gpt-3.5-turbo",
  "gpt-4-turbo-preview"
];

const evaluator = async ({ output, expectedOutput }) => ({
  name: "correctness",
  value: evaluateCorrectness(output, expectedOutput)
});

// Run experiment for each model
const results = [];

for (const model of models) {
  const result = await dataset.runExperiment({
    name: "Model Comparison",
    runName: `${model}-evaluation`,
    description: `Evaluation with ${model}`,
    metadata: { model },
    task: async ({ input }) => {
      const response = await openai.chat.completions.create({
        model,
        messages: [{ role: "user", content: input }]
      });
      return response.choices[0].message.content;
    },
    evaluators: [evaluator],
    maxConcurrency: 3
  });

  results.push({ model, result });
  console.log(`Completed: ${model}`);
  console.log(await result.format());
}

// Compare results
console.log("\n=== Model Comparison Summary ===");
results.forEach(({ model, result }) => {
  const avgScore = result.itemResults
    .flatMap(r => r.evaluations)
    .reduce((sum, e) => sum + (e.value as number), 0) / result.itemResults.length;

  console.log(`${model}: ${avgScore.toFixed(3)}`);
  console.log(`  URL: ${result.datasetRunUrl}`);
});

Incremental Dataset Processing

Process datasets incrementally with checkpointing for long-running experiments.

import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import * as fs from 'fs';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("large-dataset");
const runName = "incremental-processing-v1";

// Load checkpoint if exists
const checkpointFile = './checkpoint.json';
let processedIds = new Set<string>();

if (fs.existsSync(checkpointFile)) {
  const checkpoint = JSON.parse(fs.readFileSync(checkpointFile, 'utf-8'));
  processedIds = new Set(checkpoint.processedIds);
  console.log(`Resuming from checkpoint: ${processedIds.size} items processed`);
}

// Process items
for (const [index, item] of dataset.items.entries()) {
  // Skip already processed items
  if (processedIds.has(item.id)) {
    continue;
  }

  console.log(`Processing item ${index + 1}/${dataset.items.length}`);

  try {
    const span = startObservation("processing-task", {
      input: item.input,
      metadata: { itemId: item.id }
    });

    const output = await processItem(item.input);
    span.update({ output });
    span.end();

    await item.link({ otelSpan: span.otelSpan }, runName, {
      metadata: { batchIndex: Math.floor(index / 100) }
    });

    // Update checkpoint
    processedIds.add(item.id);
    fs.writeFileSync(
      checkpointFile,
      JSON.stringify({ processedIds: Array.from(processedIds) })
    );

  } catch (error) {
    console.error(`Error processing item ${item.id}:`, error);
    // Continue with next item
  }
}

console.log(`Completed processing ${processedIds.size} items`);

// Clean up checkpoint
if (fs.existsSync(checkpointFile)) {
  fs.unlinkSync(checkpointFile);
}

Parallel Processing with Concurrency Control

import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import pLimit from 'p-limit';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("parallel-dataset");
const runName = "parallel-processing-v1";

// Limit concurrent operations
const limit = pLimit(10);

// Process items in parallel with concurrency limit
const tasks = dataset.items.map(item =>
  limit(async () => {
    const span = startObservation("parallel-task", {
      input: item.input
    });

    try {
      const output = await processItem(item.input);
      span.update({ output });

      await item.link({ otelSpan: span.otelSpan }, runName);

      return { success: true, itemId: item.id };
    } catch (error) {
      span.update({
        output: { error: String(error) },
        level: "ERROR"
      });

      await item.link({ otelSpan: span.otelSpan }, runName);

      return { success: false, itemId: item.id, error };
    } finally {
      span.end();
    }
  })
);

// Wait for all tasks to complete
const results = await Promise.all(tasks);

// Summarize results
const successful = results.filter(r => r.success).length;
const failed = results.filter(r => !r.success).length;

console.log(`Completed: ${successful} successful, ${failed} failed`);

Integration with LangChain

Use datasets with LangChain applications for systematic evaluation.

import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("langchain-eval");

// Create LangChain components
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to French: {text}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);

const runName = "langchain-translation-eval";

// Process each dataset item
for (const item of dataset.items) {
  // Create trace for this execution
  const span = startObservation("langchain-execution", {
    input: { text: item.input },
    metadata: { chainType: "translation" }
  });

  try {
    // Execute chain
    const result = await chain.invoke({ text: item.input });

    // Update trace with output
    span.update({ output: result });

    // Link dataset item
    await item.link({ otelSpan: span.otelSpan }, runName, {
      description: "LangChain translation evaluation"
    });

    // Score the result
    langfuse.score.observation(span, {
      name: "translation_quality",
      value: computeQuality(result, item.expectedOutput)
    });

  } catch (error) {
    span.update({
      output: { error: String(error) },
      level: "ERROR"
    });

    await item.link({ otelSpan: span.otelSpan }, runName);
  }

  span.end();
}

// Flush scores
await langfuse.flush();

Using Dataset Experiments with Custom Data Structures

import { LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset with structured inputs
const dataset = await langfuse.dataset.get("structured-qa");

// Task that handles structured input
const task = async ({ input }) => {
  // Input is an object with specific structure
  const { question, context } = input;

  const response = await callLLM({
    systemPrompt: "Answer questions based on the context.",
    userPrompt: `Context: ${context}\n\nQuestion: ${question}`
  });

  return response;
};

// Evaluator that handles structured output
const evaluator = async ({ input, output, expectedOutput }) => {
  const { question } = input;

  // Complex evaluation logic
  const scores = {
    accuracy: evaluateAccuracy(output, expectedOutput),
    relevance: evaluateRelevance(output, question),
    completeness: evaluateCompleteness(output, expectedOutput)
  };

  // Return multiple evaluations
  return [
    { name: "accuracy", value: scores.accuracy },
    { name: "relevance", value: scores.relevance },
    { name: "completeness", value: scores.completeness },
    {
      name: "overall",
      value: (scores.accuracy + scores.relevance + scores.completeness) / 3,
      metadata: { breakdown: scores }
    }
  ];
};

// Run experiment
const result = await dataset.runExperiment({
  name: "Structured QA Evaluation",
  task,
  evaluators: [evaluator]
});

console.log(await result.format({ includeItemResults: true }));

Best Practices

Dataset Organization

  • Use descriptive names: Name datasets clearly to indicate their purpose (e.g., "customer-support-qa-v2", "translation-benchmark-2024")
  • Add metadata: Include relevant context in dataset and item metadata for filtering and analysis
  • Version datasets: Create new dataset versions when making significant changes rather than modifying existing ones
  • Document expected outputs: Always provide expected outputs when available to enable automatic evaluation

Linking Strategy

  • Consistent run names: Use consistent naming conventions for dataset runs (e.g., "model-name-YYYY-MM-DD-version")
  • Add descriptions: Include run descriptions to document the purpose and configuration of each evaluation
  • Use metadata: Attach relevant metadata (model versions, hyperparameters, etc.) to enable comparison and filtering
  • Link to specific observations: When evaluating specific steps in a pipeline, link to the relevant observation rather than the root trace
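The run-naming convention in the first bullet can be made mechanical with a small helper (`buildRunName` is hypothetical, not an SDK function):

```typescript
// Hypothetical helper implementing the "model-name-YYYY-MM-DD-version" convention.
function buildRunName(model: string, version: number, date: Date = new Date()): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${model}-${day}-v${version}`;
}

// e.g. buildRunName("gpt-4", 2, new Date("2024-03-15")) → "gpt-4-2024-03-15-v2"
```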

Performance Optimization

  • Adjust page size: For large datasets, tune fetchItemsPageSize based on your network and memory constraints
  • Control concurrency: Use maxConcurrency in experiments to avoid overwhelming APIs or resources
  • Batch processing: Process large datasets in batches with checkpointing for resilience
  • Parallel execution: Use parallel processing with concurrency limits for faster evaluation

Experiment Design

  • Start simple: Begin with basic evaluators and add complexity as needed
  • Use multiple evaluators: Combine different evaluation approaches (exact match, semantic similarity, factuality, etc.)
  • Include run-level evaluators: Compute aggregate statistics to understand overall performance
  • Track metadata: Include model versions, timestamps, and configuration in experiment metadata
  • Version experiments: Use versioned run names to track experiment iterations

Error Handling

  • Handle failures gracefully: Catch errors during task execution and still link items to track failures
  • Set appropriate timeouts: Configure reasonable timeouts to prevent hanging on slow operations
  • Log errors: Record error details in trace metadata for debugging
  • Continue on failure: Design experiments to continue processing remaining items even if some fail
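For the timeout point above, a minimal wrapper can bound any task promise; `withTimeout` is a hypothetical utility sketch, not part of the SDK:

```typescript
// Hypothetical wrapper: reject if the wrapped promise does not settle within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// e.g. const output = await withTimeout(runYourTask(item.input), 30_000);
```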

Cost Management

  • Control concurrency: Limit concurrent API calls to manage rate limits and costs
  • Cache results: Store experiment results to avoid re-running expensive evaluations
  • Sample testing: Test on a subset of items before running full evaluations
  • Monitor usage: Track token usage and API calls through Langfuse traces
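For sample testing, a small helper can draw an evenly spaced subset of `dataset.items` for a pilot run before committing to the full dataset (`sampleEvenly` is a hypothetical utility):

```typescript
// Hypothetical helper: pick up to `n` evenly spaced items so a pilot run
// covers the dataset's range rather than just its first entries.
function sampleEvenly<T>(items: T[], n: number): T[] {
  if (n >= items.length) return [...items];
  const step = items.length / n;
  return Array.from({ length: n }, (_, i) => items[Math.floor(i * step)]);
}

// e.g. run the linking loop over sampleEvenly(dataset.items, 20) first
```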

Integration with Experiments

Datasets integrate seamlessly with the experiment system. For detailed information about experiment execution, evaluators, and result analysis, see the Experiment Management documentation.

Key Integration Points

  • Automatic tracing: Experiments on datasets automatically create traces and link them to dataset runs
  • Dataset run tracking: All experiment executions on datasets are tracked as dataset runs in Langfuse
  • Result visualization: Dataset run results are available in the Langfuse UI with detailed analytics
  • Comparison tools: Compare multiple dataset runs to track improvements over time
