Langfuse API client for universal JavaScript environments, providing observability, prompt management, datasets, experiments, and scoring capabilities.
The Dataset Operations system provides comprehensive capabilities for working with evaluation datasets, linking them to traces and observations, and running experiments. Datasets are collections of input-output pairs used for systematic evaluation of LLM applications.
Retrieve a dataset by name with all its items, link functions, and experiment functionality.
/**
* Retrieves a dataset by name with all its items and experiment functionality
*
* Fetches a dataset and all its associated items with automatic pagination handling.
* The returned dataset includes enhanced functionality for linking items to traces
* and running experiments directly on the dataset.
*
* @param name - The name of the dataset to retrieve
* @param options - Optional configuration for data fetching
* @returns Promise resolving to enhanced dataset with items and experiment capabilities
*/
async get(
name: string,
options?: {
/** Number of items to fetch per page (default: 50) */
fetchItemsPageSize?: number;
}
): Promise<FetchedDataset>;

Usage Examples:
import { LangfuseClient } from '@langfuse/client';
const langfuse = new LangfuseClient();
// Basic dataset retrieval
const dataset = await langfuse.dataset.get("my-evaluation-dataset");
console.log(`Dataset: ${dataset.name}`);
console.log(`Description: ${dataset.description}`);
console.log(`Items: ${dataset.items.length}`);
console.log(`Metadata:`, dataset.metadata);
// Access dataset items
for (const item of dataset.items) {
console.log('Input:', item.input);
console.log('Expected Output:', item.expectedOutput);
console.log('Metadata:', item.metadata);
}

Handling Large Datasets:
// For large datasets, tune the page size to balance request count against memory
const largeDataset = await langfuse.dataset.get(
"large-benchmark-dataset",
{ fetchItemsPageSize: 100 }
);
console.log(`Loaded ${largeDataset.items.length} items`);
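The slicing loop below can also be factored into a small generic helper. This is an illustrative sketch, not part of the Langfuse API:

```typescript
// Illustrative helper (not part of the Langfuse SDK): split an array of
// dataset items into fixed-size batches for sequential processing.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. chunk([1, 2, 3, 4, 5], 2) yields [[1, 2], [3, 4], [5]]
```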
// Process items in batches
const batchSize = 10;
for (let i = 0; i < largeDataset.items.length; i += batchSize) {
const batch = largeDataset.items.slice(i, i + batchSize);
// Process batch...
}

Accessing Dataset Properties:
const dataset = await langfuse.dataset.get("qa-dataset");
// Dataset metadata
console.log(dataset.id); // Dataset ID
console.log(dataset.name); // Dataset name
console.log(dataset.description); // Description
console.log(dataset.metadata); // Custom metadata
console.log(dataset.projectId); // Project ID
console.log(dataset.createdAt); // Creation timestamp
console.log(dataset.updatedAt); // Last update timestamp
// Item properties
const item = dataset.items[0];
console.log(item.id); // Item ID
console.log(item.datasetId); // Parent dataset ID
console.log(item.input); // Input data
console.log(item.expectedOutput); // Expected output
console.log(item.metadata); // Item metadata
console.log(item.sourceTraceId); // Source trace (if any)
console.log(item.sourceObservationId); // Source observation (if any)
console.log(item.status); // Status (ACTIVE or ARCHIVED)

Enhanced dataset object with additional methods for linking and experiments.
/**
* Enhanced dataset with linking and experiment functionality
*
* Extends the base Dataset type with:
* - Array of items with link functions for connecting to traces
* - runExperiment method for executing experiments directly on the dataset
*
* @public
*/
type FetchedDataset = Dataset & {
/** Dataset items with link functionality for connecting to traces */
items: (DatasetItem & { link: LinkDatasetItemFunction })[];
/** Function to run experiments directly on this dataset */
runExperiment: RunExperimentOnDataset;
};

Properties from Dataset:
interface Dataset {
/** Unique identifier for the dataset */
id: string;
/** Human-readable name for the dataset */
name: string;
/** Optional description explaining the dataset's purpose */
description?: string | null;
/** Custom metadata attached to the dataset */
metadata?: Record<string, any> | null;
/** Project ID this dataset belongs to */
projectId: string;
/** Timestamp when the dataset was created */
createdAt: string;
/** Timestamp when the dataset was last updated */
updatedAt: string;
}

Individual item within a dataset containing input, expected output, and metadata.
/**
* Dataset item with input/output pair for evaluation
*
* Represents a single test case within a dataset. Each item can contain
* any type of input and expected output, along with optional metadata
* and linkage to source traces/observations.
*
* @public
*/
interface DatasetItem {
/** Unique identifier for the dataset item */
id: string;
/** ID of the parent dataset */
datasetId: string;
/** Name of the parent dataset */
datasetName: string;
/** Input data (can be any type: string, object, array, etc.) */
input?: any;
/** Expected output for evaluation (can be any type) */
expectedOutput?: any;
/** Custom metadata for this item */
metadata?: Record<string, any> | null;
/** ID of the trace this item was created from (if applicable) */
sourceTraceId?: string | null;
/** ID of the observation this item was created from (if applicable) */
sourceObservationId?: string | null;
/** Status of the item (ACTIVE or ARCHIVED) */
status: "ACTIVE" | "ARCHIVED";
/** Timestamp when the item was created */
createdAt: string;
/** Timestamp when the item was last updated */
updatedAt: string;
}

Function type for linking dataset items to OpenTelemetry spans for tracking experiments.
/**
* Links dataset items to OpenTelemetry spans
*
* Creates a connection between a dataset item and a trace/observation,
* enabling tracking of which dataset items were used in which experiments.
* This is essential for creating dataset runs and tracking experiment lineage.
*
* @param obj - Object containing the OpenTelemetry span
* @param obj.otelSpan - The OpenTelemetry span from a Langfuse observation
* @param runName - Name of the experiment run for grouping related items
* @param runArgs - Optional configuration for the dataset run
* @returns Promise resolving to the created dataset run item
*
* @public
*/
type LinkDatasetItemFunction = (
obj: { otelSpan: Span },
runName: string,
runArgs?: {
/** Description of the dataset run */
description?: string;
/** Additional metadata for the dataset run */
metadata?: any;
}
) => Promise<DatasetRunItem>;

Result of linking a dataset item to a trace execution.
/**
* Linked dataset run item
*
* Represents the connection between a dataset item and a specific
* trace execution within a dataset run. Used for tracking experiment results.
*
* @public
*/
interface DatasetRunItem {
/** Unique identifier for the run item */
id: string;
/** ID of the dataset run this item belongs to */
datasetRunId: string;
/** Name of the dataset run this item belongs to */
datasetRunName: string;
/** ID of the dataset item */
datasetItemId: string;
/** ID of the trace this run item is linked to */
traceId: string;
/** Optional ID of the observation this run item is linked to */
observationId?: string;
/** Timestamp when the run item was created */
createdAt: string;
/** Timestamp when the run item was last updated */
updatedAt: string;
}

Function type for running experiments directly on fetched datasets.
/**
* Runs experiments on Langfuse datasets
*
* This function type is attached to fetched datasets to enable convenient
* experiment execution. The data parameter is automatically provided from
* the dataset items.
*
* @param params - Experiment parameters (excluding data)
* @returns Promise resolving to experiment results
*
* @public
*/
type RunExperimentOnDataset = (
params: Omit<ExperimentParams<any, any, Record<string, any>>, "data">
) => Promise<ExperimentResult<any, any, Record<string, any>>>;

import { LangfuseClient } from '@langfuse/client';
const langfuse = new LangfuseClient({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
});
// Fetch dataset
const dataset = await langfuse.dataset.get("customer-support-qa");
console.log(`Dataset: ${dataset.name}`);
console.log(`Total items: ${dataset.items.length}`);
// Explore items
dataset.items.forEach((item, index) => {
console.log(`\nItem ${index + 1}:`);
console.log(' Input:', item.input);
console.log(' Expected:', item.expectedOutput);
if (item.metadata) {
console.log(' Metadata:', item.metadata);
}
});

Link dataset items to trace executions to create dataset runs and track experiment results.
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
const langfuse = new LangfuseClient();
// Fetch dataset
const dataset = await langfuse.dataset.get("qa-benchmark");
const runName = "gpt-4-evaluation-v1";
// Process each item and link to traces
for (const item of dataset.items) {
// Create a trace for this execution
const span = startObservation("qa-task", {
input: item.input,
metadata: { datasetItemId: item.id }
});
try {
// Execute your task
const output = await runYourTask(item.input);
// Update trace with output
span.update({ output });
// Link dataset item to this trace
await item.link({ otelSpan: span.otelSpan }, runName);
} catch (error) {
// Handle errors
span.update({
output: { error: String(error) },
level: "ERROR"
});
// Still link the item (to track failures)
await item.link({ otelSpan: span.otelSpan }, runName);
} finally {
span.end();
}
}
console.log(`Completed dataset run: ${runName}`);

Add descriptions and metadata to dataset runs for better organization.
const dataset = await langfuse.dataset.get("model-comparison");
const runName = "claude-3-opus-eval";
for (const item of dataset.items) {
const span = startObservation("evaluation-task", {
input: item.input
});
const output = await evaluateWithClaude(item.input);
span.update({ output });
span.end();
// Link with descriptive metadata
await item.link({ otelSpan: span.otelSpan }, runName, {
description: "Claude 3 Opus evaluation on reasoning tasks",
metadata: {
modelVersion: "claude-3-opus-20240229",
temperature: 0.7,
maxTokens: 1000,
timestamp: new Date().toISOString(),
experimentGroup: "reasoning-tasks"
}
});
}

Link dataset items to specific observations within a trace hierarchy.
const dataset = await langfuse.dataset.get("translation-dataset");
const runName = "translation-pipeline-v2";
for (const item of dataset.items) {
// Create parent trace
const trace = startObservation("translation-pipeline", {
input: item.input
});
// Create preprocessing observation
const preprocessor = trace.startObservation("preprocessing", {
input: item.input
});
const preprocessed = await preprocess(item.input);
preprocessor.update({ output: preprocessed });
preprocessor.end();
// Create translation observation (the main task)
const translator = trace.startObservation("translation", {
input: preprocessed,
model: "gpt-4"
}, { asType: "generation" });
const translated = await translate(preprocessed);
translator.update({ output: translated });
translator.end();
// Create postprocessing observation
const postprocessor = trace.startObservation("postprocessing", {
input: translated
});
const final = await postprocess(translated);
postprocessor.update({ output: final });
postprocessor.end();
trace.update({ output: final });
trace.end();
// Link to the specific translation observation
await item.link({ otelSpan: translator.otelSpan }, runName, {
description: "Translation quality evaluation",
metadata: { pipeline: "v2", stage: "translation" }
});
}

Execute experiments directly on datasets with automatic tracing and evaluation.
import { LangfuseClient } from '@langfuse/client';
import { observeOpenAI } from '@langfuse/openai';
import OpenAI from 'openai';
const langfuse = new LangfuseClient();
// Fetch dataset
const dataset = await langfuse.dataset.get("capital-cities");
// Define task
const task = async ({ input }: { input: string }) => {
const client = observeOpenAI(new OpenAI());
const response = await client.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "user", content: `What is the capital of ${input}?` }
]
});
return response.choices[0].message.content;
};
// Define evaluator
const exactMatchEvaluator = async ({ output, expectedOutput }) => ({
name: "exact_match",
value: output === expectedOutput ? 1 : 0
});
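Exact string comparison is brittle against stray whitespace and casing differences. A lightly normalized variant — an illustrative sketch, not a Langfuse built-in — is often more forgiving:

```typescript
// Illustrative evaluator (hypothetical, not part of the SDK): compare
// output and expected output after trimming whitespace and lowercasing,
// so " Paris " still matches "paris".
const normalizedMatchEvaluator = async ({ output, expectedOutput }: {
  output: unknown;
  expectedOutput: unknown;
}) => {
  const normalize = (value: unknown) => String(value ?? "").trim().toLowerCase();
  return {
    name: "normalized_match",
    value: normalize(output) === normalize(expectedOutput) ? 1 : 0
  };
};
```

It can be passed in the `evaluators` array alongside (or instead of) the exact-match evaluator above.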
// Run experiment
const result = await dataset.runExperiment({
name: "Capital Cities Evaluation",
runName: "gpt-4-baseline",
description: "Baseline evaluation with GPT-4",
task,
evaluators: [exactMatchEvaluator],
maxConcurrency: 5
});
// View results
console.log(await result.format());
console.log(`Dataset run URL: ${result.datasetRunUrl}`);

import { LangfuseClient, Evaluator } from '@langfuse/client';
import OpenAI from 'openai';
import { createEvaluatorFromAutoevals } from '@langfuse/client';
import { Levenshtein, Factuality } from 'autoevals';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("qa-dataset");
// Custom evaluator using OpenAI
const semanticSimilarityEvaluator: Evaluator = async ({
output,
expectedOutput
}) => {
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "user",
content: `Rate the semantic similarity between these two answers on a scale of 0 to 1:
Answer 1: ${output}
Answer 2: ${expectedOutput}
Respond with just a number between 0 and 1.`
}
]
});
const score = parseFloat(response.choices[0].message.content || "0");
return {
name: "semantic_similarity",
value: score,
comment: `Comparison between output and expected output`
};
};
// Run experiment with multiple evaluators
const result = await dataset.runExperiment({
name: "Multi-Evaluator Experiment",
runName: "comprehensive-eval-v1",
task: myTask,
evaluators: [
// AutoEvals evaluators
createEvaluatorFromAutoevals(Levenshtein),
createEvaluatorFromAutoevals(Factuality),
// Custom evaluator
semanticSimilarityEvaluator
]
});
// Analyze results
console.log(await result.format({ includeItemResults: true }));
// Access individual scores
result.itemResults.forEach((item, index) => {
console.log(`\nItem ${index + 1}:`);
console.log('Input:', item.input);
console.log('Output:', item.output);
console.log('Expected:', item.expectedOutput);
console.log('Evaluations:');
item.evaluations.forEach(evaluation => {
console.log(` ${evaluation.name}: ${evaluation.value}`);
if (evaluation.comment) {
console.log(` Comment: ${evaluation.comment}`);
}
});
});

Use run-level evaluators to compute aggregate statistics across all items.
import { LangfuseClient, RunEvaluator } from '@langfuse/client';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("benchmark-dataset");
// Define a run-level evaluator for computing averages
const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {
const scores = itemResults
.flatMap(result => result.evaluations)
.filter(evaluation => evaluation.name === "accuracy")
.map(evaluation => evaluation.value as number);
const average = scores.reduce((sum, score) => sum + score, 0) / scores.length;
return {
name: "average_accuracy",
value: average,
comment: `Average accuracy across ${scores.length} items`
};
};
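Note that the filter-and-average step above yields `NaN` when no item produced an "accuracy" evaluation. A guarded helper — illustrative, not part of the SDK — avoids that edge case and can be reused across run evaluators:

```typescript
// Illustrative helper: average all evaluation values with a given name,
// returning null instead of NaN when no matching scores exist.
function meanScore(
  itemResults: { evaluations: { name: string; value: unknown }[] }[],
  scoreName: string
): number | null {
  const scores = itemResults
    .flatMap(result => result.evaluations)
    .filter(evaluation => evaluation.name === scoreName)
    .map(evaluation => evaluation.value as number);
  if (scores.length === 0) return null;
  return scores.reduce((sum, score) => sum + score, 0) / scores.length;
}
```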
// Run experiment
const result = await dataset.runExperiment({
name: "Accuracy Benchmark",
task: myTask,
evaluators: [accuracyEvaluator], // per-item evaluator emitting { name: "accuracy", ... } scores
runEvaluators: [averageScoreEvaluator]
});
// Check aggregate results
console.log('Run-level evaluations:');
result.runEvaluations.forEach(evaluation => {
console.log(`${evaluation.name}: ${evaluation.value}`);
if (evaluation.comment) {
console.log(` ${evaluation.comment}`);
}
});

Run experiments on the same dataset with different models for comparison.
import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("reasoning-tasks");
const openai = new OpenAI();
// Define models to compare
const models = [
"gpt-4",
"gpt-3.5-turbo",
"gpt-4-turbo-preview"
];
const evaluator = async ({ output, expectedOutput }) => ({
name: "correctness",
value: evaluateCorrectness(output, expectedOutput)
});
// Run experiment for each model
const results = [];
for (const model of models) {
const result = await dataset.runExperiment({
name: "Model Comparison",
runName: `${model}-evaluation`,
description: `Evaluation with ${model}`,
metadata: { model },
task: async ({ input }) => {
const response = await openai.chat.completions.create({
model,
messages: [{ role: "user", content: input }]
});
return response.choices[0].message.content;
},
evaluators: [evaluator],
maxConcurrency: 3
});
results.push({ model, result });
console.log(`Completed: ${model}`);
console.log(await result.format());
}
// Compare results
console.log("\n=== Model Comparison Summary ===");
results.forEach(({ model, result }) => {
const avgScore = result.itemResults
.flatMap(r => r.evaluations)
.reduce((sum, e) => sum + (e.value as number), 0) / result.itemResults.length;
console.log(`${model}: ${avgScore.toFixed(3)}`);
console.log(` URL: ${result.datasetRunUrl}`);
});

Process datasets incrementally with checkpointing for long-running experiments.
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import * as fs from 'fs';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("large-dataset");
const runName = "incremental-processing-v1";
// Load checkpoint if exists
const checkpointFile = './checkpoint.json';
let processedIds = new Set<string>();
if (fs.existsSync(checkpointFile)) {
const checkpoint = JSON.parse(fs.readFileSync(checkpointFile, 'utf-8'));
processedIds = new Set(checkpoint.processedIds);
console.log(`Resuming from checkpoint: ${processedIds.size} items processed`);
}
// Process items
for (const [index, item] of dataset.items.entries()) {
// Skip already processed items
if (processedIds.has(item.id)) {
continue;
}
console.log(`Processing item ${index + 1}/${dataset.items.length}`);
try {
const span = startObservation("processing-task", {
input: item.input,
metadata: { itemId: item.id }
});
const output = await processItem(item.input);
span.update({ output });
span.end();
await item.link({ otelSpan: span.otelSpan }, runName, {
metadata: { batchIndex: Math.floor(index / 100) }
});
// Update checkpoint
processedIds.add(item.id);
fs.writeFileSync(
checkpointFile,
JSON.stringify({ processedIds: Array.from(processedIds) })
);
} catch (error) {
console.error(`Error processing item ${item.id}:`, error);
// Continue with next item
}
}
console.log(`Completed processing ${processedIds.size} items`);
// Clean up checkpoint
fs.unlinkSync(checkpointFile);

import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import pLimit from 'p-limit';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("parallel-dataset");
const runName = "parallel-processing-v1";
// Limit concurrent operations
const limit = pLimit(10);
// Process items in parallel with concurrency limit
const tasks = dataset.items.map(item =>
limit(async () => {
const span = startObservation("parallel-task", {
input: item.input
});
try {
const output = await processItem(item.input);
span.update({ output });
await item.link({ otelSpan: span.otelSpan }, runName);
return { success: true, itemId: item.id };
} catch (error) {
span.update({
output: { error: String(error) },
level: "ERROR"
});
await item.link({ otelSpan: span.otelSpan }, runName);
return { success: false, itemId: item.id, error };
} finally {
span.end();
}
})
);
// Wait for all tasks to complete
const results = await Promise.all(tasks);
// Summarize results
const successful = results.filter(r => r.success).length;
const failed = results.filter(r => !r.success).length;
console.log(`Completed: ${successful} successful, ${failed} failed`);

Use datasets with LangChain applications for systematic evaluation.
import { LangfuseClient } from '@langfuse/client';
import { startObservation } from '@langfuse/tracing';
import { ChatOpenAI } from '@langchain/openai';
import { PromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
const langfuse = new LangfuseClient();
const dataset = await langfuse.dataset.get("langchain-eval");
// Create LangChain components
const prompt = PromptTemplate.fromTemplate(
"Translate the following to French: {text}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);
const runName = "langchain-translation-eval";
// Process each dataset item
for (const item of dataset.items) {
// Create trace for this execution
const span = startObservation("langchain-execution", {
input: { text: item.input },
metadata: { chainType: "translation" }
});
try {
// Execute chain
const result = await chain.invoke({ text: item.input });
// Update trace with output
span.update({ output: result });
// Link dataset item
await item.link({ otelSpan: span.otelSpan }, runName, {
description: "LangChain translation evaluation"
});
// Score the result
langfuse.score.observation(span, {
name: "translation_quality",
value: computeQuality(result, item.expectedOutput)
});
} catch (error) {
span.update({
output: { error: String(error) },
level: "ERROR"
});
await item.link({ otelSpan: span.otelSpan }, runName);
}
span.end();
}
// Flush scores
await langfuse.flush();

import { LangfuseClient } from '@langfuse/client';
const langfuse = new LangfuseClient();
// Fetch dataset with structured inputs
const dataset = await langfuse.dataset.get("structured-qa");
// Task that handles structured input
const task = async ({ input }) => {
// Input is an object with specific structure
const { question, context } = input;
const response = await callLLM({
systemPrompt: "Answer questions based on the context.",
userPrompt: `Context: ${context}\n\nQuestion: ${question}`
});
return response;
};
// Evaluator that handles structured output
const evaluator = async ({ input, output, expectedOutput }) => {
const { question } = input;
// Complex evaluation logic
const scores = {
accuracy: evaluateAccuracy(output, expectedOutput),
relevance: evaluateRelevance(output, question),
completeness: evaluateCompleteness(output, expectedOutput)
};
// Return multiple evaluations
return [
{ name: "accuracy", value: scores.accuracy },
{ name: "relevance", value: scores.relevance },
{ name: "completeness", value: scores.completeness },
{
name: "overall",
value: (scores.accuracy + scores.relevance + scores.completeness) / 3,
metadata: { breakdown: scores }
}
];
};
// Run experiment
const result = await dataset.runExperiment({
name: "Structured QA Evaluation",
task,
evaluators: [evaluator]
});
console.log(await result.format({ includeItemResults: true }));

Best practices:
- Tune fetchItemsPageSize based on your network and memory constraints
- Limit maxConcurrency in experiments to avoid overwhelming APIs or resources

Datasets integrate seamlessly with the experiment system. For detailed information about experiment execution, evaluators, and result analysis, see the Experiment Management documentation.
Install with Tessl CLI
npx tessl i tessl/npm-langfuse--client@4.2.1