
# Experiment Execution

The Experiment Execution system provides a comprehensive framework for running experiments that test models or tasks against datasets, with support for automatic evaluation, scoring, tracing, and result analysis. It enables systematic testing, comparison, and evaluation of AI models and prompts.

## Capabilities

### Run Experiment

Execute an experiment by running a task on each data item and evaluating the results with full tracing integration.

```typescript { .api }
/**
 * Executes an experiment by running a task on each data item and evaluating the results
 *
 * This method orchestrates the complete experiment lifecycle:
 * 1. Executes the task function on each data item with proper tracing
 * 2. Runs item-level evaluators on each task output
 * 3. Executes run-level evaluators on the complete result set
 * 4. Links results to dataset runs (for Langfuse datasets)
 * 5. Stores all scores and traces in Langfuse
 *
 * @param config - The experiment configuration
 * @returns Promise that resolves to experiment results including itemResults, runEvaluations, and format function
 */
run<Input = any, ExpectedOutput = any, Metadata extends Record<string, any> = Record<string, any>>(
  config: ExperimentParams<Input, ExpectedOutput, Metadata>
): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>;
```

**Usage Examples:**

```typescript
import { LangfuseClient } from '@langfuse/client';
import OpenAI from 'openai';

const langfuse = new LangfuseClient();
const openai = new OpenAI();

// Basic experiment with custom data
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  description: "Testing model knowledge of world capitals",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});

console.log(await result.format());

// Experiment on Langfuse dataset
const dataset = await langfuse.dataset.get("qa-dataset");

const datasetResult = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Testing GPT-4 on our QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "accuracy",
      value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0,
      comment: output === expectedOutput ? "Perfect match" : "Case-insensitive match"
    })
  ]
});

// Multiple evaluators
const multiEvalResult = await langfuse.experiment.run({
  name: "Translation Quality Test",
  data: [
    { input: "Hello world", expectedOutput: "Hola mundo" },
    { input: "Good morning", expectedOutput: "Buenos días" }
  ],
  task: async ({ input }) => translateText(input, 'es'),
  evaluators: [
    // Evaluator 1: Exact match
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    }),
    // Evaluator 2: BLEU score
    async ({ output, expectedOutput }) => ({
      name: "bleu_score",
      value: calculateBleuScore(output, expectedOutput),
      comment: "Translation quality metric"
    }),
    // Evaluator 3: Length similarity
    async ({ output, expectedOutput }) => ({
      name: "length_similarity",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    })
  ]
});

// Run-level evaluators for aggregate analysis
const aggregateResult = await langfuse.experiment.run({
  name: "Sentiment Classification",
  data: sentimentDataset,
  task: classifySentiment,
  evaluators: [
    async ({ output, expectedOutput }) => ({
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    })
  ],
  runEvaluators: [
    // Average accuracy across all items
    async ({ itemResults }) => {
      const accuracyScores = itemResults
        .flatMap(r => r.evaluations)
        .filter(e => e.name === "accuracy")
        .map(e => e.value as number);

      const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;

      return {
        name: "average_accuracy",
        value: average,
        comment: `Overall accuracy: ${(average * 100).toFixed(1)}%`
      };
    },
    // Precision calculation
    async ({ itemResults }) => {
      let truePositives = 0;
      let falsePositives = 0;

      for (const result of itemResults) {
        if (result.output === "positive") {
          if (result.expectedOutput === "positive") {
            truePositives++;
          } else {
            falsePositives++;
          }
        }
      }

      const precision = truePositives / (truePositives + falsePositives);

      return {
        name: "precision",
        value: precision,
        comment: `Precision for positive class: ${(precision * 100).toFixed(1)}%`
      };
    }
  ]
});

// Concurrency control with maxConcurrency
const largeScaleResult = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  description: "Processing 1000 items with rate limiting",
  data: largeDataset,
  task: expensiveModelCall,
  maxConcurrency: 5, // Process max 5 items simultaneously
  evaluators: [accuracyEvaluator]
});

// Custom run name
const customRunResult = await langfuse.experiment.run({
  name: "Model Comparison",
  runName: "gpt-4-turbo-2024-01-15",
  description: "Testing latest GPT-4 Turbo model",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// With metadata
const metadataResult = await langfuse.experiment.run({
  name: "Parameter Sweep",
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    max_tokens: 1000,
    experiment_version: "v2.1"
  },
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Formatting results
const formattedResult = await langfuse.experiment.run({
  name: "Test Run",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator]
});

// Format summary only (default)
console.log(await formattedResult.format());

// Format with detailed item results
console.log(await formattedResult.format({ includeItemResults: true }));

// Access raw results
console.log(`Processed ${formattedResult.itemResults.length} items`);
console.log(`Run evaluations:`, formattedResult.runEvaluations);
console.log(`Dataset run URL:`, formattedResult.datasetRunUrl);
```
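The `maxConcurrency` behavior shown above can be approximated with a small standalone limiter. This is a sketch for illustration only, not the SDK's internal implementation; `mapWithConcurrency` is a hypothetical helper name:

```typescript
// Hypothetical helper (not part of the SDK): run an async task over items
// with at most `maxConcurrency` executions in flight at once, preserving
// result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each worker repeatedly claims the next unprocessed index.
  // Index claiming is safe because JS only yields at `await` points.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await task(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(maxConcurrency, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With `maxConcurrency: 5`, the experiment runner behaves analogously: at most five task invocations run at any moment, which is useful for rate-limited model APIs.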

**OpenTelemetry Integration:**

The experiment system automatically integrates with OpenTelemetry for distributed tracing:

```typescript
import { LangfuseClient } from '@langfuse/client';
import { LangfuseTraceClient } from '@langfuse/tracing';

// Ensure OpenTelemetry is configured
const langfuse = new LangfuseClient();

// Experiments automatically create traces for each task execution
const result = await langfuse.experiment.run({
  name: "Traced Experiment",
  data: testData,
  task: async ({ input }) => {
    // This task execution is automatically wrapped in a trace
    // with name "experiment-item-run"
    const output = await processInput(input);
    return output;
  },
  evaluators: [myEvaluator]
});

// Each item result includes trace information
for (const itemResult of result.itemResults) {
  console.log(`Trace ID: ${itemResult.traceId}`);
  const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
  console.log(`View trace: ${traceUrl}`);
}

// Warning if OpenTelemetry is not set up
// The system will log:
// "OpenTelemetry has not been set up. Traces will not be sent to Langfuse."
```

**Error Handling:**

```typescript
// Task errors are caught and logged
const resilientResult = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: async ({ input }) => {
    try {
      return await riskyOperation(input);
    } catch (error) {
      // Task errors are caught, logged, and the item is skipped
      throw error;
    }
  },
  evaluators: [
    async ({ output, expectedOutput }) => {
      try {
        return {
          name: "score",
          value: calculateScore(output, expectedOutput)
        };
      } catch (error) {
        // Evaluator errors are caught and logged
        // Other evaluators continue to run
        throw error;
      }
    }
  ]
});

// Result contains only successfully processed items
console.log(`Successfully processed: ${resilientResult.itemResults.length} items`);

// Run evaluators also handle errors gracefully
const robustResult = await langfuse.experiment.run({
  name: "Robust Experiment",
  data: testData,
  task: myTask,
  evaluators: [myEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      try {
        return {
          name: "aggregate_metric",
          value: calculateAggregate(itemResults)
        };
      } catch (error) {
        // Run evaluator errors are caught and logged
        // Other run evaluators continue to run
        throw error;
      }
    }
  ]
});
```
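The skip-on-error semantics described above — a failed item is logged and excluded while the rest of the run continues — can be illustrated with a self-contained sketch. `runResilient` is a hypothetical helper, not the SDK's internals:

```typescript
// Hypothetical helper: run a task over items, collecting only successful
// results. Failures are logged and the item is skipped, mirroring the
// experiment runner's skip-on-error behavior described above.
async function runResilient<T, R>(
  items: T[],
  task: (item: T) => Promise<R>
): Promise<{ item: T; output: R }[]> {
  const successes: { item: T; output: R }[] = [];
  for (const item of items) {
    try {
      successes.push({ item, output: await task(item) });
    } catch (error) {
      // The error is recorded, but the run continues with the next item.
      console.error("Item failed and was skipped:", error);
    }
  }
  return successes;
}
```

This is why `itemResults.length` can be smaller than the input data length: only items whose task completed are present.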

## Type Definitions

### ExperimentParams

Configuration parameters for experiment execution.

```typescript { .api }
type ExperimentParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * Human-readable name for the experiment.
   *
   * This name will appear in Langfuse UI and experiment results.
   * Choose a descriptive name that identifies the experiment's purpose.
   */
  name: string;

  /**
   * Optional exact name for the experiment run.
   *
   * If provided, this will be used as the exact dataset run name if the data
   * contains Langfuse dataset items. If not provided, this will default to
   * the experiment name appended with an ISO timestamp.
   */
  runName?: string;

  /**
   * Optional description explaining the experiment's purpose.
   *
   * Provide context about what you're testing, methodology, or goals.
   * This helps with experiment tracking and result interpretation.
   */
  description?: string;

  /**
   * Optional metadata to attach to the experiment run.
   *
   * Store additional context like model versions, hyperparameters,
   * or any other relevant information for analysis and comparison.
   */
  metadata?: Record<string, any>;

  /**
   * Array of data items to process.
   *
   * Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse.
   * Each item should contain input data and optionally expected output.
   */
  data: ExperimentItem<Input, ExpectedOutput, Metadata>[];

  /**
   * The task function to execute on each data item.
   *
   * This function receives input data and produces output that will be evaluated.
   * It should encapsulate the model or system being tested.
   */
  task: ExperimentTask<Input, ExpectedOutput, Metadata>;

  /**
   * Optional array of evaluator functions to assess each item's output.
   *
   * Each evaluator receives input, output, and expected output (if available)
   * and returns evaluation results. Multiple evaluators enable comprehensive assessment.
   */
  evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Optional array of run-level evaluators to assess the entire experiment.
   *
   * These evaluators receive all item results and can perform aggregate analysis
   * like calculating averages, detecting patterns, or statistical analysis.
   */
  runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[];

  /**
   * Maximum number of concurrent task executions (default: Infinity).
   *
   * Controls parallelism to manage resource usage and API rate limits.
   * Set lower values for expensive operations or rate-limited services.
   */
  maxConcurrency?: number;
};
```
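The `runName` fallback documented above (the experiment name with an ISO timestamp appended) can be sketched as follows. The `" - "` separator is an assumption for illustration; the SDK's exact format may differ:

```typescript
// Hypothetical sketch of the documented runName fallback: use runName when
// given, otherwise append an ISO timestamp to the experiment name.
// The " - " separator is an assumption, not the SDK's guaranteed format.
function resolveRunName(name: string, runName?: string): string {
  return runName ?? `${name} - ${new Date().toISOString()}`;
}
```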

**Usage Examples:**

```typescript
import type { ExperimentParams } from '@langfuse/client';

// Type-safe experiment configuration
const config: ExperimentParams<string, string> = {
  name: "Capital Cities",
  description: "Testing geography knowledge",
  metadata: {
    model: "gpt-4",
    version: "v1.0"
  },
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" }
  ],
  task: async ({ input }) => {
    return await getCapital(input);
  },
  evaluators: [exactMatchEvaluator],
  runEvaluators: [averageScoreEvaluator],
  maxConcurrency: 3
};

await langfuse.experiment.run(config);

// Generic types for complex data
interface CustomInput {
  question: string;
  context: string[];
}

interface CustomOutput {
  answer: string;
  confidence: number;
}

interface CustomMetadata {
  category: string;
  difficulty: "easy" | "medium" | "hard";
}

const typedConfig: ExperimentParams<CustomInput, CustomOutput, CustomMetadata> = {
  name: "QA with Context",
  data: [
    {
      input: {
        question: "What is AI?",
        context: ["AI stands for Artificial Intelligence"]
      },
      expectedOutput: {
        answer: "Artificial Intelligence",
        confidence: 0.95
      },
      metadata: {
        category: "technology",
        difficulty: "easy"
      }
    }
  ],
  task: async ({ input }) => {
    // Type-safe input and output
    return await qaModel(input.question, input.context);
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      return {
        name: "accuracy",
        value: output.answer === expectedOutput?.answer ? 1 : 0
      };
    }
  ]
};
```

### ExperimentTask

Function type for experiment tasks that process input data and return output.

```typescript { .api }
/**
 * Function type for experiment tasks that process input data and return output
 *
 * The task function is the core component being tested in an experiment.
 * It receives either an ExperimentItem or DatasetItem and produces output
 * that will be evaluated.
 *
 * @param params - Either an ExperimentItem or DatasetItem containing input and metadata
 * @returns Promise resolving to the task's output (any type)
 */
type ExperimentTask<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: ExperimentTaskParams<Input, ExpectedOutput, Metadata>
) => Promise<any>;

type ExperimentTaskParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = ExperimentItem<Input, ExpectedOutput, Metadata>;
```

**Usage Examples:**

```typescript
import type { ExperimentTask } from '@langfuse/client';

// Simple task function
const simpleTask: ExperimentTask = async ({ input }) => {
  return await processInput(input);
};

// Task with type safety
const typedTask: ExperimentTask<string, string> = async ({ input, metadata }) => {
  // input is typed as string
  // metadata is typed as Record<string, any>
  return await processString(input);
};

// Task accessing expected output (for reference)
const referenceTask: ExperimentTask = async ({ input, expectedOutput }) => {
  // Can access expectedOutput for context (but shouldn't use it for cheating!)
  console.log(`Processing input, expecting: ${expectedOutput}`);
  return await myModel(input);
};

// Task with custom types
interface QuestionInput {
  question: string;
  context: string;
}

const qaTask: ExperimentTask<QuestionInput, string> = async ({ input, metadata }) => {
  const { question, context } = input;
  return await answerQuestion(question, context);
};

// Task handling both ExperimentItem and DatasetItem
const universalTask: ExperimentTask = async (item) => {
  // Works with both types
  const input = item.input;
  const meta = item.metadata || {};

  // Check if it's a DatasetItem (has id and datasetId)
  if ('id' in item && 'datasetId' in item) {
    console.log(`Processing dataset item: ${item.id}`);
  }

  return await process(input, meta);
};

// Task with error handling
const robustTask: ExperimentTask = async ({ input }) => {
  try {
    return await riskyOperation(input);
  } catch (error) {
    console.error(`Task failed for input:`, input, error);
    throw error; // Re-throw to skip this item
  }
};

// Task with nested tracing
const tracedTask: ExperimentTask = async ({ input }) => {
  // Nested operations are automatically traced
  const step1 = await preprocessInput(input);
  const step2 = await modelInference(step1);
  const step3 = await postprocess(step2);
  return step3;
};
```

### ExperimentItem

Data item type for experiment inputs, supporting both custom items and Langfuse dataset items.

```typescript { .api }
/**
 * Experiment data item or dataset item
 *
 * Can be either a custom item with input/expectedOutput/metadata
 * or a DatasetItem from Langfuse
 */
type ExperimentItem<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> =
  | {
      /**
       * The input data to pass to the task function.
       *
       * Can be any type - string, object, array, etc. This data will be passed
       * to your task function as the `input` parameter.
       */
      input?: Input;

      /**
       * The expected output for evaluation purposes.
       *
       * Optional ground truth or reference output for this input.
       * Used by evaluators to assess task performance.
       */
      expectedOutput?: ExpectedOutput;

      /**
       * Optional metadata to attach to the experiment item.
       *
       * Store additional context, tags, or custom data related to this specific item.
       * This metadata will be available in traces and evaluators.
       */
      metadata?: Metadata;
    }
  | DatasetItem;
```

**Usage Examples:**

```typescript
import type { ExperimentItem } from '@langfuse/client';

// Simple string items
const stringItems: ExperimentItem<string, string>[] = [
  { input: "Hello", expectedOutput: "Hola" },
  { input: "Goodbye", expectedOutput: "Adiós" }
];

// Complex structured items
interface QAInput {
  question: string;
  context: string;
}

const qaItems: ExperimentItem<QAInput, string>[] = [
  {
    input: {
      question: "What is AI?",
      context: "AI stands for Artificial Intelligence..."
    },
    expectedOutput: "Artificial Intelligence"
  }
];

// Items with metadata
const itemsWithMetadata: ExperimentItem<string, string, { category: string }>[] = [
  {
    input: "Test input",
    expectedOutput: "Expected output",
    metadata: {
      category: "technology"
    }
  }
];

// Items without expected output (evaluation based on output only)
const noExpectedOutput: ExperimentItem<string, never>[] = [
  { input: "Generate creative text" }
  // No expectedOutput - evaluators won't have ground truth
];

// Mixed with Langfuse dataset items
const dataset = await langfuse.dataset.get("my-dataset");
const mixedItems: ExperimentItem[] = [
  // Custom items
  { input: "Custom input", expectedOutput: "Custom output" },
  // Dataset items
  ...dataset.items
];

// Accessing item properties in task
const task: ExperimentTask = async (item) => {
  if ('id' in item && 'datasetId' in item) {
    // It's a DatasetItem
    console.log(`Dataset item ID: ${item.id}`);
    console.log(`Dataset ID: ${item.datasetId}`);
  }

  return await process(item.input);
};
```

### Evaluator

Function type for item-level evaluators that assess individual task outputs.

```typescript { .api }
/**
 * Evaluator function for item-level evaluation
 *
 * Receives input, output, expected output, and metadata,
 * and returns evaluation results as Evaluation object(s).
 *
 * @param params - Parameters including input, output, expectedOutput, and metadata
 * @returns Promise resolving to single Evaluation or array of Evaluations
 */
type Evaluator<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = (
  params: EvaluatorParams<Input, ExpectedOutput, Metadata>
) => Promise<Evaluation[] | Evaluation>;

type EvaluatorParams<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original input data passed to the task.
   *
   * Use this for context-aware evaluations or input-output relationship analysis.
   */
  input: Input;

  /**
   * The output produced by the task.
   *
   * This is the actual result returned by your task function.
   */
  output: any;

  /**
   * The expected output for comparison (optional).
   *
   * This is the ground truth or expected result for the given input.
   */
  expectedOutput?: ExpectedOutput;

  /**
   * Optional metadata attached to the experiment item.
   *
   * Use this for metadata-aware evaluations, as in the examples below.
   */
  metadata?: Metadata;
};
```

**Usage Examples:**

```typescript
import type { Evaluator, Evaluation } from '@langfuse/client';

// Simple exact match evaluator
const exactMatch: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0
});

// Case-insensitive match with comment
const caseInsensitiveMatch: Evaluator = async ({ output, expectedOutput }) => {
  const match = output.toLowerCase() === expectedOutput?.toLowerCase();
  return {
    name: "case_insensitive_match",
    value: match ? 1 : 0,
    comment: match ? "Match (ignoring case)" : "No match"
  };
};

// Evaluator returning multiple scores
const comprehensiveEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  return [
    {
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "length_match",
      value: Math.abs(output.length - expectedOutput.length) <= 5 ? 1 : 0
    },
    {
      name: "similarity",
      value: calculateSimilarity(output, expectedOutput),
      comment: "Cosine similarity score"
    }
  ];
};

// Type-safe evaluator
const typedEvaluator: Evaluator<string, string> = async ({ input, output, expectedOutput }) => {
  // All parameters are typed
  return {
    name: "accuracy",
    value: output === expectedOutput ? 1 : 0,
    metadata: { input_length: input.length }
  };
};

// Evaluator using input context
const contextAwareEvaluator: Evaluator = async ({ input, output }) => {
  const isValid = validateOutput(output, input);
  return {
    name: "validity",
    value: isValid ? 1 : 0,
    comment: isValid ? "Output valid for input" : "Output invalid"
  };
};

// Evaluator with metadata
const categoryEvaluator: Evaluator<any, any, { category: string }> = async ({
  output,
  expectedOutput,
  metadata
}) => {
  const score = calculateScore(output, expectedOutput);
  return {
    name: "category_score",
    value: score,
    metadata: {
      category: metadata?.category,
      timestamp: new Date().toISOString()
    }
  };
};

// Evaluator with different data types
const numericEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const error = Math.abs(output - expectedOutput);
  return {
    name: "absolute_error",
    value: error,
    dataType: "numeric",
    comment: `Error: ${error.toFixed(2)}`
  };
};

// Boolean evaluator
const booleanEvaluator: Evaluator = async ({ output }) => {
  return {
    name: "is_valid",
    value: validateFormat(output),
    dataType: "boolean"
  };
};

// Evaluator with error handling
const robustEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    const score = complexCalculation(output, expectedOutput);
    return {
      name: "complex_score",
      value: score
    };
  } catch (error) {
    console.error("Evaluator failed:", error);
    throw error; // Will be caught and logged by experiment system
  }
};

// LLM-as-judge evaluator
const llmJudgeEvaluator: Evaluator = async ({ input, output, expectedOutput }) => {
  const judgmentPrompt = `
Input: ${input}
Expected: ${expectedOutput}
Actual: ${output}

Rate the quality from 0 to 1:
`;

  const judgment = await llm.evaluate(judgmentPrompt);

  return {
    name: "llm_judgment",
    value: parseFloat(judgment),
    comment: `LLM evaluation of output quality`
  };
};
```

869

870

### RunEvaluator

871

872

Function type for run-level evaluators that assess the entire experiment.

873

874

```typescript { .api }

875

/**

876

* Evaluator function for run-level evaluation

877

*

878

* Receives all item results and performs aggregate analysis

879

* across the entire experiment run.

880

*

881

* @param params - Parameters including all itemResults

882

* @returns Promise resolving to single Evaluation or array of Evaluations

883

*/

884

type RunEvaluator<

885

Input = any,

886

ExpectedOutput = any,

887

Metadata extends Record<string, any> = Record<string, any>

888

> = (

889

params: RunEvaluatorParams<Input, ExpectedOutput, Metadata>

890

) => Promise<Evaluation[] | Evaluation>;

891

892

type RunEvaluatorParams<

893

Input = any,

894

ExpectedOutput = any,

895

Metadata extends Record<string, any> = Record<string, any>

896

> = {

897

/**

898

* Results from all processed experiment items.

899

*

900

* Each item contains the input, output, evaluations, and metadata from

901

* processing a single data item. Use this for aggregate analysis,

902

* statistical calculations, and cross-item comparisons.

903

*/

904

itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];

905

};

906

```

907

908

**Usage Examples:**

909

910

```typescript

911

import type { RunEvaluator } from '@langfuse/client';

912

913

// Average score evaluator

914

const averageScoreEvaluator: RunEvaluator = async ({ itemResults }) => {

915

const scores = itemResults

916

.flatMap(r => r.evaluations)

917

.filter(e => e.name === "accuracy")

918

.map(e => e.value as number);

919

920

const average = scores.reduce((a, b) => a + b, 0) / scores.length;

921

922

return {

923

name: "average_accuracy",

924

value: average,

925

comment: `Average accuracy: ${(average * 100).toFixed(1)}%`

926

};

927

};

928

929

// Multiple run-level metrics

930

const comprehensiveRunEvaluator: RunEvaluator = async ({ itemResults }) => {

931

const accuracyScores = itemResults

932

.flatMap(r => r.evaluations)

933

.filter(e => e.name === "accuracy")

934

.map(e => e.value as number);

935

936

const average = accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;

937

const min = Math.min(...accuracyScores);

938

const max = Math.max(...accuracyScores);

939

const stdDev = calculateStdDev(accuracyScores);

940

941

return [

942

{

943

name: "average_accuracy",

944

value: average

945

},

946

{

947

name: "min_accuracy",

948

value: min

949

},

950

{

951

name: "max_accuracy",

952

value: max

953

},

954

{

955

name: "std_dev_accuracy",

956

      value: stdDev,
      comment: "Standard deviation of accuracy scores"
    }
  ];
};

// Precision and recall
const precisionRecallEvaluator: RunEvaluator = async ({ itemResults }) => {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const result of itemResults) {
    if (result.output === "positive") {
      if (result.expectedOutput === "positive") {
        truePositives++;
      } else {
        falsePositives++;
      }
    } else if (result.expectedOutput === "positive") {
      falseNegatives++;
    }
  }

  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = 2 * (precision * recall) / (precision + recall);

  return [
    {
      name: "precision",
      value: precision,
      comment: `Precision: ${(precision * 100).toFixed(1)}%`
    },
    {
      name: "recall",
      value: recall,
      comment: `Recall: ${(recall * 100).toFixed(1)}%`
    },
    {
      name: "f1_score",
      value: f1,
      comment: `F1 Score: ${(f1 * 100).toFixed(1)}%`
    }
  ];
};

// Category-based analysis
const categoryAnalysisEvaluator: RunEvaluator<any, any, { category: string }> = async ({
  itemResults
}) => {
  const categories = new Map<string, number[]>();

  for (const result of itemResults) {
    const category = result.item.metadata?.category || "unknown";
    const accuracy = result.evaluations.find(e => e.name === "accuracy")?.value as number;

    if (!categories.has(category)) {
      categories.set(category, []);
    }
    categories.get(category)!.push(accuracy);
  }

  const evaluations: Evaluation[] = [];

  for (const [category, scores] of categories) {
    const average = scores.reduce((a, b) => a + b, 0) / scores.length;
    evaluations.push({
      name: `accuracy_${category}`,
      value: average,
      comment: `Average accuracy for ${category}: ${(average * 100).toFixed(1)}%`
    });
  }

  return evaluations;
};

// Percentile analysis
const percentileEvaluator: RunEvaluator = async ({ itemResults }) => {
  const scores = itemResults
    .flatMap(r => r.evaluations)
    .filter(e => e.name === "score")
    .map(e => e.value as number)
    .sort((a, b) => a - b);

  const p50 = scores[Math.floor(scores.length * 0.5)];
  const p90 = scores[Math.floor(scores.length * 0.9)];
  const p95 = scores[Math.floor(scores.length * 0.95)];

  return [
    { name: "p50_score", value: p50, comment: "Median score" },
    { name: "p90_score", value: p90, comment: "90th percentile" },
    { name: "p95_score", value: p95, comment: "95th percentile" }
  ];
};

// Failure analysis
const failureAnalysisEvaluator: RunEvaluator = async ({ itemResults }) => {
  const failures = itemResults.filter(r => {
    const accuracy = r.evaluations.find(e => e.name === "accuracy")?.value;
    return accuracy === 0;
  });

  const failureRate = failures.length / itemResults.length;

  return {
    name: "failure_rate",
    value: failureRate,
    comment: `${failures.length} of ${itemResults.length} items failed (${(failureRate * 100).toFixed(1)}%)`
  };
};

// Cross-item consistency
const consistencyEvaluator: RunEvaluator = async ({ itemResults }) => {
  // Check if similar inputs produce similar outputs
  const consistency = analyzeConsistency(itemResults);

  return {
    name: "consistency_score",
    value: consistency,
    comment: "Consistency across similar inputs"
  };
};
```
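
The aggregate evaluators above lean on small numeric helpers (for example, a standard-deviation computation) that are assumed rather than shown. As one sketch of such a helper, a plain population standard deviation could look like this; `stdDev` is a hypothetical name, not part of the Langfuse API:

```typescript
// Hypothetical helper: population standard deviation of a list of scores.
// Assumes a non-empty input array.
function stdDev(values: number[]): number {
  // Mean of all values
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Average squared deviation from the mean
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```

A run evaluator would feed it the per-item scores it has already extracted, e.g. `stdDev(accuracyScores)`.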

### Evaluation

Result type for evaluations returned by evaluator functions.

```typescript { .api }
/**
 * Evaluation result from an evaluator
 *
 * Contains the score name, value, and optional metadata/comment.
 *
 * Defined as:
 *   type Evaluation = Pick<ScoreBody, "name" | "value" | "comment" | "metadata" | "dataType">;
 *
 * which expands to the following shape:
 */
interface Evaluation {
  /**
   * Name of the evaluation metric
   *
   * Should be descriptive and unique within the evaluator set.
   */
  name: string;

  /**
   * Numeric or boolean value of the evaluation
   *
   * Typically 0-1 for accuracy/similarity scores, but can be any numeric value.
   */
  value: number | boolean;

  /**
   * Optional human-readable comment about the evaluation
   *
   * Useful for explaining the score or providing context.
   */
  comment?: string;

  /**
   * Optional metadata about the evaluation
   *
   * Store additional context or debugging information.
   */
  metadata?: Record<string, any>;

  /**
   * Optional data type specification
   *
   * Specifies how the value should be interpreted.
   */
  dataType?: "numeric" | "boolean" | "categorical";
}
```

**Usage Examples:**

```typescript
import type { Evaluation } from '@langfuse/client';

// Simple numeric evaluation
const simpleEval: Evaluation = {
  name: "accuracy",
  value: 0.85
};

// Boolean evaluation
const booleanEval: Evaluation = {
  name: "passed",
  value: true,
  dataType: "boolean"
};

// Evaluation with comment
const commentedEval: Evaluation = {
  name: "similarity",
  value: 0.92,
  comment: "High similarity between output and expected"
};

// Evaluation with metadata
const metadataEval: Evaluation = {
  name: "response_quality",
  value: 0.88,
  metadata: {
    model: "gpt-4",
    temperature: 0.7,
    tokens: 150
  },
  comment: "Quality assessment using LLM judge"
};

// Multiple evaluation types
const multiEval: Evaluation[] = [
  {
    name: "exact_match",
    value: 1,
    dataType: "boolean"
  },
  {
    name: "similarity",
    value: 0.95,
    dataType: "numeric",
    comment: "Cosine similarity"
  },
  {
    name: "category",
    value: 0,
    dataType: "categorical",
    metadata: { predicted: "A", actual: "B" }
  }
];
```

### ExperimentResult

Complete result structure returned by the `run()` method.

```typescript { .api }
/**
 * Complete result of an experiment execution
 *
 * Contains all results from processing the experiment data,
 * including individual item results, run-level evaluations,
 * and utilities for result visualization.
 */
type ExperimentResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The experiment run name.
   *
   * Either the provided runName parameter or generated name (experiment name + timestamp).
   */
  runName: string;

  /**
   * ID of the dataset run in Langfuse (only for experiments on Langfuse datasets).
   *
   * Use this ID to access the dataset run via the Langfuse API or UI.
   */
  datasetRunId?: string;

  /**
   * URL to the dataset run in the Langfuse UI (only for experiments on Langfuse datasets).
   *
   * Direct link to view the complete dataset run in the Langfuse web interface.
   */
  datasetRunUrl?: string;

  /**
   * Results from processing each individual data item.
   *
   * Contains the complete results for every item in your experiment data.
   */
  itemResults: ExperimentItemResult<Input, ExpectedOutput, Metadata>[];

  /**
   * Results from run-level evaluators that assessed the entire experiment.
   *
   * Contains aggregate evaluations that analyze the complete experiment.
   */
  runEvaluations: Evaluation[];

  /**
   * Function to format experiment results in a human-readable format.
   *
   * @param options - Formatting options
   * @param options.includeItemResults - Whether to include individual item details (default: false)
   * @returns Promise resolving to formatted string representation
   */
  format: (options?: { includeItemResults?: boolean }) => Promise<string>;
};
```

**Usage Examples:**

```typescript
import fs from 'node:fs/promises';
import type { ExperimentResult } from '@langfuse/client';

// Run experiment and access results
const result: ExperimentResult = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: myTask,
  evaluators: [accuracyEvaluator],
  runEvaluators: [averageEvaluator]
});

// Access run name
console.log(`Run name: ${result.runName}`);
// "Test Experiment - 2024-01-15T10:30:00.000Z"

// Access individual item results
console.log(`Processed ${result.itemResults.length} items`);
for (const itemResult of result.itemResults) {
  console.log(`Input: ${itemResult.input}`);
  console.log(`Output: ${itemResult.output}`);
  console.log(`Evaluations:`, itemResult.evaluations);
}

// Access run-level evaluations
console.log(`Run evaluations:`, result.runEvaluations);
const avgAccuracy = result.runEvaluations.find(e => e.name === "average_accuracy");
console.log(`Average accuracy: ${avgAccuracy?.value}`);

// Format results (summary only)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (10 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
10 items
Evaluations:
• accuracy

Average Scores:
• accuracy: 0.850

Run Evaluations:
• average_accuracy: 0.850
  💭 Average accuracy: 85.0%
*/

// Format with detailed results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is AI?
   Expected: Artificial Intelligence
   Actual: Artificial Intelligence
   Scores:
   • accuracy: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
...

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
...
*/

// Access dataset run information (if applicable)
if (result.datasetRunId) {
  console.log(`Dataset run ID: ${result.datasetRunId}`);
  console.log(`View in UI: ${result.datasetRunUrl}`);
}

// Calculate custom metrics from results
const successRate = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 1)
).length / result.itemResults.length;
console.log(`Success rate: ${(successRate * 100).toFixed(1)}%`);

// Export results for further analysis
const exportData = result.itemResults.map(r => ({
  input: r.input,
  output: r.output,
  expectedOutput: r.expectedOutput,
  scores: Object.fromEntries(
    r.evaluations.map(e => [e.name, e.value])
  )
}));
await fs.writeFile('results.json', JSON.stringify(exportData, null, 2));
```

### ExperimentItemResult

Result structure for individual item processing within an experiment.

```typescript { .api }
/**
 * Result from processing one experiment item
 *
 * Contains the input, output, evaluations, and trace information
 * for a single data item.
 */
type ExperimentItemResult<
  Input = any,
  ExpectedOutput = any,
  Metadata extends Record<string, any> = Record<string, any>
> = {
  /**
   * The original experiment or dataset item that was processed.
   *
   * Contains the complete original item data.
   */
  item: ExperimentItem<Input, ExpectedOutput, Metadata>;

  /**
   * The input data (extracted from item for convenience)
   */
  input?: Input;

  /**
   * The expected output (extracted from item for convenience)
   */
  expectedOutput?: ExpectedOutput;

  /**
   * The actual output produced by the task.
   *
   * This is the result returned by your task function for this specific input.
   */
  output: any;

  /**
   * Results from all evaluators that ran on this item.
   *
   * Contains evaluation scores, comments, and metadata from each evaluator.
   */
  evaluations: Evaluation[];

  /**
   * Langfuse trace ID for this item's execution.
   *
   * Use this ID to view detailed execution traces in the Langfuse UI.
   */
  traceId?: string;

  /**
   * Dataset run ID if this item was part of a Langfuse dataset.
   *
   * Links this item result to a specific dataset run.
   */
  datasetRunId?: string;
};
```

**Usage Examples:**

```typescript
import type { ExperimentItemResult } from '@langfuse/client';

// Process experiment results
const result = await langfuse.experiment.run(config);

for (const itemResult of result.itemResults) {
  // itemResult is an ExperimentItemResult
  // Access item data
  console.log(`Processing item:`, itemResult.item);
  console.log(`Input:`, itemResult.input);
  console.log(`Expected:`, itemResult.expectedOutput);
  console.log(`Actual:`, itemResult.output);

  // Access evaluations
  for (const evaluation of itemResult.evaluations) {
    console.log(`${evaluation.name}: ${evaluation.value}`);
    if (evaluation.comment) {
      console.log(`  Comment: ${evaluation.comment}`);
    }
  }

  // Access trace information
  if (itemResult.traceId) {
    const traceUrl = await langfuse.getTraceUrl(itemResult.traceId);
    console.log(`View trace: ${traceUrl}`);
  }

  // Access dataset run information
  if (itemResult.datasetRunId) {
    console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  }
}

// Filter failed items
const failedItems = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && e.value === 0)
);
console.log(`Failed items: ${failedItems.length}`);

// Group by score
const highScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) >= 0.8)
);
const lowScoring = result.itemResults.filter(r =>
  r.evaluations.some(e => e.name === "accuracy" && (e.value as number) < 0.5)
);

// Analyze patterns
const errorPatterns = failedItems.map(r => ({
  input: r.input,
  output: r.output,
  expected: r.expectedOutput
}));
console.log("Error patterns:", errorPatterns);
```

## Integration with AutoEvals

Create Langfuse-compatible evaluators from AutoEvals library evaluators.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter handles parameter mapping and result formatting automatically.
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 *
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;
```

**Usage Examples:**

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Basic AutoEvals integration
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

await langfuse.experiment.run({
  name: "AutoEvals Integration Test",
  data: myDataset,
  task: myTask,
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

// With additional parameters
const customFactualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' } // Additional params for AutoEvals
);

await langfuse.experiment.run({
  name: "Factuality Test",
  data: testData,
  task: myTask,
  evaluators: [customFactualityEvaluator]
});

// Multiple AutoEvals evaluators
const closedQAEvaluator = createEvaluatorFromAutoevals(ClosedQA, {
  model: 'gpt-4',
  useCoT: true
});

const comprehensiveEvaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein),
  closedQAEvaluator
];

await langfuse.experiment.run({
  name: "Comprehensive Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: comprehensiveEvaluators
});

// Mixing AutoEvals and custom evaluators
await langfuse.experiment.run({
  name: "Mixed Evaluators",
  data: dataset,
  task: task,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluator
    async ({ output, expectedOutput }) => ({
      name: "exact_match",
      value: output === expectedOutput ? 1 : 0
    })
  ]
});
```

## Advanced Usage

### Type Safety with Generics

Use TypeScript generics for full type safety across the experiment pipeline.

```typescript
// Define your types
interface QuestionInput {
  question: string;
  context: string[];
}

interface AnswerOutput {
  answer: string;
  confidence: number;
  sources: string[];
}

interface ItemMetadata {
  category: "science" | "history" | "literature";
  difficulty: number;
  tags: string[];
}

// Type-safe experiment configuration
const result = await langfuse.experiment.run<
  QuestionInput,
  AnswerOutput,
  ItemMetadata
>({
  name: "Typed QA Experiment",
  data: [
    {
      input: {
        question: "What is photosynthesis?",
        context: ["Photosynthesis is the process..."]
      },
      expectedOutput: {
        answer: "A process where plants convert light to energy",
        confidence: 0.9,
        sources: ["biology textbook"]
      },
      metadata: {
        category: "science",
        difficulty: 5,
        tags: ["biology", "plants"]
      }
    }
  ],
  task: async ({ input, metadata }) => {
    // input is typed as QuestionInput
    // metadata is typed as ItemMetadata
    const { question, context } = input;
    const difficulty = metadata?.difficulty || 5;

    return await qaModel(question, context, difficulty);
    // Return type should match AnswerOutput
  },
  evaluators: [
    async ({ input, output, expectedOutput }) => {
      // All parameters are fully typed
      // input: QuestionInput
      // output: any (task output)
      // expectedOutput: AnswerOutput | undefined

      return {
        name: "answer_quality",
        value: output.confidence
      };
    }
  ]
});

// Result is typed as ExperimentResult<QuestionInput, AnswerOutput, ItemMetadata>
for (const itemResult of result.itemResults) {
  // itemResult.input is QuestionInput
  // itemResult.output is any
  // itemResult.expectedOutput is AnswerOutput | undefined
  console.log(itemResult.input.question);
  console.log(itemResult.expectedOutput?.confidence);
}
```

### Parallel vs Sequential Execution

Control experiment execution parallelism with `maxConcurrency`.

```typescript
// Fully parallel (default)
const parallelResult = await langfuse.experiment.run({
  name: "Parallel Execution",
  data: largeDataset,
  task: fastTask,
  evaluators: [evaluator]
  // maxConcurrency: Infinity (default)
});

// Sequential execution
const sequentialResult = await langfuse.experiment.run({
  name: "Sequential Execution",
  data: dataset,
  task: task,
  maxConcurrency: 1 // Process one item at a time
});

// Controlled parallelism
const controlledResult = await langfuse.experiment.run({
  name: "Rate Limited Execution",
  data: dataset,
  task: expensiveAPICall,
  maxConcurrency: 5 // Max 5 concurrent API calls
});

// Capped processing for very large datasets
const concurrencyLimit = 10;
const cappedResult = await langfuse.experiment.run({
  name: "Capped Processing",
  data: veryLargeDataset,
  task: task,
  maxConcurrency: concurrencyLimit // At most 10 items in flight at a time
});
```

### Dataset Integration

Run experiments directly on Langfuse datasets with automatic linking.

```typescript
// Get dataset
const dataset = await langfuse.dataset.get("my-dataset");

// Run experiment on dataset (automatic data parameter)
const result = await dataset.runExperiment({
  name: "GPT-4 Evaluation",
  task: async ({ input }) => {
    // Process dataset item
    return await model(input);
  },
  evaluators: [evaluator],
  runEvaluators: [averageEvaluator]
});

// Results are automatically linked to dataset run
console.log(`Dataset run ID: ${result.datasetRunId}`);
console.log(`View in UI: ${result.datasetRunUrl}`);

// Each item result is linked
for (const itemResult of result.itemResults) {
  console.log(`Dataset run ID: ${itemResult.datasetRunId}`);
  console.log(`Trace ID: ${itemResult.traceId}`);
}

// Compare multiple runs on same dataset
const run1 = await dataset.runExperiment({
  name: "Model A",
  runName: "model-a-run-1",
  task: modelA,
  evaluators: [evaluator]
});

const run2 = await dataset.runExperiment({
  name: "Model B",
  runName: "model-b-run-1",
  task: modelB,
  evaluators: [evaluator]
});

// Compare results
console.log("Model A avg:", run1.runEvaluations[0].value);
console.log("Model B avg:", run2.runEvaluations[0].value);
```

### Result Formatting

Use the `format()` function to generate human-readable result summaries.

```typescript
import fs from 'node:fs/promises';

const result = await langfuse.experiment.run({
  name: "Test Experiment",
  data: testData,
  task: task,
  evaluators: [evaluator],
  runEvaluators: [runEvaluator]
});

// Format summary (default)
const summary = await result.format();
console.log(summary);
/*
Individual Results: Hidden (50 items)
💡 Call format({ includeItemResults: true }) to view them

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
📋 Run name: Test Experiment - 2024-01-15T10:30:00.000Z
50 items
Evaluations:
• accuracy
• f1_score

Average Scores:
• accuracy: 0.850
• f1_score: 0.823

Run Evaluations:
• average_accuracy: 0.850
  💭 Average accuracy: 85.0%
• precision: 0.875
  💭 Precision: 87.5%

🔗 Dataset Run:
https://cloud.langfuse.com/project/xxx/datasets/yyy/runs/zzz
*/

// Format with detailed item results
const detailed = await result.format({ includeItemResults: true });
console.log(detailed);
/*
1. Item 1:
   Input: What is the capital of France?
   Expected: Paris
   Actual: Paris
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Dataset Item:
   https://cloud.langfuse.com/project/xxx/datasets/yyy/items/123

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/abc123

2. Item 2:
   Input: What is 2+2?
   Expected: 4
   Actual: 4
   Scores:
   • exact_match: 1.000
   • similarity: 1.000

   Trace:
   https://cloud.langfuse.com/project/xxx/traces/def456

... (50 items total)

──────────────────────────────────────────────────
🧪 Experiment: Test Experiment
... (summary as above)
*/

// Save formatted results to file
const formatted = await result.format({ includeItemResults: true });
await fs.writeFile('experiment-results.txt', formatted);

// Use in CI/CD: print the summary, then fail the pipeline on a missed threshold
console.log(summary);
if (result.runEvaluations.some(e => e.name === "average_accuracy" && (e.value as number) < 0.8)) {
  throw new Error("Experiment failed: accuracy below threshold");
}
```

### Error Handling Strategies

Implement robust error handling for production experiments.

```typescript
// Task with retry logic
const resilientTask: ExperimentTask = async ({ input }) => {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await apiCall(input);
    } catch (error) {
      lastError = error;
      await new Promise(resolve => setTimeout(resolve, 1000 * (attempt + 1)));
    }
  }
  throw lastError;
};

// Task with fallback
const fallbackTask: ExperimentTask = async ({ input }) => {
  try {
    return await primaryModel(input);
  } catch (error) {
    console.warn("Primary model failed, using fallback");
    return await fallbackModel(input);
  }
};

// Task with timeout
const timeoutTask: ExperimentTask = async ({ input }) => {
  return await Promise.race([
    modelCall(input),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), 30000)
    )
  ]);
};

// Evaluator with validation
const validatingEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  try {
    if (typeof output !== 'string' || typeof expectedOutput !== 'string') {
      throw new Error("Invalid output types");
    }

    return {
      name: "accuracy",
      value: output === expectedOutput ? 1 : 0
    };
  } catch (error) {
    console.error("Evaluator validation failed:", error);
    return {
      name: "accuracy",
      value: 0,
      comment: `Validation error: ${(error as Error).message}`
    };
  }
};

// Run experiment with error tracking
const result = await langfuse.experiment.run({
  name: "Resilient Experiment",
  data: testData,
  task: resilientTask,
  evaluators: [validatingEvaluator]
});

// Check for failures
const successCount = result.itemResults.length;
const totalCount = testData.length;
const failureCount = totalCount - successCount;

if (failureCount > 0) {
  console.warn(`${failureCount} items failed during experiment`);
}
```

## Best Practices

### Experiment Organization

```typescript
// ✅ Good: Descriptive naming
await langfuse.experiment.run({
  name: "GPT-4 vs GPT-3.5 on QA Dataset",
  runName: "gpt-4-2024-01-15-temp-0.7",
  description: "Comparing model performance with temperature 0.7",
  metadata: {
    model_version: "gpt-4-0125-preview",
    temperature: 0.7,
    dataset_version: "v2.1"
  },
  data: qaDataset,
  task: qaTask
});

// ❌ Bad: Generic naming
await langfuse.experiment.run({
  name: "Test",
  data: data,
  task: task
});
```

### Evaluator Design

```typescript
// ✅ Good: Multiple focused evaluators
const evaluators = [
  // Simple binary check
  async ({ output, expectedOutput }) => ({
    name: "exact_match",
    value: output === expectedOutput ? 1 : 0
  }),
  // Similarity score
  async ({ output, expectedOutput }) => ({
    name: "cosine_similarity",
    value: calculateCosineSimilarity(output, expectedOutput)
  }),
  // Format validation
  async ({ output }) => ({
    name: "format_valid",
    value: validateFormat(output) ? 1 : 0
  })
];

// ❌ Bad: One complex evaluator doing everything
const badEvaluator = async ({ output, expectedOutput }) => ({
  name: "score",
  value: complexCalculation(output, expectedOutput)
  // Unclear what this represents
});
```

### Concurrency Management

```typescript
// ✅ Good: Appropriate concurrency limits
await langfuse.experiment.run({
  name: "Rate-Limited API Experiment",
  data: largeDataset,
  task: expensiveAPICall,
  maxConcurrency: 5, // Respect API rate limits
  evaluators: [evaluator]
});

// ✅ Good: High concurrency for local operations
await langfuse.experiment.run({
  name: "Local Model Experiment",
  data: dataset,
  task: localModelInference,
  maxConcurrency: 50, // Local model can handle high concurrency
  evaluators: [evaluator]
});

// ❌ Bad: No concurrency control for rate-limited API
await langfuse.experiment.run({
  name: "Uncontrolled Experiment",
  data: largeDataset,
  task: rateLimitedAPI
  // Will likely hit rate limits
});
```

### Type Safety

```typescript
// ✅ Good: Explicit types
interface Input {
  question: string;
  context: string;
}

interface Output {
  answer: string;
  confidence: number;
}

const result = await langfuse.experiment.run<Input, Output>({
  name: "Typed Experiment",
  data: [
    {
      input: { question: "...", context: "..." },
      expectedOutput: { answer: "...", confidence: 0.9 }
    }
  ],
  task: async ({ input }) => {
    // input is typed as Input
    return await processTyped(input);
  }
});

// ❌ Bad: Implicit any types
const untypedResult = await langfuse.experiment.run({
  name: "Untyped Experiment",
  data: [{ input: someData }],
  task: async ({ input }) => {
    // input is any
    return await process(input);
  }
});
```

### Result Analysis

```typescript
// ✅ Good: Use run evaluators for aggregates
await langfuse.experiment.run({
  name: "Analysis Experiment",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator],
  runEvaluators: [
    async ({ itemResults }) => {
      // Calculate aggregate metrics
      const avg = calculateAverage(itemResults);
      const stdDev = calculateStdDev(itemResults);

      return [
        { name: "average", value: avg },
        { name: "std_dev", value: stdDev }
      ];
    }
  ]
});

// ❌ Bad: Manual aggregation after experiment
const result = await langfuse.experiment.run({
  name: "Manual Analysis",
  data: dataset,
  task: task,
  evaluators: [itemEvaluator]
});

// Manually calculating aggregates (should use run evaluators)
const scores = result.itemResults.map(r => r.evaluations[0].value);
const avg = scores.reduce((a, b) => a + b) / scores.length;
```

## Performance Considerations

### Batching and Concurrency

- Use `maxConcurrency` to control parallelism and avoid overwhelming external APIs
- Default `maxConcurrency: Infinity` is suitable for local operations
- Set `maxConcurrency: 1` for sequential processing when order matters
- Typical values: 3-10 for API calls, 20-100 for local operations
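
To make the concurrency cap concrete, here is a minimal, self-contained sketch of the pattern `maxConcurrency` implements: at most `limit` items in flight at once. `mapWithConcurrency` is a hypothetical helper written for illustration, not part of the Langfuse API; it assumes `limit >= 1`.

```typescript
// Run `fn` over `items` with at most `limit` promises in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Spawn up to `limit` workers; each repeatedly claims the next item.
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++; // claim an index (safe: single-threaded event loop)
        results[i] = await fn(items[i]);
      }
    },
  );

  await Promise.all(workers);
  return results;
}
```

Setting `limit` to 1 degenerates to sequential processing, and a very large `limit` approaches the fully parallel default, mirroring the `maxConcurrency` settings shown earlier.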

### Memory Management

- Large datasets are processed in batches based on `maxConcurrency`
- Each batch is processed completely before moving to the next
- Failed items are logged and skipped, not stored in memory
- Consider breaking very large experiments into multiple smaller runs
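
One way to apply the last point is to split the data and submit each part as its own run. The `chunk` helper below is a plain sketch, and the run names in the commented usage are illustrative assumptions, not Langfuse conventions:

```typescript
// Split an array into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Usage sketch (assumes `langfuse`, `veryLargeDataset`, `task`, and
// `evaluator` from the earlier examples):
//
// for (const [i, part] of chunk(veryLargeDataset, 500).entries()) {
//   await langfuse.experiment.run({
//     name: "Large Experiment",
//     runName: `large-experiment-part-${i + 1}`,
//     data: part,
//     task,
//     evaluators: [evaluator],
//   });
// }
```

Each part then gets its own run name, traces, and run-level evaluations, which also keeps per-run memory bounded.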

### Tracing Overhead

- OpenTelemetry tracing adds minimal overhead (~1-5ms per item)
- Traces are sent asynchronously and don't block experiment execution
- Disable tracing for maximum performance (though not recommended)
- Use `flush()` to ensure all traces are sent before shutdown

### Evaluator Performance

- Item-level evaluators run in parallel with task execution
- Failed evaluators don't block other evaluators
- LLM-as-judge evaluators can be slow; use `maxConcurrency` to control them
- Run-level evaluators execute sequentially after all items complete
