# AutoEvals Integration

The AutoEvals Integration provides a seamless adapter for using evaluators from the [AutoEvals library](https://github.com/braintrustdata/autoevals) with Langfuse experiments. This adapter handles parameter mapping and result formatting automatically, allowing you to leverage battle-tested evaluation metrics without writing custom evaluation code.

## Capabilities

### createEvaluatorFromAutoevals

Convert AutoEvals evaluators to Langfuse-compatible evaluator functions with automatic parameter mapping.

```typescript { .api }
/**
 * Converts an AutoEvals evaluator to a Langfuse-compatible evaluator function
 *
 * This adapter function bridges the gap between AutoEvals library evaluators
 * and Langfuse experiment evaluators, handling parameter mapping and result
 * formatting automatically.
 *
 * AutoEvals evaluators expect `input`, `output`, and `expected` parameters,
 * while Langfuse evaluators use `input`, `output`, and `expectedOutput`.
 * This function handles the parameter name mapping transparently.
 *
 * The adapter also transforms the AutoEvals result format (with `name`, `score`,
 * and `metadata` fields) to the Langfuse evaluation format (with `name`, `value`,
 * and `metadata` fields).
 *
 * @template E - Type of the AutoEvals evaluator function
 * @param autoevalEvaluator - The AutoEvals evaluator function to convert
 * @param params - Optional additional parameters to pass to the AutoEvals evaluator
 * @returns A Langfuse-compatible evaluator function
 */
function createEvaluatorFromAutoevals<E extends CallableFunction>(
  autoevalEvaluator: E,
  params?: Params<E>
): Evaluator;

/**
 * Utility type to extract parameter types from AutoEvals evaluator functions
 *
 * This type helper extracts the parameter type from an AutoEvals evaluator
 * and omits the standard parameters (input, output, expected) that are
 * handled by the adapter, leaving only the additional configuration parameters.
 *
 * @template E - The AutoEvals evaluator function type
 */
type Params<E> = Parameters<
  E extends (...args: any[]) => any ? E : never
>[0] extends infer P
  ? Omit<P, "input" | "output" | "expected">
  : never;
```

## Parameter Mapping

The adapter automatically handles the parameter name differences between AutoEvals and Langfuse:

| AutoEvals Parameter | Langfuse Parameter | Description |
|---------------------|--------------------|-------------|
| `input` | `input` | The input data passed to the task |
| `output` | `output` | The output produced by the task |
| `expected` | `expectedOutput` | The expected/ground-truth output |

Additional parameters specified in the `params` argument are passed through to the AutoEvals evaluator without modification.

## Result Transformation

The adapter transforms AutoEvals results to the Langfuse evaluation format:

```typescript
// AutoEvals result format
{
  name: string;
  score: number;
  metadata?: Record<string, any>;
}

// Transformed to Langfuse format
{
  name: string;
  value: number; // mapped from score; defaults to 0 if undefined
  metadata?: Record<string, any>;
}
```
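The mapping and transformation described above can be sketched as a small wrapper. This is a hypothetical re-implementation for illustration only, not the actual `@langfuse/client` source; `sketchAdapter` and its local types are made-up names:

```typescript
// Illustrative sketch of the adapter's behavior (not the real implementation):
// map Langfuse's expectedOutput to AutoEvals' expected on the way in,
// and AutoEvals' score to Langfuse's value on the way out.
type AutoevalsResult = { name: string; score?: number; metadata?: Record<string, unknown> };
type LangfuseEvaluation = { name: string; value: number; metadata?: Record<string, unknown> };

function sketchAdapter(
  autoevalEvaluator: (args: Record<string, unknown>) => AutoevalsResult | Promise<AutoevalsResult>,
  params: Record<string, unknown> = {}
) {
  return async (args: {
    input?: unknown;
    output?: unknown;
    expectedOutput?: unknown;
  }): Promise<LangfuseEvaluation> => {
    const result = await autoevalEvaluator({
      input: args.input,
      output: args.output,
      expected: args.expectedOutput, // parameter name mapping
      ...params,                     // extra config passed through unchanged
    });
    return {
      name: result.name,
      value: result.score ?? 0, // score -> value, defaulting to 0 if undefined
      metadata: result.metadata,
    };
  };
}
```

Wrapping a trivial exact-match function with `sketchAdapter` and calling it with `{ output: "4", expectedOutput: "4" }` would yield `{ name: "exact", value: 1 }`, which is the shape Langfuse experiments expect from an evaluator.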

## Usage Examples

### Basic Usage

Use AutoEvals evaluators directly with Langfuse experiments:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Create wrapped evaluators
const factualityEvaluator = createEvaluatorFromAutoevals(Factuality);
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Use in experiment (assumes an initialized `openai` client is in scope)
const result = await langfuse.experiment.run({
  name: "Capital Cities Test",
  data: [
    { input: "France", expectedOutput: "Paris" },
    { input: "Germany", expectedOutput: "Berlin" },
    { input: "Japan", expectedOutput: "Tokyo" }
  ],
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `What is the capital of ${input}?`
      }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [factualityEvaluator, levenshteinEvaluator]
});

console.log(await result.format());
```

### With Additional Parameters

Pass configuration parameters to AutoEvals evaluators:

```typescript
import { Factuality, ClosedQA, Battle } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure Factuality evaluator with custom model
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Configure ClosedQA with model and chain-of-thought
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  {
    model: 'gpt-4-turbo',
    useCoT: true // Enable chain-of-thought reasoning
  }
);

// Configure Battle evaluator for model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  {
    model: 'gpt-4',
    instructions: 'Compare which response is more accurate and helpful'
  }
);

await langfuse.experiment.run({
  name: "Configured Evaluators Test",
  data: qaDataset,
  task: myTask,
  evaluators: [
    factualityEvaluator,
    closedQAEvaluator,
    battleEvaluator
  ]
});
```

### Common AutoEvals Evaluators

Examples using popular AutoEvals evaluators:

```typescript
import {
  Factuality,
  Levenshtein,
  ClosedQA,
  Battle,
  Humor,
  Security,
  Sql,
  ValidJson,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Text similarity and accuracy
const levenshteinEvaluator = createEvaluatorFromAutoevals(Levenshtein);

// Factuality checking (requires OpenAI)
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Closed-domain QA evaluation
const closedQAEvaluator = createEvaluatorFromAutoevals(
  ClosedQA,
  { model: 'gpt-4o' }
);

// Model comparison
const battleEvaluator = createEvaluatorFromAutoevals(
  Battle,
  { model: 'gpt-4' }
);

// Humor detection
const humorEvaluator = createEvaluatorFromAutoevals(
  Humor,
  { model: 'gpt-4o' }
);

// Security checking
const securityEvaluator = createEvaluatorFromAutoevals(
  Security,
  { model: 'gpt-4o' }
);

// SQL validation
const sqlEvaluator = createEvaluatorFromAutoevals(Sql);

// JSON validation
const jsonEvaluator = createEvaluatorFromAutoevals(ValidJson);

// Answer relevancy
const relevancyEvaluator = createEvaluatorFromAutoevals(
  AnswerRelevancy,
  { model: 'gpt-4o' }
);

// Use multiple evaluators for comprehensive assessment
await langfuse.experiment.run({
  name: "Comprehensive QA Evaluation",
  data: qaDataset,
  task: qaTask,
  evaluators: [
    levenshteinEvaluator,
    factualityEvaluator,
    closedQAEvaluator,
    relevancyEvaluator
  ]
});
```

### With Langfuse Datasets

Use AutoEvals evaluators when running experiments on Langfuse datasets:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Fetch dataset from Langfuse
const dataset = await langfuse.dataset.get("qa-evaluation-dataset");

// Run experiment with AutoEvals evaluators
const result = await dataset.runExperiment({
  name: "GPT-4 QA Evaluation",
  description: "Evaluating GPT-4 performance on QA dataset",
  task: async ({ input }) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }]
    });
    return response.choices[0].message.content;
  },
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

console.log(`Dataset Run URL: ${result.datasetRunUrl}`);
console.log(await result.format());
```

### Combining AutoEvals and Custom Evaluators

Mix AutoEvals evaluators with your own custom evaluation logic:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluator } from '@langfuse/client';

// Custom evaluator
const exactMatchEvaluator: Evaluator = async ({ output, expectedOutput }) => ({
  name: "exact_match",
  value: output === expectedOutput ? 1 : 0,
  comment: output === expectedOutput ? "Perfect match" : "No match"
});

// Custom evaluator with metadata
const lengthEvaluator: Evaluator = async ({ output, expectedOutput }) => {
  const outputLen = output?.length || 0;
  const expectedLen = expectedOutput?.length || 0;
  const lengthDiff = Math.abs(outputLen - expectedLen);

  return {
    name: "length_similarity",
    value: 1 - (lengthDiff / Math.max(outputLen, expectedLen, 1)),
    metadata: {
      outputLength: outputLen,
      expectedLength: expectedLen,
      difference: lengthDiff
    }
  };
};

// Custom evaluator returning multiple evaluations
const comprehensiveCustomEvaluator: Evaluator = async ({
  input,
  output,
  expectedOutput
}) => {
  return [
    {
      name: "contains_expected",
      value: output.includes(expectedOutput) ? 1 : 0
    },
    {
      name: "case_sensitive_match",
      value: output === expectedOutput ? 1 : 0
    },
    {
      name: "case_insensitive_match",
      value: output.toLowerCase() === expectedOutput.toLowerCase() ? 1 : 0
    }
  ];
};

// Combine everything
await langfuse.experiment.run({
  name: "Mixed Evaluators Experiment",
  data: dataset,
  task: myTask,
  evaluators: [
    // AutoEvals evaluators
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    // Custom evaluators
    exactMatchEvaluator,
    lengthEvaluator,
    comprehensiveCustomEvaluator
  ]
});
```

### Advanced: Domain-Specific Evaluations

Configure AutoEvals evaluators for specific domains:

```typescript
import {
  Factuality,
  ClosedQA,
  Security,
  Sql,
  ValidJson,
  Humor,
  AnswerRelevancy
} from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Medical QA evaluation
const medicalQAEvaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o'
    // Additional context can be provided through metadata
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Medical QA Evaluation",
  description: "Evaluating medical question answering accuracy",
  data: medicalQADataset,
  task: medicalQATask,
  evaluators: medicalQAEvaluators
});

// Code generation evaluation
const codeGenerationEvaluators = [
  createEvaluatorFromAutoevals(Security, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ValidJson), // If generating JSON
  createEvaluatorFromAutoevals(Sql) // If generating SQL
];

await langfuse.experiment.run({
  name: "Code Generation Quality",
  description: "Evaluating generated code for security and validity",
  data: codeGenDataset,
  task: codeGenTask,
  evaluators: codeGenerationEvaluators
});

// Creative writing evaluation
const creativeWritingEvaluators = [
  createEvaluatorFromAutoevals(Humor, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(AnswerRelevancy, { model: 'gpt-4o' })
];

await langfuse.experiment.run({
  name: "Creative Writing Assessment",
  description: "Evaluating creative writing quality",
  data: writingPromptsDataset,
  task: writingTask,
  evaluators: creativeWritingEvaluators
});
```

### Parallel Evaluation with Concurrency Control

Run experiments with AutoEvals evaluators and concurrency limits:

```typescript
import { Factuality, Levenshtein, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Large Scale Evaluation",
  data: largeDataset, // 1000+ items
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein),
    createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })
  ],
  maxConcurrency: 10 // Limit concurrent task executions
});

// Evaluators run in parallel for each item,
// but only 10 items are processed concurrently
console.log(`Processed ${result.itemResults.length} items`);
console.log(await result.format());
```

## Integration Patterns

### Pattern 1: Standard AutoEvals Integration

The most common pattern for using AutoEvals evaluators:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

const langfuse = new LangfuseClient();

// Step 1: Wrap AutoEvals evaluators
const evaluators = [
  createEvaluatorFromAutoevals(Factuality),
  createEvaluatorFromAutoevals(Levenshtein)
];

// Step 2: Run experiment
const result = await langfuse.experiment.run({
  name: "My Experiment",
  data: myData,
  task: myTask,
  evaluators
});

// Step 3: Review results
console.log(await result.format());
```

### Pattern 2: Configured AutoEvals Integration

Use when you need to pass custom parameters to AutoEvals:

```typescript
import { Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Configure evaluators with custom parameters
const evaluators = [
  createEvaluatorFromAutoevals(Factuality, {
    model: 'gpt-4o'
    // model will be passed to the AutoEvals Factuality evaluator
  }),
  createEvaluatorFromAutoevals(ClosedQA, {
    model: 'gpt-4-turbo',
    useCoT: true
  })
];

await langfuse.experiment.run({
  name: "Configured Evaluation",
  data: myData,
  task: myTask,
  evaluators
});
```

### Pattern 3: Hybrid Evaluation Strategy

Combine AutoEvals evaluators with custom evaluation logic:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, Evaluation } from '@langfuse/client';

const hybridEvaluators = [
  // Use AutoEvals for complex evaluations
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),

  // Use custom evaluators for domain-specific logic
  async ({ output, expectedOutput, metadata }): Promise<Evaluation> => ({
    name: "business_rule_check",
    value: checkBusinessRules(output, metadata) ? 1 : 0,
    comment: "Domain-specific business rule validation"
  })
];

await langfuse.experiment.run({
  name: "Hybrid Evaluation",
  data: myData,
  task: myTask,
  evaluators: hybridEvaluators
});
```

### Pattern 4: Progressive Evaluation

Start with simple evaluators and add more complex ones:

```typescript
import { Levenshtein, Factuality, ClosedQA } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Phase 1: Quick evaluation with simple metrics
const quickEvaluators = [
  createEvaluatorFromAutoevals(Levenshtein)
];

const quickResult = await langfuse.experiment.run({
  name: "Quick Evaluation - Phase 1",
  data: myData,
  task: myTask,
  evaluators: quickEvaluators
});

// Analyze quick results...
console.log(await quickResult.format());

// Phase 2: Deep evaluation with LLM-based metrics
const deepEvaluators = [
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o', useCoT: true })
];

const deepResult = await langfuse.experiment.run({
  name: "Deep Evaluation - Phase 2",
  data: myData,
  task: myTask,
  evaluators: deepEvaluators
});

console.log(await deepResult.format());
```

## Best Practices

### 1. Choose Appropriate Evaluators

Select AutoEvals evaluators that match your evaluation needs:

```typescript
// For factual accuracy - use Factuality
createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' })

// For text similarity - use Levenshtein
createEvaluatorFromAutoevals(Levenshtein)

// For closed-domain QA - use ClosedQA
createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' })

// For comparing two outputs - use Battle
createEvaluatorFromAutoevals(Battle, { model: 'gpt-4' })

// For code validation - use Sql, ValidJson, etc.
createEvaluatorFromAutoevals(ValidJson)
```

### 2. Configure Model Parameters

Always specify model parameters for LLM-based AutoEvals evaluators:

```typescript
// Good: explicit model configuration
const evaluator = createEvaluatorFromAutoevals(Factuality, {
  model: 'gpt-4o'
});

// Less ideal: relying on defaults (may vary)
const evaluatorWithDefaults = createEvaluatorFromAutoevals(Factuality);
```

### 3. Mix Evaluator Types

Combine different types of evaluators for comprehensive assessment:

```typescript
const evaluators = [
  // Fast, deterministic evaluators
  createEvaluatorFromAutoevals(Levenshtein),
  createEvaluatorFromAutoevals(ValidJson),

  // LLM-based evaluators
  createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
  createEvaluatorFromAutoevals(ClosedQA, { model: 'gpt-4o' }),

  // Custom domain-specific evaluators
  customBusinessLogicEvaluator
];
```

### 4. Handle Evaluation Costs

Be mindful of API costs when using LLM-based AutoEvals evaluators:

```typescript
// For large datasets, start with cheaper evaluators
const result = await langfuse.experiment.run({
  name: "Cost-Conscious Evaluation",
  data: largeDataset,
  task: myTask,
  evaluators: [
    // Free/cheap evaluators
    createEvaluatorFromAutoevals(Levenshtein),

    // Use GPT-4 selectively, or use cheaper models
    createEvaluatorFromAutoevals(Factuality, {
      model: 'gpt-3.5-turbo' // Cheaper alternative
    })
  ],
  maxConcurrency: 5 // Control rate limiting
});
```

### 5. Understand Parameter Mapping

Remember that the adapter automatically maps parameters:

```typescript
// Your Langfuse data
const data = [
  {
    input: "What is 2+2?",
    expectedOutput: "4" // Note: expectedOutput (Langfuse format)
  }
];

// AutoEvals receives:
// {
//   input: "What is 2+2?",
//   output: <task result>,
//   expected: "4" // Automatically mapped from expectedOutput
// }

const evaluator = createEvaluatorFromAutoevals(Factuality);
```

### 6. Test Evaluators Individually

Test AutoEvals evaluators with sample data before full experiments:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Create evaluator
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);

// Test with sample data
const testResult = await langfuse.experiment.run({
  name: "Evaluator Test",
  data: [
    { input: "Test input", expectedOutput: "Test output" }
  ],
  task: async () => "Test result",
  evaluators: [factualityEvaluator]
});

console.log(await testResult.format());
// Verify the evaluator works as expected before scaling up
```

### 7. Monitor Evaluation Results

Track evaluation scores across experiments:

```typescript
const result = await langfuse.experiment.run({
  name: "Production Evaluation",
  data: productionDataset,
  task: productionTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// Analyze scores
const factualityScores = result.itemResults
  .flatMap(r => r.evaluations)
  .filter(e => e.name === 'Factuality')
  .map(e => e.value);

const avgFactuality =
  factualityScores.reduce((a, b) => a + b, 0) / factualityScores.length;

console.log(`Average Factuality Score: ${avgFactuality}`);

// View detailed results in the Langfuse UI
if (result.datasetRunUrl) {
  console.log(`View results: ${result.datasetRunUrl}`);
}
```

## Type Safety

The adapter provides full TypeScript type safety through the `Params<E>` utility type:

```typescript
import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

// Type-safe parameter inference
const evaluator = createEvaluatorFromAutoevals(
  Factuality,
  {
    model: 'gpt-4o', // ✓ Valid parameter
    temperature: 0.7, // ✓ Valid parameter (if supported by Factuality)
    // @ts-expect-error: input/output/expected are handled by the adapter
    input: "test", // ✗ Error: input is omitted from params
    output: "test", // ✗ Error: output is omitted from params
    expected: "test" // ✗ Error: expected is omitted from params
  }
);

// The Params<E> type automatically:
// 1. Extracts the parameter type from the evaluator function
// 2. Omits the 'input', 'output', and 'expected' fields
// 3. Leaves only the additional configuration parameters
```

## Error Handling

The adapter handles evaluation failures gracefully:

```typescript
import { Factuality, Levenshtein } from 'autoevals';
import { createEvaluatorFromAutoevals } from '@langfuse/client';

const result = await langfuse.experiment.run({
  name: "Error Handling Test",
  data: myData,
  task: myTask,
  evaluators: [
    createEvaluatorFromAutoevals(Factuality, { model: 'gpt-4o' }),
    createEvaluatorFromAutoevals(Levenshtein)
  ]
});

// If one evaluator fails, the others continue;
// failed evaluations are omitted from the results
result.itemResults.forEach(item => {
  console.log(`Item evaluations: ${item.evaluations.length}`);
  // May have fewer evaluations if some failed
});
```

## Requirements

To use the AutoEvals integration, you need:

1. **Install AutoEvals**: `npm install autoevals`
2. **Install Langfuse Client**: `npm install @langfuse/client`
3. **API Keys**: Configure API keys for LLM-based evaluators (e.g., an OpenAI API key for Factuality, ClosedQA, etc.)

```typescript
// Set up environment variables for LLM-based evaluators
// export OPENAI_API_KEY=your_openai_api_key

import { Factuality } from 'autoevals';
import { createEvaluatorFromAutoevals, LangfuseClient } from '@langfuse/client';

// LLM-based evaluators will use OPENAI_API_KEY from the environment
const factualityEvaluator = createEvaluatorFromAutoevals(
  Factuality,
  { model: 'gpt-4o' }
);
```

## Related Documentation

- [Experiment Execution](/docs/experiments.md) - Complete experiment system documentation
- [Evaluator Types](/docs/experiments.md#evaluators) - Understanding evaluator functions
- [Dataset Management](/docs/datasets.md) - Working with Langfuse datasets
- [AutoEvals Library](https://github.com/braintrustdata/autoevals) - Official AutoEvals documentation

## Summary

The AutoEvals adapter provides:

- **Automatic Parameter Mapping**: Transparently maps Langfuse parameters to the AutoEvals format
- **Result Transformation**: Converts AutoEvals results to the Langfuse evaluation format
- **Type Safety**: Full TypeScript support with the `Params<E>` utility type
- **Seamless Integration**: Works with both `langfuse.experiment.run()` and `dataset.runExperiment()`
- **Flexible Configuration**: Pass custom parameters to AutoEvals evaluators
- **Hybrid Evaluation**: Mix AutoEvals and custom evaluators in the same experiment

This adapter enables you to leverage the comprehensive suite of AutoEvals metrics without writing custom evaluation code, while maintaining full compatibility with Langfuse's experiment system.
