# DeepEval

A comprehensive Python framework for evaluating and testing large language model (LLM) systems. DeepEval provides 50+ research-backed metrics for evaluating RAG pipelines, chatbots, AI agents, and other LLM applications. It works like pytest but is specialized for LLM evaluation, and it supports both end-to-end and component-level testing.

## Package Information

- **Package Name**: deepeval
- **Language**: Python
- **Installation**: `pip install -U deepeval`
- **Minimum Python Version**: 3.9

## Core Imports

```python
import deepeval
```

Common imports for evaluation:

```python
from deepeval import evaluate, assert_test
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import GEval, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.dataset import EvaluationDataset, Golden
```

## Basic Usage

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create a test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    expected_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Create and run a metric
metric = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [metric])
```

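Evaluating multiple test cases:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def your_llm_app(prompt: str) -> tuple[str, list[str]]:
    """Placeholder: replace with a call into your application.

    Returns the generated answer and the context chunks it retrieved.
    """
    return "30-day full refund", ["All customers are eligible for a 30 day full refund."]


# Create a dataset
dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the refund policy?", expected_output="30-day full refund"),
        Golden(input="How do I return items?", expected_output="Contact support for return label")
    ]
)

# Generate a test case for each golden
for golden in dataset.goldens:
    actual_output, retrieval_context = your_llm_app(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        expected_output=golden.expected_output,
        # FaithfulnessMetric checks the output against the retrieved context
        retrieval_context=retrieval_context
    )
    dataset.add_test_case(test_case)

# Evaluate every test case in the dataset
evaluate(dataset.test_cases, [FaithfulnessMetric()])
```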

## Architecture

DeepEval's architecture consists of several key layers:

- **Test Cases**: Structured containers for inputs, outputs, and context (`LLMTestCase`, `ConversationalTestCase`, `MLLMTestCase`)
- **Metrics**: Evaluation criteria powered by LLMs, statistical methods, or NLP models (50+ built-in metrics)
- **Datasets**: Collections of test cases and "golden" examples for batch evaluation
- **Synthesizer**: Generates synthetic test data using various evolution strategies
- **Tracing**: Component-level observability with the `@observe` decorator for nested evaluations
- **Models**: Abstraction layer supporting 15+ LLM providers (OpenAI, Anthropic, local models, etc.)
- **Integrations**: Native support for LangChain, LlamaIndex, CrewAI, and PydanticAI

## Capabilities

### Test Cases

Test cases are structured containers representing LLM interactions to be evaluated. DeepEval supports standard LLM tests, multi-turn conversations, multimodal inputs, and arena-style comparisons.

```python { .api }
class LLMTestCase:
    """
    Represents a test case for evaluating LLM outputs.

    Parameters:
    - input (str): Input prompt to the LLM
    - actual_output (str, optional): Actual output from the LLM
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context information
    - retrieval_context (List[str], optional): Retrieved context for RAG
    - additional_metadata (Dict, optional): Additional metadata
    - tools_called (List[ToolCall], optional): Tools called by the LLM
    - expected_tools (List[ToolCall], optional): Expected tools to be called
    - comments (str, optional): Comments about the test case
    - name (str, optional): Name of the test case
    - tags (List[str], optional): Tags for organization
    """

class ConversationalTestCase:
    """
    Represents a multi-turn conversational test case.

    Parameters:
    - turns (List[Turn]): List of conversation turns
    - scenario (str, optional): Scenario description
    - context (List[str], optional): Context information
    - expected_outcome (str, optional): Expected outcome
    - name (str, optional): Name of the test case
    """

class MLLMTestCase:
    """
    Represents a test case for multimodal LLMs (text + images).

    Parameters:
    - input (List[Union[str, MLLMImage]]): Input with text and images
    - actual_output (List[Union[str, MLLMImage]]): Actual output
    - expected_output (List[Union[str, MLLMImage]], optional): Expected output
    - context (List[Union[str, MLLMImage]], optional): Context
    """
```

[Test Cases](./test-cases.md)

### Core Evaluation

Core evaluation functions for running metrics against test cases, either individually or in batch. Supports pytest integration, standalone evaluation, and comparison between models.

```python { .api }
def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase], List[MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult:
    """
    Evaluates test cases against specified metrics.

    Returns:
    - EvaluationResult: Contains test results, Confident AI link, and test run ID
    """

def assert_test(
    test_case: Optional[Union[LLMTestCase, ConversationalTestCase, MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    run_async: bool = True
):
    """
    Asserts that a single test case passes specified metrics.

    Raises:
    - AssertionError: If metrics fail
    """

def compare(
    test_cases: List[List[LLMTestCase]],
    metrics: List[BaseMetric]
) -> ComparisonResult:
    """
    Compares multiple test results to determine which performs better.
    """
```

[Core Evaluation](./core-evaluation.md)

### RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems, measuring answer quality, faithfulness to context, and retrieval effectiveness.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    """

class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the context (no hallucinations).
    """

class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed.
    """

class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.
    """

class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.
    """
```

[RAG Metrics](./rag-metrics.md)
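
To make the ranking idea behind `ContextualPrecisionMetric` concrete, the sketch below computes a rank-weighted precision over a list of per-node relevance flags. It is an illustration only: the function name `rank_weighted_precision` is hypothetical, the relevance flags are supplied by hand (in DeepEval such judgments come from an evaluation model), and the formula is a standard average-precision-style score assumed here for illustration, not necessarily the library's exact computation.

```python
from typing import List


def rank_weighted_precision(relevance: List[bool]) -> float:
    """Average-precision-style score over ranked context nodes.

    relevance[i] is True when the node at rank i + 1 is relevant.
    Hypothetical helper for illustration; not part of DeepEval.
    """
    relevant_seen = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision at this rank, counted only at relevant positions
            total += relevant_seen / rank
    if relevant_seen == 0:
        return 0.0
    return total / relevant_seen


# Relevant nodes ranked first score higher than the same nodes ranked last.
good_ranking = rank_weighted_precision([True, True, False])
poor_ranking = rank_weighted_precision([False, True, True])
print(f"{good_ranking:.3f} vs {poor_ranking:.3f}")
```

Both rankings contain the same two relevant nodes, so the difference in score comes purely from their positions.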

### Content Quality Metrics

Metrics for evaluating content safety, quality, and compliance, detecting issues like hallucinations, bias, toxicity, and PII leakage.

```python { .api }
class HallucinationMetric:
    """
    Detects hallucinations in the output.

    Parameters:
    - threshold (float): Success threshold
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    """

class BiasMetric:
    """
    Detects bias in the output.
    """

class ToxicityMetric:
    """
    Detects toxic content in the output.
    """

class SummarizationMetric:
    """
    Evaluates the quality of summaries.
    """

class PIILeakageMetric:
    """
    Detects personally identifiable information (PII) leakage.
    """
```

[Content Quality Metrics](./content-quality-metrics.md)
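
One detail worth keeping in mind when setting `threshold`: for quality scores a higher value is better, while for detection-style scores such as bias or toxicity a lower value is typically better, so the threshold acts as a maximum rather than a minimum. The sketch below models that distinction with a hypothetical `passes` helper and a hand-written set of metric names. It assumes this convention and is not DeepEval code, so check each metric's documentation for its actual direction.

```python
# Hypothetical grouping for illustration; confirm each metric's direction in its docs.
LOWER_IS_BETTER = {"bias", "toxicity", "hallucination"}


def passes(metric_name: str, score: float, threshold: float = 0.5) -> bool:
    """Return True when a score satisfies its threshold.

    Detection-style metrics pass at or below the threshold;
    all other metrics pass at or above it.
    """
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold


print(passes("toxicity", 0.1))       # low toxicity score
print(passes("summarization", 0.1))  # low summary-quality score
```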

### Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement.

```python { .api }
class ToolCorrectnessMetric:
    """
    Evaluates whether the correct tools were called with correct parameters.

    Parameters:
    - threshold (float): Success threshold
    """

class TaskCompletionMetric:
    """
    Evaluates whether the task was completed successfully.
    """

class ToolUseMetric:
    """
    Evaluates appropriate use of available tools.
    """

class PlanQualityMetric:
    """
    Evaluates the quality of generated plans.
    """

class GoalAccuracyMetric:
    """
    Measures accuracy in achieving specified goals.
    """
```

[Agentic Metrics](./agentic-metrics.md)
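
As a rough illustration of what comparing `tools_called` against `expected_tools` can look like, the sketch below scores the fraction of expected tool names that were actually called. The function `tool_name_overlap` and the plain-string tool names are assumptions made for this example; DeepEval's `ToolCorrectnessMetric` works on `ToolCall` objects and may also compare parameters and outputs.

```python
from typing import List


def tool_name_overlap(tools_called: List[str], expected_tools: List[str]) -> float:
    """Fraction of expected tools that appear among the tools called.

    Hypothetical helper: compares tool names only, ignoring order,
    duplicates, and call arguments.
    """
    expected = set(expected_tools)
    if not expected:
        # Nothing was expected, so there is nothing to get wrong.
        return 1.0
    return len(expected & set(tools_called)) / len(expected)


score = tool_name_overlap(
    tools_called=["search_orders", "send_email"],
    expected_tools=["search_orders", "issue_refund"],
)
print(score >= 0.7)  # compare against a success threshold
```

Here only one of the two expected tools was called, so the score falls below the example threshold of 0.7.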

### Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, and role adherence.

```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - threshold (float): Success threshold
    """

class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.
    """

class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.
    """

class RoleAdherenceMetric:
    """
    Measures adherence to assigned role in conversations.
    """
```

[Conversational Metrics](./conversational-metrics.md)

### Multimodal Metrics

Metrics for evaluating multimodal LLM outputs involving text and images, including generation quality and contextual understanding.

```python { .api }
class MultimodalGEval:
    """
    G-Eval for multimodal test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[MLLMTestCaseParams]): Parameters to evaluate
    """

class TextToImageMetric:
    """
    Evaluates text-to-image generation quality.
    """

class ImageCoherenceMetric:
    """
    Evaluates coherence of images in context.
    """

class MultimodalAnswerRelevancyMetric:
    """
    Answer relevancy for multimodal inputs.
    """

class MultimodalFaithfulnessMetric:
    """
    Faithfulness for multimodal outputs.
    """
```

[Multimodal Metrics](./multimodal-metrics.md)

### Custom Metrics

Framework for creating custom evaluation metrics with G-Eval, with DAG (Deep Acyclic Graph) metrics, or by extending the base metric classes.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - threshold (float): Success threshold (default: 0.5)
    """

class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold
    """

class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Abstract Methods:
    - measure(test_case: LLMTestCase) -> float
    - a_measure(test_case: LLMTestCase) -> float
    - is_successful() -> bool
    """
```

[Custom Metrics](./custom-metrics.md)

### Datasets

Tools for managing collections of test cases and "golden" examples, supporting batch evaluation, synthetic data generation, and dataset persistence.

```python { .api }
class EvaluationDataset:
    """
    Manages collections of test cases and goldens for evaluation.

    Parameters:
    - goldens (Union[List[Golden], List[ConversationalGolden]]): Initial goldens

    Methods:
    - add_test_case(test_case): Add a test case
    - add_golden(golden): Add a golden
    - generate_goldens_from_docs(document_paths, ...): Generate goldens from documents
    - evaluate(metrics): Evaluate with metrics
    - push(alias): Push to Confident AI
    - pull(alias): Pull from Confident AI
    """

class Golden:
    """
    Represents a "golden" test case: an input with its expected output.

    Parameters:
    - input (str): Input prompt
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context
    - retrieval_context (List[str], optional): Retrieved context
    """

class ConversationalGolden:
    """
    Represents a "golden" conversational test case.

    Parameters:
    - scenario (str): Scenario description
    - expected_outcome (str, optional): Expected outcome
    - turns (List[Turn], optional): Conversation turns
    """
```

[Datasets](./dataset.md)

### Models

Model abstraction layer supporting 15+ LLM providers, multimodal models, and embedding models with a unified interface.

```python { .api }
class DeepEvalBaseLLM:
    """
    Base class for LLM integrations.

    Abstract Methods:
    - generate(prompt: str) -> str
    - a_generate(prompt: str) -> str
    - get_model_name() -> str
    """

class GPTModel:
    """
    OpenAI GPT model integration.

    Parameters:
    - model (str): Model name (e.g., "gpt-4", "gpt-3.5-turbo")
    - api_key (str, optional): OpenAI API key
    """

class AnthropicModel:
    """
    Anthropic Claude integration.
    """

class GeminiModel:
    """
    Google Gemini integration.
    """

class OllamaModel:
    """
    Ollama model integration for local models.
    """

class DeepEvalBaseMLLM:
    """
    Base class for multimodal LLM integrations.
    """
```

[Models](./models.md)

### Synthesizer

Synthetic test data generation using various evolution strategies (reasoning, multi-context, concretizing, etc.) to create diverse and challenging test cases.

```python { .api }
class Synthesizer:
    """
    Generates synthetic test data and goldens.

    Parameters:
    - model (Union[str, DeepEvalBaseLLM], optional): Model for generation
    - async_mode (bool): Async mode (default: True)
    - filtration_config (FiltrationConfig, optional): Filtration configuration
    - evolution_config (EvolutionConfig, optional): Evolution configuration
    - styling_config (StylingConfig, optional): Styling configuration

    Methods:
    - generate_goldens_from_docs(document_paths, ...) -> List[Golden]
    - generate_goldens_from_contexts(contexts, ...) -> List[Golden]
    - generate_goldens_from_scratch(num_goldens, ...) -> List[Golden]
    - save_as(file_type, directory, ...)
    """

class Evolution:
    """
    Enum of input evolution strategies.

    Values:
    - REASONING: Add reasoning complexity
    - MULTICONTEXT: Require multiple contexts
    - CONCRETIZING: Make more concrete
    - CONSTRAINED: Add constraints
    - COMPARATIVE: Add comparisons
    - HYPOTHETICAL: Make hypothetical
    """
```

[Synthesizer](./synthesizer.md)
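
To illustrate how a set of evolution strategies might be applied, the sketch below samples a sequence of strategies according to per-strategy weights; each sampled strategy would then be used to rewrite a generated input. Everything here is a stand-in: the local `Evolution` enum only mirrors the names listed above, and `pick_evolutions` is a hypothetical helper, not the Synthesizer's implementation.

```python
import random
from enum import Enum
from typing import Dict, List


class Evolution(Enum):
    """Stand-in for the strategy names listed above (not DeepEval's enum)."""
    REASONING = "reasoning"
    MULTICONTEXT = "multicontext"
    CONCRETIZING = "concretizing"
    CONSTRAINED = "constrained"
    COMPARATIVE = "comparative"
    HYPOTHETICAL = "hypothetical"


def pick_evolutions(
    weights: Dict[Evolution, float], num_evolutions: int, seed: int = 0
) -> List[Evolution]:
    """Sample a sequence of strategies, favoring those with larger weights."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    strategies = list(weights)
    return rng.choices(
        strategies, weights=[weights[s] for s in strategies], k=num_evolutions
    )


example_weights = {
    Evolution.REASONING: 0.6,
    Evolution.COMPARATIVE: 0.3,
    Evolution.HYPOTHETICAL: 0.1,
}
plan = pick_evolutions(example_weights, num_evolutions=3)
print([step.name for step in plan])
```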

### Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int): Number of few-shot examples
    """

class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.
    """

class GSM8K:
    """
    Grade School Math 8K benchmark.
    """

class HumanEval:
    """
    HumanEval benchmark for code generation.
    """

class BigBenchHard:
    """
    Big Bench Hard benchmark.
    """
```

[Benchmarks](./benchmarks.md)
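
The `n_shots` parameter controls how many solved examples precede each benchmark question. A minimal sketch of that prompt assembly, with a hypothetical `build_few_shot_prompt` helper and made-up example questions (not taken from MMLU or from DeepEval's prompt templates), could look like this:

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], question: str, n_shots: int
) -> str:
    """Prefix a question with up to n_shots solved (question, answer) pairs."""
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:n_shots]]
    # The final question is left unanswered for the model to complete
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


examples = [
    ("2 + 2 = ? (A) 3 (B) 4", "B"),
    ("Capital of France? (A) Paris (B) Rome", "A"),
]
prompt = build_few_shot_prompt(examples, "3 * 3 = ? (A) 6 (B) 9", n_shots=1)
print(prompt)
```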

### Tracing

Component-level observability for evaluating nested LLM components using the `@observe` decorator and trace management.

```python { .api }
def observe(
    metrics: Optional[List[BaseMetric]] = None,
    name: Optional[str] = None
):
    """
    Decorator for observing function execution and applying metrics.

    Parameters:
    - metrics (List[BaseMetric], optional): Metrics to apply
    - name (str, optional): Name for the span
    """

def update_current_span(
    test_case: Optional[LLMTestCase] = None,
    **kwargs
):
    """
    Updates the current span with additional data.

    Parameters:
    - test_case (LLMTestCase, optional): Test case data
    - **kwargs: Additional span attributes
    """

def evaluate_trace(
    trace_id: str,
    metrics: List[BaseMetric]
):
    """
    Evaluates a specific trace with metrics.
    """
```

[Tracing](./tracing.md)

### Integrations

Native integrations with popular LLM frameworks for automatic tracing and evaluation.

```python { .api }
# LangChain Integration
class CallbackHandler:
    """
    LangChain callback handler for DeepEval tracing.
    """

def tool(func):
    """
    Decorator for marking LangChain tools for tracing.
    """

# LlamaIndex Integration
def instrument_llama_index():
    """
    Instruments LlamaIndex for automatic tracing.
    """

# CrewAI Integration
def instrument_crewai():
    """
    Instruments CrewAI for automatic tracing.
    """

# PydanticAI Integration
def instrument_pydantic_ai():
    """
    Instruments PydanticAI for automatic tracing.
    """
```

[Integrations](./integrations.md)

## Utility Functions

```python { .api }
def login(api_key: str = None):
    """
    Logs into Confident AI with an API key.

    Parameters:
    - api_key (str, optional): Confident AI API key
    """

def log_hyperparameters(hyperparameters: Dict):
    """
    Logs hyperparameters for the current test run.

    Parameters:
    - hyperparameters (Dict): Dictionary of hyperparameters to log
    """

def on_test_run_end(callback: Callable):
    """
    Registers a callback to be executed when a test run ends.

    Parameters:
    - callback (Callable): Function to execute at test run end
    """
```