# DeepEval

A comprehensive Python framework for evaluating and testing large language model (LLM) systems. DeepEval provides 50+ research-backed metrics for evaluating RAG pipelines, chatbots, AI agents, and other LLM applications. It works like Pytest but is specialized for LLM evaluation, supporting both end-to-end and component-level testing.

## Package Information

- **Package Name**: deepeval
- **Language**: Python
- **Installation**: `pip install -U deepeval`
- **Minimum Python Version**: 3.9+

## Core Imports

```python
import deepeval
```

Common imports for evaluation:

```python
from deepeval import evaluate, assert_test
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import GEval, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.dataset import EvaluationDataset, Golden
```

## Basic Usage

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create a test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    expected_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Create and run a metric
metric = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [metric])
```

Evaluating multiple test cases:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create a dataset of goldens (FaithfulnessMetric needs retrieval_context)
dataset = EvaluationDataset(
    goldens=[
        Golden(
            input="What's the refund policy?",
            expected_output="30-day full refund",
            retrieval_context=["All customers get a 30-day full refund."]
        ),
        Golden(
            input="How do I return items?",
            expected_output="Contact support for return label",
            retrieval_context=["Returns require a label issued by customer support."]
        )
    ]
)

# Generate test cases (your_llm_app is a placeholder for your application)
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input),
        expected_output=golden.expected_output,
        retrieval_context=golden.retrieval_context
    )
    dataset.add_test_case(test_case)

# Evaluate entire dataset
evaluate(test_cases=dataset.test_cases, metrics=[FaithfulnessMetric()])
```

## Architecture

DeepEval's architecture consists of several key layers:

- **Test Cases**: Structured containers for inputs, outputs, and context (`LLMTestCase`, `ConversationalTestCase`, `MLLMTestCase`)
- **Metrics**: Evaluation criteria powered by LLMs, statistical methods, or NLP models (50+ built-in metrics)
- **Datasets**: Collections of test cases and "golden" examples for batch evaluation
- **Synthesizer**: Generates synthetic test data using various evolution strategies
- **Tracing**: Component-level observability with the `@observe` decorator for nested evaluations
- **Models**: Abstraction layer supporting 15+ LLM providers (OpenAI, Anthropic, local models, etc.)
- **Integrations**: Native support for LangChain, LlamaIndex, CrewAI, and PydanticAI

## Capabilities

### Test Cases

Test cases are structured containers representing LLM interactions to be evaluated. DeepEval supports standard LLM tests, multi-turn conversations, multimodal inputs, and arena-style comparisons.

```python { .api }
class LLMTestCase:
    """
    Represents a test case for evaluating LLM outputs.

    Parameters:
    - input (str): Input prompt to the LLM
    - actual_output (str, optional): Actual output from the LLM
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context information
    - retrieval_context (List[str], optional): Retrieved context for RAG
    - additional_metadata (Dict, optional): Additional metadata
    - tools_called (List[ToolCall], optional): Tools called by the LLM
    - expected_tools (List[ToolCall], optional): Expected tools to be called
    - comments (str, optional): Comments about the test case
    - name (str, optional): Name of the test case
    - tags (List[str], optional): Tags for organization
    """

class ConversationalTestCase:
    """
    Represents a multi-turn conversational test case.

    Parameters:
    - turns (List[Turn]): List of conversation turns
    - scenario (str, optional): Scenario description
    - context (List[str], optional): Context information
    - expected_outcome (str, optional): Expected outcome
    - name (str, optional): Name of the test case
    """

class MLLMTestCase:
    """
    Represents a test case for multimodal LLMs (text + images).

    Parameters:
    - input (List[Union[str, MLLMImage]]): Input with text and images
    - actual_output (List[Union[str, MLLMImage]]): Actual output
    - expected_output (List[Union[str, MLLMImage]], optional): Expected output
    - context (List[Union[str, MLLMImage]], optional): Context
    """
```
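
A minimal construction sketch for single-turn and multi-turn test cases (assumes `Turn` is importable from `deepeval.test_case` alongside the test case classes):

```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase, Turn

# Single-turn test case for a RAG interaction
rag_case = LLMTestCase(
    input="How long does shipping take?",
    actual_output="Orders arrive within 3-5 business days.",
    retrieval_context=["Standard shipping takes 3-5 business days."],
)

# Multi-turn conversational test case
convo_case = ConversationalTestCase(
    scenario="Customer asks about shipping times",
    turns=[
        Turn(role="user", content="How long does shipping take?"),
        Turn(role="assistant", content="Orders arrive within 3-5 business days."),
    ],
)
```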

[Test Cases](./test-cases.md)

### Core Evaluation

Core evaluation functions for running metrics against test cases, either individually or in batch. Supports pytest integration, standalone evaluation, and comparison between models.

```python { .api }
def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase], List[MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult:
    """
    Evaluates test cases against specified metrics.

    Returns:
    - EvaluationResult: Contains test results, Confident AI link, and test run ID
    """

def assert_test(
    test_case: Optional[Union[LLMTestCase, ConversationalTestCase, MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    run_async: bool = True
):
    """
    Asserts that a single test case passes specified metrics.

    Raises:
    - AssertionError: If metrics fail
    """

def compare(
    test_cases: List[List[LLMTestCase]],
    metrics: List[BaseMetric]
) -> ComparisonResult:
    """
    Compares multiple test results to determine which performs better.
    """
```
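
A typical pytest-style usage of the functions above (the test file name is illustrative; run with `deepeval test run test_app.py`):

```python
from deepeval import assert_test, evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_policy():
    test_case = LLMTestCase(
        input="What's the refund policy?",
        actual_output="You get a full refund within 30 days.",
    )
    # Fails the pytest run if the metric score is below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

# Standalone evaluation outside pytest returns an EvaluationResult
result = evaluate(
    test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
    metrics=[AnswerRelevancyMetric()],
)
```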

[Core Evaluation](./core-evaluation.md)

### RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems, measuring answer quality, faithfulness to context, and retrieval effectiveness.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    """

class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the retrieval context (no hallucinations).
    """

class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed to produce the expected output.
    """

class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.
    """

class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.
    """
```
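
A standalone sketch of running a RAG metric directly; `measure()` populates `score` and (by default) `reason` on the metric instance:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```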

[RAG Metrics](./rag-metrics.md)

### Content Quality Metrics

Metrics for evaluating content safety, quality, and compliance, detecting issues like hallucinations, bias, toxicity, and PII leakage.

```python { .api }
class HallucinationMetric:
    """
    Detects hallucinations by checking the actual output against the provided context.

    Parameters:
    - threshold (float): Success threshold
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    """

class BiasMetric:
    """
    Detects bias in the output.
    """

class ToxicityMetric:
    """
    Detects toxic content in the output.
    """

class SummarizationMetric:
    """
    Evaluates the quality of summaries.
    """

class PIILeakageMetric:
    """
    Detects personally identifiable information (PII) leakage.
    """
```
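
A sketch of a hallucination check; this metric compares `actual_output` against the test case's `context` (treating the threshold as a maximum passing score is an assumption worth verifying for your version):

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize our warranty terms.",
    actual_output="All products carry a lifetime warranty.",
    context=["Products carry a 1-year limited warranty."],
)

metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)
```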

[Content Quality Metrics](./content-quality-metrics.md)

### Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement.

```python { .api }
class ToolCorrectnessMetric:
    """
    Evaluates whether the correct tools were called with correct parameters.

    Parameters:
    - threshold (float): Success threshold
    """

class TaskCompletionMetric:
    """
    Evaluates whether the task was completed successfully.
    """

class ToolUseMetric:
    """
    Evaluates appropriate use of available tools.
    """

class PlanQualityMetric:
    """
    Evaluates the quality of generated plans.
    """

class GoalAccuracyMetric:
    """
    Measures accuracy in achieving specified goals.
    """
```
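
A sketch of tool-call checking (assumes `ToolCall` is importable from `deepeval.test_case`):

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book me a flight to Paris",
    actual_output="Your flight to Paris is booked.",
    tools_called=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
    expected_tools=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
```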

[Agentic Metrics](./agentic-metrics.md)

### Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, and role adherence.

```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - threshold (float): Success threshold
    """

class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.
    """

class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.
    """

class RoleAdherenceMetric:
    """
    Measures adherence to the assigned role throughout a conversation.
    """
```
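
A role-adherence sketch; it assumes `ConversationalTestCase` accepts a `chatbot_role`, which this metric evaluates against:

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

convo = ConversationalTestCase(
    chatbot_role="A polite customer support agent",
    turns=[
        Turn(role="user", content="My order is late!"),
        Turn(role="assistant", content="I'm sorry to hear that. Let me check the status for you."),
    ],
)

metric = RoleAdherenceMetric(threshold=0.7)
metric.measure(convo)
print(metric.score, metric.reason)
```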

[Conversational Metrics](./conversational-metrics.md)

### Multimodal Metrics

Metrics for evaluating multimodal LLM outputs involving text and images, including generation quality and contextual understanding.

```python { .api }
class MultimodalGEval:
    """
    G-Eval for multimodal test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[MLLMTestCaseParams]): Parameters to evaluate
    """

class TextToImageMetric:
    """
    Evaluates text-to-image generation quality.
    """

class ImageCoherenceMetric:
    """
    Evaluates coherence of images in context.
    """

class MultimodalAnswerRelevancyMetric:
    """
    Answer relevancy for multimodal inputs.
    """

class MultimodalFaithfulnessMetric:
    """
    Faithfulness for multimodal outputs.
    """
```
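
A sketch mixing text and images (assumes `MLLMImage` is importable from `deepeval.test_case`; the image URL is a placeholder):

```python
from deepeval.metrics import MultimodalAnswerRelevancyMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

test_case = MLLMTestCase(
    input=["Describe this product photo", MLLMImage(url="https://example.com/shoe.png")],
    actual_output=["A white running shoe with a red sole."],
)

metric = MultimodalAnswerRelevancyMetric()
metric.measure(test_case)
print(metric.score)
```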

[Multimodal Metrics](./multimodal-metrics.md)

### Custom Metrics

Framework for creating custom evaluation metrics using G-Eval, DAG (Deep Acyclic Graph), or by extending base metric classes.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - threshold (float): Success threshold (default: 0.5)
    """

class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold
    """

class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Abstract Methods:
    - measure(test_case: LLMTestCase) -> float
    - a_measure(test_case: LLMTestCase) -> float
    - is_successful() -> bool
    """
```
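
A custom-metric sketch following the `GEval` parameters above; `LLMTestCaseParams` selects which test case fields the judge sees:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.6,
)

test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="It was completed in 1889.",
    expected_output="The Eiffel Tower was completed in 1889.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```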

[Custom Metrics](./custom-metrics.md)

### Datasets

Tools for managing collections of test cases and "golden" examples, supporting batch evaluation, synthetic data generation, and dataset persistence.

```python { .api }
class EvaluationDataset:
    """
    Manages collections of test cases and goldens for evaluation.

    Parameters:
    - goldens (Union[List[Golden], List[ConversationalGolden]]): Initial goldens

    Methods:
    - add_test_case(test_case): Add a test case
    - add_golden(golden): Add a golden
    - generate_goldens_from_docs(document_paths, ...): Generate goldens from documents
    - evaluate(metrics): Evaluate with metrics
    - push(alias): Push to Confident AI
    - pull(alias): Pull from Confident AI
    """

class Golden:
    """
    Represents a "golden" test case - expected input/output pairs.

    Parameters:
    - input (str): Input prompt
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context
    - retrieval_context (List[str], optional): Retrieved context
    """

class ConversationalGolden:
    """
    Represents a "golden" conversational test case.

    Parameters:
    - scenario (str): Scenario description
    - expected_outcome (str, optional): Expected outcome
    - turns (List[Turn], optional): Conversation turns
    """
```
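
A persistence sketch (requires a Confident AI login; the alias is illustrative):

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="What's the refund policy?")])

# Push to Confident AI under an alias, then pull it back elsewhere
dataset.push(alias="customer-support-v1")

restored = EvaluationDataset()
restored.pull(alias="customer-support-v1")
```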

[Datasets](./dataset.md)

### Models

Model abstraction layer supporting 15+ LLM providers, multimodal models, and embedding models with a unified interface.

```python { .api }
class DeepEvalBaseLLM:
    """
    Base class for LLM integrations.

    Abstract Methods:
    - generate(prompt: str) -> str
    - a_generate(prompt: str) -> str
    - get_model_name() -> str
    """

class GPTModel:
    """
    OpenAI GPT model integration.

    Parameters:
    - model (str): Model name (e.g., "gpt-4", "gpt-3.5-turbo")
    - api_key (str, optional): OpenAI API key
    """

class AnthropicModel:
    """
    Anthropic Claude integration.
    """

class GeminiModel:
    """
    Google Gemini integration.
    """

class OllamaModel:
    """
    Ollama model integration for local models.
    """

class DeepEvalBaseMLLM:
    """
    Base class for multimodal LLM integrations.
    """
```
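
A sketch of a custom judge model; subclasses implement the abstract methods above, and any metric accepts the instance via its `model` parameter (the canned return value is a placeholder for a real LLM call):

```python
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric

class MyJudge(DeepEvalBaseLLM):
    def load_model(self):
        return None  # construct and return your own LLM client here

    def generate(self, prompt: str) -> str:
        return "placeholder judge response"  # call your own LLM here

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-judge"

metric = AnswerRelevancyMetric(model=MyJudge())
```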

[Models](./models.md)

### Synthesizer

Synthetic test data generation using various evolution strategies (reasoning, multi-context, concretizing, etc.) to create diverse and challenging test cases.

```python { .api }
class Synthesizer:
    """
    Generates synthetic test data and goldens.

    Parameters:
    - model (Union[str, DeepEvalBaseLLM], optional): Model for generation
    - async_mode (bool): Async mode (default: True)
    - filtration_config (FiltrationConfig, optional): Filtration configuration
    - evolution_config (EvolutionConfig, optional): Evolution configuration
    - styling_config (StylingConfig, optional): Styling configuration

    Methods:
    - generate_goldens_from_docs(document_paths, ...) -> List[Golden]
    - generate_goldens_from_contexts(contexts, ...) -> List[Golden]
    - generate_goldens_from_scratch(num_goldens, ...) -> List[Golden]
    - save_as(file_type, directory, ...)
    """

class Evolution:
    """
    Enum of input evolution strategies.

    Values:
    - REASONING: Add reasoning complexity
    - MULTICONTEXT: Require multiple contexts
    - CONCRETIZING: Make more concrete
    - CONSTRAINED: Add constraints
    - COMPARATIVE: Add comparisons
    - HYPOTHETICAL: Make hypothetical
    """
```
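
A generation sketch; each inner list in `contexts` is one group of context strings to synthesize goldens from:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Customers may return items within 30 days for a full refund."],
        ["Standard shipping takes 3-5 business days."],
    ],
)
synthesizer.save_as(file_type="json", directory="./synthetic_data")
```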

[Synthesizer](./synthesizer.md)

### Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int): Number of few-shot examples
    """

class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.
    """

class GSM8K:
    """
    Grade School Math 8K benchmark.
    """

class HumanEval:
    """
    HumanEval benchmark for code generation.
    """

class BigBenchHard:
    """
    Big Bench Hard benchmark.
    """
```
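
A benchmark sketch; `my_model` stands in for a `DeepEvalBaseLLM` subclass instance like the one in the Models section, and the task enum import path is an assumption:

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(tasks=[MMLUTask.ASTRONOMY], n_shots=3)
benchmark.evaluate(model=my_model)  # my_model: your DeepEvalBaseLLM wrapper
print(benchmark.overall_score)
```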

[Benchmarks](./benchmarks.md)

### Tracing

Component-level observability for evaluating nested LLM components using the `@observe` decorator and trace management.

```python { .api }
def observe(
    metrics: Optional[List[BaseMetric]] = None,
    name: Optional[str] = None
):
    """
    Decorator for observing function execution and applying metrics.

    Parameters:
    - metrics (List[BaseMetric], optional): Metrics to apply
    - name (str, optional): Name for the span
    """

def update_current_span(
    test_case: Optional[LLMTestCase] = None,
    **kwargs
):
    """
    Updates the current span with additional data.

    Parameters:
    - test_case (LLMTestCase, optional): Test case data
    - **kwargs: Additional span attributes
    """

def evaluate_trace(
    trace_id: str,
    metrics: List[BaseMetric]
):
    """
    Evaluates a specific trace with metrics.
    """
```
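
A component-level sketch: the decorated function becomes a span, and `update_current_span` attaches the test case that the span's metrics evaluate (the canned answer is a placeholder for a real LLM call):

```python
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

@observe(metrics=[AnswerRelevancyMetric()])
def generate_answer(question: str) -> str:
    answer = "You get a full refund within 30 days."  # placeholder for a real LLM call
    update_current_span(test_case=LLMTestCase(input=question, actual_output=answer))
    return answer
```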

[Tracing](./tracing.md)

### Integrations

Native integrations with popular LLM frameworks for automatic tracing and evaluation.

```python { .api }
# LangChain Integration
class CallbackHandler:
    """
    LangChain callback handler for DeepEval tracing.
    """

def tool(func):
    """
    Decorator for marking LangChain tools for tracing.
    """

# LlamaIndex Integration
def instrument_llama_index():
    """
    Instruments LlamaIndex for automatic tracing.
    """

# CrewAI Integration
def instrument_crewai():
    """
    Instruments CrewAI for automatic tracing.
    """

# PydanticAI Integration
def instrument_pydantic_ai():
    """
    Instruments PydanticAI for automatic tracing.
    """
```
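
A hedged LangChain sketch; the import path follows the `deepeval.integrations.<framework>` pattern and may differ by version, and `chain` is a placeholder for any LangChain runnable:

```python
from deepeval.integrations.langchain import CallbackHandler

# Pass the handler through LangChain's standard callbacks config
response = chain.invoke(
    {"question": "What's the refund policy?"},
    config={"callbacks": [CallbackHandler()]},
)
```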

[Integrations](./integrations.md)

## Utility Functions

```python { .api }
def login(api_key: str = None):
    """
    Logs into Confident AI with an API key.

    Parameters:
    - api_key (str, optional): Confident AI API key
    """

def log_hyperparameters(hyperparameters: Dict):
    """
    Logs hyperparameters for the current test run.

    Parameters:
    - hyperparameters (Dict): Dictionary of hyperparameters to log
    """

def on_test_run_end(callback: Callable):
    """
    Registers a callback to be executed when a test run ends.

    Parameters:
    - callback (Callable): Function to execute at test run end
    """
```
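
A brief sketch following the signatures above (the API key is a placeholder, and exact call conventions may differ by version):

```python
import deepeval

deepeval.login(api_key="your-confident-ai-key")
deepeval.log_hyperparameters({"model": "gpt-4", "temperature": 0.7})
deepeval.on_test_run_end(lambda: print("test run finished"))
```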