# DeepEval

A comprehensive Python framework for evaluating and testing large language model (LLM) systems. DeepEval provides 50+ research-backed metrics for evaluating RAG pipelines, chatbots, AI agents, and other LLM applications. It works like Pytest but is specialized for LLM evaluation, supporting both end-to-end and component-level testing.

## Package Information

- **Package Name**: deepeval
- **Language**: Python
- **Installation**: `pip install -U deepeval`
- **Minimum Python Version**: 3.9+

## Core Imports

```python
import deepeval
```

Common imports for evaluation:

```python
from deepeval import evaluate, assert_test
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import GEval, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.dataset import EvaluationDataset, Golden
```

## Basic Usage

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create a test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    expected_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Create and run a metric
metric = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [metric])
```

Evaluating multiple test cases:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create a dataset of goldens (FaithfulnessMetric needs retrieval_context)
dataset = EvaluationDataset(
    goldens=[
        Golden(
            input="What's the refund policy?",
            expected_output="30-day full refund",
            retrieval_context=["All customers get a 30-day full refund."]
        ),
        Golden(
            input="How do I return items?",
            expected_output="Contact support for return label",
            retrieval_context=["Returns require a label issued by customer support."]
        )
    ]
)

# Generate test cases (your_llm_app is a placeholder for your application)
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input),
        expected_output=golden.expected_output,
        retrieval_context=golden.retrieval_context
    )
    dataset.add_test_case(test_case)

# Evaluate entire dataset
evaluate(test_cases=dataset.test_cases, metrics=[FaithfulnessMetric()])
```

## Architecture

DeepEval's architecture consists of several key layers:

- **Test Cases**: Structured containers for inputs, outputs, and context (`LLMTestCase`, `ConversationalTestCase`, `MLLMTestCase`)
- **Metrics**: Evaluation criteria powered by LLMs, statistical methods, or NLP models (50+ built-in metrics)
- **Datasets**: Collections of test cases and "golden" examples for batch evaluation
- **Synthesizer**: Generates synthetic test data using various evolution strategies
- **Tracing**: Component-level observability with the `@observe` decorator for nested evaluations
- **Models**: Abstraction layer supporting 15+ LLM providers (OpenAI, Anthropic, local models, etc.)
- **Integrations**: Native support for LangChain, LlamaIndex, CrewAI, and PydanticAI

## Capabilities

### Test Cases

Test cases are structured containers representing LLM interactions to be evaluated. DeepEval supports standard LLM tests, multi-turn conversations, multimodal inputs, and arena-style comparisons.

```python { .api }
class LLMTestCase:
    """
    Represents a test case for evaluating LLM outputs.

    Parameters:
    - input (str): Input prompt to the LLM
    - actual_output (str, optional): Actual output from the LLM
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context information
    - retrieval_context (List[str], optional): Retrieved context for RAG
    - additional_metadata (Dict, optional): Additional metadata
    - tools_called (List[ToolCall], optional): Tools called by the LLM
    - expected_tools (List[ToolCall], optional): Expected tools to be called
    - comments (str, optional): Comments about the test case
    - name (str, optional): Name of the test case
    - tags (List[str], optional): Tags for organization
    """

class ConversationalTestCase:
    """
    Represents a multi-turn conversational test case.

    Parameters:
    - turns (List[Turn]): List of conversation turns
    - scenario (str, optional): Scenario description
    - context (List[str], optional): Context information
    - expected_outcome (str, optional): Expected outcome
    - name (str, optional): Name of the test case
    """

class MLLMTestCase:
    """
    Represents a test case for multimodal LLMs (text + images).

    Parameters:
    - input (List[Union[str, MLLMImage]]): Input with text and images
    - actual_output (List[Union[str, MLLMImage]]): Actual output
    - expected_output (List[Union[str, MLLMImage]], optional): Expected output
    - context (List[Union[str, MLLMImage]], optional): Context
    """
```
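
A minimal construction sketch for single-turn and multi-turn test cases (assumes `Turn` is importable from `deepeval.test_case` alongside the test case classes):

```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase, Turn

# Single-turn test case for a RAG interaction
rag_case = LLMTestCase(
    input="How long does shipping take?",
    actual_output="Orders arrive within 3-5 business days.",
    retrieval_context=["Standard shipping takes 3-5 business days."],
)

# Multi-turn conversational test case
convo_case = ConversationalTestCase(
    scenario="Customer asks about shipping times",
    turns=[
        Turn(role="user", content="How long does shipping take?"),
        Turn(role="assistant", content="Orders arrive within 3-5 business days."),
    ],
)
```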

[Test Cases](./test-cases.md)

### Core Evaluation

Core evaluation functions for running metrics against test cases, either individually or in batch. Supports pytest integration, standalone evaluation, and comparison between models.

```python { .api }
def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase], List[MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult:
    """
    Evaluates test cases against specified metrics.

    Returns:
    - EvaluationResult: Contains test results, Confident AI link, and test run ID
    """

def assert_test(
    test_case: Optional[Union[LLMTestCase, ConversationalTestCase, MLLMTestCase]],
    metrics: Optional[Union[List[BaseMetric], List[BaseConversationalMetric], List[BaseMultimodalMetric]]] = None,
    run_async: bool = True
):
    """
    Asserts that a single test case passes specified metrics.

    Raises:
    - AssertionError: If metrics fail
    """

def compare(
    test_cases: List[List[LLMTestCase]],
    metrics: List[BaseMetric]
) -> ComparisonResult:
    """
    Compares multiple test results to determine which performs better.
    """
```
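
A typical pytest-style usage of the functions above (the test file name is illustrative; run with `deepeval test run test_app.py`):

```python
from deepeval import assert_test, evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_policy():
    test_case = LLMTestCase(
        input="What's the refund policy?",
        actual_output="You get a full refund within 30 days.",
    )
    # Fails the pytest run if the metric score is below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

# Standalone evaluation outside pytest returns an EvaluationResult
result = evaluate(
    test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
    metrics=[AnswerRelevancyMetric()],
)
```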

[Core Evaluation](./core-evaluation.md)

### RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems, measuring answer quality, faithfulness to context, and retrieval effectiveness.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    """

class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the retrieval context (no hallucinations).
    """

class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed to produce the expected output.
    """

class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.
    """

class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.
    """
```
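
A standalone sketch of running a RAG metric directly; `measure()` populates `score` and (by default) `reason` on the metric instance:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```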

[RAG Metrics](./rag-metrics.md)

### Content Quality Metrics

Metrics for evaluating content safety, quality, and compliance, detecting issues like hallucinations, bias, toxicity, and PII leakage.

```python { .api }
class HallucinationMetric:
    """
    Detects hallucinations by checking the actual output against the provided context.

    Parameters:
    - threshold (float): Success threshold
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    """

class BiasMetric:
    """
    Detects bias in the output.
    """

class ToxicityMetric:
    """
    Detects toxic content in the output.
    """

class SummarizationMetric:
    """
    Evaluates the quality of summaries.
    """

class PIILeakageMetric:
    """
    Detects personally identifiable information (PII) leakage.
    """
```
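
A sketch of a hallucination check; this metric compares `actual_output` against the test case's `context` (treating the threshold as a maximum passing score is an assumption worth verifying for your version):

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize our warranty terms.",
    actual_output="All products carry a lifetime warranty.",
    context=["Products carry a 1-year limited warranty."],
)

metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)
```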

[Content Quality Metrics](./content-quality-metrics.md)

### Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement.

```python { .api }
class ToolCorrectnessMetric:
    """
    Evaluates whether the correct tools were called with correct parameters.

    Parameters:
    - threshold (float): Success threshold
    """

class TaskCompletionMetric:
    """
    Evaluates whether the task was completed successfully.
    """

class ToolUseMetric:
    """
    Evaluates appropriate use of available tools.
    """

class PlanQualityMetric:
    """
    Evaluates the quality of generated plans.
    """

class GoalAccuracyMetric:
    """
    Measures accuracy in achieving specified goals.
    """
```
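
A sketch of tool-call checking (assumes `ToolCall` is importable from `deepeval.test_case`):

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book me a flight to Paris",
    actual_output="Your flight to Paris is booked.",
    tools_called=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
    expected_tools=[ToolCall(name="search_flights"), ToolCall(name="book_flight")],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
```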

[Agentic Metrics](./agentic-metrics.md)

### Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, and role adherence.

```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - threshold (float): Success threshold
    """

class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.
    """

class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.
    """

class RoleAdherenceMetric:
    """
    Measures adherence to the assigned role throughout a conversation.
    """
```
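
A role-adherence sketch; it assumes `ConversationalTestCase` accepts a `chatbot_role`, which this metric evaluates against:

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

convo = ConversationalTestCase(
    chatbot_role="A polite customer support agent",
    turns=[
        Turn(role="user", content="My order is late!"),
        Turn(role="assistant", content="I'm sorry to hear that. Let me check the status for you."),
    ],
)

metric = RoleAdherenceMetric(threshold=0.7)
metric.measure(convo)
print(metric.score, metric.reason)
```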

[Conversational Metrics](./conversational-metrics.md)

### Multimodal Metrics

Metrics for evaluating multimodal LLM outputs involving text and images, including generation quality and contextual understanding.

```python { .api }
class MultimodalGEval:
    """
    G-Eval for multimodal test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[MLLMTestCaseParams]): Parameters to evaluate
    """

class TextToImageMetric:
    """
    Evaluates text-to-image generation quality.
    """

class ImageCoherenceMetric:
    """
    Evaluates coherence of images in context.
    """

class MultimodalAnswerRelevancyMetric:
    """
    Answer relevancy for multimodal inputs.
    """

class MultimodalFaithfulnessMetric:
    """
    Faithfulness for multimodal outputs.
    """
```
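
A sketch mixing text and images (assumes `MLLMImage` is importable from `deepeval.test_case`; the image URL is a placeholder):

```python
from deepeval.metrics import MultimodalAnswerRelevancyMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

test_case = MLLMTestCase(
    input=["Describe this product photo", MLLMImage(url="https://example.com/shoe.png")],
    actual_output=["A white running shoe with a red sole."],
)

metric = MultimodalAnswerRelevancyMetric()
metric.measure(test_case)
print(metric.score)
```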

[Multimodal Metrics](./multimodal-metrics.md)

### Custom Metrics

Framework for creating custom evaluation metrics using G-Eval, DAG (Deep Acyclic Graph), or by extending base metric classes.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - threshold (float): Success threshold (default: 0.5)
    """

class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold
    """

class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Abstract Methods:
    - measure(test_case: LLMTestCase) -> float
    - a_measure(test_case: LLMTestCase) -> float
    - is_successful() -> bool
    """
```
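
A custom-metric sketch following the `GEval` parameters above; `LLMTestCaseParams` selects which test case fields the judge sees:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.6,
)

test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="It was completed in 1889.",
    expected_output="The Eiffel Tower was completed in 1889.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```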

[Custom Metrics](./custom-metrics.md)

### Datasets

Tools for managing collections of test cases and "golden" examples, supporting batch evaluation, synthetic data generation, and dataset persistence.

```python { .api }
class EvaluationDataset:
    """
    Manages collections of test cases and goldens for evaluation.

    Parameters:
    - goldens (Union[List[Golden], List[ConversationalGolden]]): Initial goldens

    Methods:
    - add_test_case(test_case): Add a test case
    - add_golden(golden): Add a golden
    - generate_goldens_from_docs(document_paths, ...): Generate goldens from documents
    - evaluate(metrics): Evaluate with metrics
    - push(alias): Push to Confident AI
    - pull(alias): Pull from Confident AI
    """

class Golden:
    """
    Represents a "golden" test case - expected input/output pairs.

    Parameters:
    - input (str): Input prompt
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context
    - retrieval_context (List[str], optional): Retrieved context
    """

class ConversationalGolden:
    """
    Represents a "golden" conversational test case.

    Parameters:
    - scenario (str): Scenario description
    - expected_outcome (str, optional): Expected outcome
    - turns (List[Turn], optional): Conversation turns
    """
```
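
A persistence sketch (requires a Confident AI login; the alias is illustrative):

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="What's the refund policy?")])

# Push to Confident AI under an alias, then pull it back elsewhere
dataset.push(alias="customer-support-v1")

restored = EvaluationDataset()
restored.pull(alias="customer-support-v1")
```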

[Datasets](./dataset.md)

### Models

Model abstraction layer supporting 15+ LLM providers, multimodal models, and embedding models with a unified interface.

```python { .api }
class DeepEvalBaseLLM:
    """
    Base class for LLM integrations.

    Abstract Methods:
    - generate(prompt: str) -> str
    - a_generate(prompt: str) -> str
    - get_model_name() -> str
    """

class GPTModel:
    """
    OpenAI GPT model integration.

    Parameters:
    - model (str): Model name (e.g., "gpt-4", "gpt-3.5-turbo")
    - api_key (str, optional): OpenAI API key
    """

class AnthropicModel:
    """
    Anthropic Claude integration.
    """

class GeminiModel:
    """
    Google Gemini integration.
    """

class OllamaModel:
    """
    Ollama model integration for local models.
    """

class DeepEvalBaseMLLM:
    """
    Base class for multimodal LLM integrations.
    """
```
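
A sketch of a custom judge model; subclasses implement the abstract methods above, and any metric accepts the instance via its `model` parameter (the canned return value is a placeholder for a real LLM call):

```python
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric

class MyJudge(DeepEvalBaseLLM):
    def load_model(self):
        return None  # construct and return your own LLM client here

    def generate(self, prompt: str) -> str:
        return "placeholder judge response"  # call your own LLM here

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-judge"

metric = AnswerRelevancyMetric(model=MyJudge())
```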

[Models](./models.md)

### Synthesizer

Synthetic test data generation using various evolution strategies (reasoning, multi-context, concretizing, etc.) to create diverse and challenging test cases.

```python { .api }
class Synthesizer:
    """
    Generates synthetic test data and goldens.

    Parameters:
    - model (Union[str, DeepEvalBaseLLM], optional): Model for generation
    - async_mode (bool): Async mode (default: True)
    - filtration_config (FiltrationConfig, optional): Filtration configuration
    - evolution_config (EvolutionConfig, optional): Evolution configuration
    - styling_config (StylingConfig, optional): Styling configuration

    Methods:
    - generate_goldens_from_docs(document_paths, ...) -> List[Golden]
    - generate_goldens_from_contexts(contexts, ...) -> List[Golden]
    - generate_goldens_from_scratch(num_goldens, ...) -> List[Golden]
    - save_as(file_type, directory, ...)
    """

class Evolution:
    """
    Enum of input evolution strategies.

    Values:
    - REASONING: Add reasoning complexity
    - MULTICONTEXT: Require multiple contexts
    - CONCRETIZING: Make more concrete
    - CONSTRAINED: Add constraints
    - COMPARATIVE: Add comparisons
    - HYPOTHETICAL: Make hypothetical
    """
```
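
A generation sketch; each inner list in `contexts` is one group of context strings to synthesize goldens from:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Customers may return items within 30 days for a full refund."],
        ["Standard shipping takes 3-5 business days."],
    ],
)
synthesizer.save_as(file_type="json", directory="./synthetic_data")
```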

[Synthesizer](./synthesizer.md)

### Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int): Number of few-shot examples
    """

class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.
    """

class GSM8K:
    """
    Grade School Math 8K benchmark.
    """

class HumanEval:
    """
    HumanEval benchmark for code generation.
    """

class BigBenchHard:
    """
    Big Bench Hard benchmark.
    """
```
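
A benchmark sketch; `my_model` stands in for a `DeepEvalBaseLLM` subclass instance like the one in the Models section, and the task enum import path is an assumption:

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(tasks=[MMLUTask.ASTRONOMY], n_shots=3)
benchmark.evaluate(model=my_model)  # my_model: your DeepEvalBaseLLM wrapper
print(benchmark.overall_score)
```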

[Benchmarks](./benchmarks.md)

### Tracing

Component-level observability for evaluating nested LLM components using the `@observe` decorator and trace management.

```python { .api }
def observe(
    metrics: Optional[List[BaseMetric]] = None,
    name: Optional[str] = None
):
    """
    Decorator for observing function execution and applying metrics.

    Parameters:
    - metrics (List[BaseMetric], optional): Metrics to apply
    - name (str, optional): Name for the span
    """

def update_current_span(
    test_case: Optional[LLMTestCase] = None,
    **kwargs
):
    """
    Updates the current span with additional data.

    Parameters:
    - test_case (LLMTestCase, optional): Test case data
    - **kwargs: Additional span attributes
    """

def evaluate_trace(
    trace_id: str,
    metrics: List[BaseMetric]
):
    """
    Evaluates a specific trace with metrics.
    """
```
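
A component-level sketch: the decorated function becomes a span, and `update_current_span` attaches the test case that the span's metrics evaluate (the canned answer is a placeholder for a real LLM call):

```python
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

@observe(metrics=[AnswerRelevancyMetric()])
def generate_answer(question: str) -> str:
    answer = "You get a full refund within 30 days."  # placeholder for a real LLM call
    update_current_span(test_case=LLMTestCase(input=question, actual_output=answer))
    return answer
```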

[Tracing](./tracing.md)

### Integrations

Native integrations with popular LLM frameworks for automatic tracing and evaluation.

```python { .api }
# LangChain Integration
class CallbackHandler:
    """
    LangChain callback handler for DeepEval tracing.
    """

def tool(func):
    """
    Decorator for marking LangChain tools for tracing.
    """

# LlamaIndex Integration
def instrument_llama_index():
    """
    Instruments LlamaIndex for automatic tracing.
    """

# CrewAI Integration
def instrument_crewai():
    """
    Instruments CrewAI for automatic tracing.
    """

# PydanticAI Integration
def instrument_pydantic_ai():
    """
    Instruments PydanticAI for automatic tracing.
    """
```
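
A hedged LangChain sketch; the import path follows the `deepeval.integrations.<framework>` pattern and may differ by version, and `chain` is a placeholder for any LangChain runnable:

```python
from deepeval.integrations.langchain import CallbackHandler

# Pass the handler through LangChain's standard callbacks config
response = chain.invoke(
    {"question": "What's the refund policy?"},
    config={"callbacks": [CallbackHandler()]},
)
```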

[Integrations](./integrations.md)

## Utility Functions

```python { .api }
def login(api_key: str = None):
    """
    Logs into Confident AI with an API key.

    Parameters:
    - api_key (str, optional): Confident AI API key
    """

def log_hyperparameters(hyperparameters: Dict):
    """
    Logs hyperparameters for the current test run.

    Parameters:
    - hyperparameters (Dict): Dictionary of hyperparameters to log
    """

def on_test_run_end(callback: Callable):
    """
    Registers a callback to be executed when a test run ends.

    Parameters:
    - callback (Callable): Function to execute at test run end
    """
```
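
A brief sketch following the signatures above (the API key is a placeholder, and exact call conventions may differ by version):

```python
import deepeval

deepeval.login(api_key="your-confident-ai-key")
deepeval.log_hyperparameters({"model": "gpt-4", "temperature": 0.7})
deepeval.on_test_run_end(lambda: print("test run finished"))
```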