# RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics measure answer quality, faithfulness to context, and retrieval effectiveness using LLM-based evaluation.

## Imports

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
```

## Capabilities

### Answer Relevancy Metric

Measures whether the answer is relevant to the input question. Evaluates whether the LLM's response addresses what was asked.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[AnswerRelevancyTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    - statements (List[str]): Generated statements from actual output
    - verdicts (List[AnswerRelevancyVerdict]): Verdicts for each statement
    """
```
Usage example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It's known as the City of Light."
)

# Evaluate
metric.measure(test_case)

print(f"Score: {metric.score}")      # e.g., 0.95
print(f"Reason: {metric.reason}")    # Explanation
print(f"Success: {metric.success}")  # True if score >= 0.7
```
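
Per the `statements`/`verdicts` attributes above, the metric decomposes the actual output into statements and issues a relevancy verdict for each. A minimal sketch of the final aggregation step, using hypothetical verdict strings (the real metric's verdict objects and edge-case handling differ):

```python
# Sketch: aggregating an answer-relevancy score from per-statement
# verdicts (mirroring the `statements`/`verdicts` attributes above).
# The verdict values here are hypothetical illustrations.

def aggregate_relevancy(verdicts: list[str]) -> float:
    """Fraction of statements judged relevant to the input question."""
    if not verdicts:
        return 0.0
    relevant = sum(1 for v in verdicts if v == "yes")
    return relevant / len(verdicts)

verdicts = ["yes", "yes", "no", "yes"]  # one verdict per statement
print(aggregate_relevancy(verdicts))  # 0.75
```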

### Faithfulness Metric

Measures whether the answer is faithful to the context, detecting hallucinations by checking whether every claim in the output is supported by the provided context.

```python { .api }
class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the context (no hallucinations).

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - truths_extraction_limit (int, optional): Limit number of truths extracted from context
    - penalize_ambiguous_claims (bool): Penalize ambiguous claims (default: False)
    - evaluation_template (Type[FaithfulnessTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - RETRIEVAL_CONTEXT or CONTEXT

    Attributes:
    - score (float): Faithfulness score (0-1)
    - reason (str): Explanation with unfaithful claims, if any
    - success (bool): Whether score meets threshold
    - truths (List[str]): Extracted truths from context
    - claims (List[str]): Extracted claims from output
    - verdicts (List[FaithfulnessVerdict]): Verdicts for each claim
    """
```
Usage example:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = FaithfulnessMetric(threshold=0.8)

# Test case with retrieval context
test_case = LLMTestCase(
    input="What is the refund policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days."
    ]
)

# Evaluate faithfulness
metric.measure(test_case)

if metric.success:
    print("Output is faithful to context")
else:
    print(f"Hallucination detected: {metric.reason}")
```
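
The `truths`/`claims`/`verdicts` attributes suggest how the score is formed: claims extracted from the output are checked against truths extracted from the context. A sketch of that final ratio, with hypothetical per-claim verdicts (True = supported):

```python
# Sketch: faithfulness as the fraction of output claims supported by the
# context. The boolean verdicts below are hypothetical stand-ins for the
# metric's per-claim FaithfulnessVerdict objects.

def faithfulness_ratio(claim_supported: list[bool]) -> float:
    """Supported claims / total claims; treat a claim-free output as faithful."""
    if not claim_supported:
        return 1.0
    return sum(claim_supported) / len(claim_supported)

# Two claims backed by the retrieval context, one contradicted:
print(faithfulness_ratio([True, True, False]))  # 0.666...
```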

### Contextual Recall Metric

Measures whether the retrieved context contains all the information needed to answer the question. Evaluates the completeness of the retrieval system.

```python { .api }
class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed to answer.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Recall score (0-1)
    - reason (str): Explanation of what's missing, if anything
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRecallMetric(threshold=0.7)

# Test case with expected output
test_case = LLMTestCase(
    input="How do I reset my password?",
    expected_output="Click 'Forgot Password' on the login page and check your email for the reset link.",
    retrieval_context=[
        "Password reset: Click 'Forgot Password' on login page",
        "Reset link sent to registered email address"
    ]
)

# Evaluate recall
metric.measure(test_case)

if not metric.success:
    print(f"Missing information: {metric.reason}")
```
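
Conceptually, the recall score is the share of statements in the expected output that can be attributed to some retrieved node. A sketch of that ratio with hypothetical attribution flags:

```python
# Sketch: contextual recall as the fraction of expected-output statements
# attributable to at least one retrieved node. The flags are hypothetical;
# the real metric derives them with an LLM judge.

def contextual_recall_ratio(attributable: list[bool]) -> float:
    """Attributable statements / total statements in the expected output."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)

# Both statements of the expected answer are covered by the context:
print(contextual_recall_ratio([True, True]))  # 1.0
```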

### Contextual Relevancy Metric

Measures whether the retrieved context is relevant to the input question. Evaluates the precision of the retrieval system by identifying irrelevant context.

```python { .api }
class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant context
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRelevancyMetric(threshold=0.7)

# Test case
test_case = LLMTestCase(
    input="What are the shipping costs to California?",
    retrieval_context=[
        "Shipping to California: $5.99 for standard, $12.99 for express",
        "California has over 39 million residents",  # Irrelevant
        "Free shipping on orders over $50"
    ]
)

# Evaluate relevancy
metric.measure(test_case)

if not metric.success:
    print(f"Irrelevant context detected: {metric.reason}")
```

### Contextual Precision Metric

Measures whether relevant context nodes are ranked higher than irrelevant ones in the retrieval context. Evaluates the ranking quality of the retrieval system.

```python { .api }
class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT (order matters)

    Attributes:
    - score (float): Precision score (0-1)
    - reason (str): Explanation of ranking issues
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualPrecisionMetric(threshold=0.7)

# Test case with ordered retrieval context
test_case = LLMTestCase(
    input="What is the return policy?",
    expected_output="30-day return policy with full refund",
    retrieval_context=[
        "California sales tax rate is 7.25%",           # Irrelevant (ranked too high)
        "All products have a 30-day return policy",     # Relevant (should be first)
        "Returns are processed within 5 business days"  # Relevant
    ]
)

# Evaluate precision
metric.measure(test_case)

if not metric.success:
    print(f"Ranking issue: {metric.reason}")
```
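
Because order matters here, a common way to score ranking quality (the RAGAS-style formulation) is the mean of precision@k over the ranks that hold a relevant node, so relevant nodes ranked below irrelevant ones drag the score down. A sketch, assuming binary relevance labels per node (the real metric infers relevance from the input and expected output):

```python
# Sketch: rank-aware precision as the mean of precision@k taken at each
# rank k that holds a relevant node. The relevance labels are hypothetical.

def rank_aware_precision(relevance: list[bool]) -> float:
    precisions = []
    relevant_seen = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # precision@k
    return sum(precisions) / len(precisions) if precisions else 0.0

# Irrelevant node ranked first (as in the test case above):
print(rank_aware_precision([False, True, True]))  # ~0.583
# Same nodes with the relevant ones first:
print(rank_aware_precision([True, True, False]))  # 1.0
```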

## Combined RAG Evaluation

Evaluate all RAG aspects together:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
from deepeval.test_case import LLMTestCase

# Create comprehensive RAG metrics
rag_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7)
]

# Test cases for the RAG pipeline.
# rag_pipeline and get_retrieval_context stand in for your own
# generation and retrieval functions.
test_cases = [
    LLMTestCase(
        input="What's the shipping policy?",
        actual_output=rag_pipeline("What's the shipping policy?"),
        expected_output="Free shipping on orders over $50, 3-5 business days",
        retrieval_context=get_retrieval_context("What's the shipping policy?")
    ),
    # ... more test cases
]

# Evaluate the entire RAG pipeline
result = evaluate(test_cases, rag_metrics)

# Analyze results by metric type
for metric_name in ["Answer Relevancy", "Faithfulness", "Contextual Recall",
                    "Contextual Relevancy", "Contextual Precision"]:
    scores = [tr.metrics[metric_name].score for tr in result.test_results]
    avg_score = sum(scores) / len(scores)
    print(f"{metric_name}: {avg_score:.2f}")
```

## Metric Customization

Customize metrics with specific models and configurations:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

# Use a specific model for evaluation
custom_model = GPTModel(model="gpt-4-turbo")

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.75,
    model=custom_model,
    include_reason=True,
    strict_mode=True,   # More stringent evaluation
    verbose_mode=True   # Print detailed logs
)

faithfulness = FaithfulnessMetric(
    threshold=0.85,
    model=custom_model
)

# Use in evaluation
test_case = LLMTestCase(...)
answer_relevancy.measure(test_case)
faithfulness.measure(test_case)
```

## RAGAS Composite Score

While DeepEval provides individual RAG metrics, you can compute a RAGAS-style composite score:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric
)

# Evaluate with RAGAS component metrics
result = evaluate(test_cases, [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric()
])

# Compute the RAGAS score (harmonic mean of component scores)
for test_result in result.test_results:
    scores = [m.score for m in test_result.metrics.values()]
    if any(s == 0 for s in scores):
        ragas_score = 0.0  # the harmonic mean collapses to 0 if any component is 0
    else:
        ragas_score = len(scores) / sum(1 / s for s in scores)
    print(f"RAGAS Score: {ragas_score:.3f}")
```
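
The harmonic mean can be factored into a small helper. Note that it collapses to 0 whenever any component score is 0, which is exactly why it punishes a pipeline that fails on any single dimension:

```python
def ragas_composite(scores: list[float]) -> float:
    """Harmonic mean of component scores (0.0 if any component is 0)."""
    if not scores or any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)

# A weak faithfulness score drags the composite well below the arithmetic mean:
print(ragas_composite([0.95, 0.4, 0.9, 0.9]))   # ~0.69 (arithmetic mean: ~0.79)
print(ragas_composite([0.9, 0.9, 0.9, 0.9]))    # ~0.9
```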