# RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics measure answer quality, faithfulness to context, and retrieval effectiveness using LLM-based evaluation.

## Imports

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
```

## Capabilities

### Answer Relevancy Metric

Measures whether the answer is relevant to the input question. Evaluates whether the LLM's response addresses what was asked.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[AnswerRelevancyTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    - statements (List[str]): Generated statements from actual output
    - verdicts (List[AnswerRelevancyVerdict]): Verdicts for each statement
    """
```
Usage example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It's known as the City of Light."
)

# Evaluate
metric.measure(test_case)

print(f"Score: {metric.score}")      # e.g., 0.95
print(f"Reason: {metric.reason}")    # Explanation
print(f"Success: {metric.success}")  # True if score >= 0.7
```
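
Per the `statements`/`verdicts` attributes above, the metric decomposes the actual output into statements and issues a relevancy verdict for each. A minimal sketch of the final aggregation step, using hypothetical verdict strings (the real metric's verdict objects and edge-case handling differ):

```python
# Sketch: aggregating an answer-relevancy score from per-statement
# verdicts (mirroring the `statements`/`verdicts` attributes above).
# The verdict values here are hypothetical illustrations.

def aggregate_relevancy(verdicts: list[str]) -> float:
    """Fraction of statements judged relevant to the input question."""
    if not verdicts:
        return 0.0
    relevant = sum(1 for v in verdicts if v == "yes")
    return relevant / len(verdicts)

verdicts = ["yes", "yes", "no", "yes"]  # one verdict per statement
print(aggregate_relevancy(verdicts))  # 0.75
```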

### Faithfulness Metric

Measures whether the answer is faithful to the context, detecting hallucinations by checking whether every claim in the output is supported by the provided context.

```python { .api }
class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the context (no hallucinations).

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - truths_extraction_limit (int, optional): Limit number of truths extracted from context
    - penalize_ambiguous_claims (bool): Penalize ambiguous claims (default: False)
    - evaluation_template (Type[FaithfulnessTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - RETRIEVAL_CONTEXT or CONTEXT

    Attributes:
    - score (float): Faithfulness score (0-1)
    - reason (str): Explanation with unfaithful claims, if any
    - success (bool): Whether score meets threshold
    - truths (List[str]): Extracted truths from context
    - claims (List[str]): Extracted claims from output
    - verdicts (List[FaithfulnessVerdict]): Verdicts for each claim
    """
```
Usage example:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = FaithfulnessMetric(threshold=0.8)

# Test case with retrieval context
test_case = LLMTestCase(
    input="What is the refund policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days."
    ]
)

# Evaluate faithfulness
metric.measure(test_case)

if metric.success:
    print("Output is faithful to context")
else:
    print(f"Hallucination detected: {metric.reason}")
```
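
The `truths`/`claims`/`verdicts` attributes suggest how the score is formed: claims extracted from the output are checked against truths extracted from the context. A sketch of that final ratio, with hypothetical per-claim verdicts (True = supported):

```python
# Sketch: faithfulness as the fraction of output claims supported by the
# context. The boolean verdicts below are hypothetical stand-ins for the
# metric's per-claim FaithfulnessVerdict objects.

def faithfulness_ratio(claim_supported: list[bool]) -> float:
    """Supported claims / total claims; treat a claim-free output as faithful."""
    if not claim_supported:
        return 1.0
    return sum(claim_supported) / len(claim_supported)

# Two claims backed by the retrieval context, one contradicted:
print(faithfulness_ratio([True, True, False]))  # 0.666...
```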

### Contextual Recall Metric

Measures whether the retrieved context contains all the information needed to answer the question. Evaluates the completeness of the retrieval system.

```python { .api }
class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed to answer.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Recall score (0-1)
    - reason (str): Explanation of what's missing, if anything
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRecallMetric(threshold=0.7)

# Test case with expected output
test_case = LLMTestCase(
    input="How do I reset my password?",
    expected_output="Click 'Forgot Password' on the login page and check your email for the reset link.",
    retrieval_context=[
        "Password reset: Click 'Forgot Password' on login page",
        "Reset link sent to registered email address"
    ]
)

# Evaluate recall
metric.measure(test_case)

if not metric.success:
    print(f"Missing information: {metric.reason}")
```
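
Conceptually, the recall score is the share of statements in the expected output that can be attributed to some retrieved node. A sketch of that ratio with hypothetical attribution flags:

```python
# Sketch: contextual recall as the fraction of expected-output statements
# attributable to at least one retrieved node. The flags are hypothetical;
# the real metric derives them with an LLM judge.

def contextual_recall_ratio(attributable: list[bool]) -> float:
    """Attributable statements / total statements in the expected output."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)

# Both statements of the expected answer are covered by the context:
print(contextual_recall_ratio([True, True]))  # 1.0
```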

### Contextual Relevancy Metric

Measures whether the retrieved context is relevant to the input question. Evaluates the precision of the retrieval system by identifying irrelevant context.

```python { .api }
class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant context
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRelevancyMetric(threshold=0.7)

# Test case
test_case = LLMTestCase(
    input="What are the shipping costs to California?",
    retrieval_context=[
        "Shipping to California: $5.99 for standard, $12.99 for express",
        "California has over 39 million residents",  # Irrelevant
        "Free shipping on orders over $50"
    ]
)

# Evaluate relevancy
metric.measure(test_case)

if not metric.success:
    print(f"Irrelevant context detected: {metric.reason}")
```

### Contextual Precision Metric

Measures whether relevant context nodes are ranked higher than irrelevant ones in the retrieval context. Evaluates the ranking quality of the retrieval system.

```python { .api }
class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT (order matters)

    Attributes:
    - score (float): Precision score (0-1)
    - reason (str): Explanation of ranking issues
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualPrecisionMetric(threshold=0.7)

# Test case with ordered retrieval context
test_case = LLMTestCase(
    input="What is the return policy?",
    expected_output="30-day return policy with full refund",
    retrieval_context=[
        "California sales tax rate is 7.25%",           # Irrelevant (ranked too high)
        "All products have a 30-day return policy",     # Relevant (should be first)
        "Returns are processed within 5 business days"  # Relevant
    ]
)

# Evaluate precision
metric.measure(test_case)

if not metric.success:
    print(f"Ranking issue: {metric.reason}")
```
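
Because order matters here, a common way to score ranking quality (the RAGAS-style formulation) is the mean of precision@k over the ranks that hold a relevant node, so relevant nodes ranked below irrelevant ones drag the score down. A sketch, assuming binary relevance labels per node (the real metric infers relevance from the input and expected output):

```python
# Sketch: rank-aware precision as the mean of precision@k taken at each
# rank k that holds a relevant node. The relevance labels are hypothetical.

def rank_aware_precision(relevance: list[bool]) -> float:
    precisions = []
    relevant_seen = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # precision@k
    return sum(precisions) / len(precisions) if precisions else 0.0

# Irrelevant node ranked first (as in the test case above):
print(rank_aware_precision([False, True, True]))  # ~0.583
# Same nodes with the relevant ones first:
print(rank_aware_precision([True, True, False]))  # 1.0
```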

## Combined RAG Evaluation

Evaluate all RAG aspects together:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
from deepeval.test_case import LLMTestCase

# Create comprehensive RAG metrics
rag_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7)
]

# Test cases for the RAG pipeline.
# rag_pipeline and get_retrieval_context stand in for your own
# generation and retrieval functions.
test_cases = [
    LLMTestCase(
        input="What's the shipping policy?",
        actual_output=rag_pipeline("What's the shipping policy?"),
        expected_output="Free shipping on orders over $50, 3-5 business days",
        retrieval_context=get_retrieval_context("What's the shipping policy?")
    ),
    # ... more test cases
]

# Evaluate the entire RAG pipeline
result = evaluate(test_cases, rag_metrics)

# Analyze results by metric type
for metric_name in ["Answer Relevancy", "Faithfulness", "Contextual Recall",
                    "Contextual Relevancy", "Contextual Precision"]:
    scores = [tr.metrics[metric_name].score for tr in result.test_results]
    avg_score = sum(scores) / len(scores)
    print(f"{metric_name}: {avg_score:.2f}")
```

## Metric Customization

Customize metrics with specific models and configurations:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

# Use a specific model for evaluation
custom_model = GPTModel(model="gpt-4-turbo")

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.75,
    model=custom_model,
    include_reason=True,
    strict_mode=True,   # More stringent evaluation
    verbose_mode=True   # Print detailed logs
)

faithfulness = FaithfulnessMetric(
    threshold=0.85,
    model=custom_model
)

# Use in evaluation
test_case = LLMTestCase(...)
answer_relevancy.measure(test_case)
faithfulness.measure(test_case)
```

## RAGAS Composite Score

While DeepEval provides individual RAG metrics, you can compute a RAGAS-style composite score:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric
)

# Evaluate with RAGAS component metrics
result = evaluate(test_cases, [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric()
])

# Compute the RAGAS score (harmonic mean of component scores)
for test_result in result.test_results:
    scores = [m.score for m in test_result.metrics.values()]
    if any(s == 0 for s in scores):
        ragas_score = 0.0  # the harmonic mean collapses to 0 if any component is 0
    else:
        ragas_score = len(scores) / sum(1 / s for s in scores)
    print(f"RAGAS Score: {ragas_score:.3f}")
```
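
The harmonic mean can be factored into a small helper. Note that it collapses to 0 whenever any component score is 0, which is exactly why it punishes a pipeline that fails on any single dimension:

```python
def ragas_composite(scores: list[float]) -> float:
    """Harmonic mean of component scores (0.0 if any component is 0)."""
    if not scores or any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)

# A weak faithfulness score drags the composite well below the arithmetic mean:
print(ragas_composite([0.95, 0.4, 0.9, 0.9]))   # ~0.69 (arithmetic mean: ~0.79)
print(ragas_composite([0.9, 0.9, 0.9, 0.9]))    # ~0.9
```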