# RAG Metrics

Metrics specifically designed for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics measure answer quality, faithfulness to context, and retrieval effectiveness using LLM-based evaluation.

## Imports

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
```

## Capabilities

### Answer Relevancy Metric

Measures whether the answer is relevant to the input question. Evaluates if the LLM's response addresses what was asked.

```python { .api }
class AnswerRelevancyMetric:
    """
    Measures whether the answer is relevant to the input question.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[AnswerRelevancyTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    - statements (List[str]): Generated statements from actual output
    - verdicts (List[AnswerRelevancyVerdict]): Verdicts for each statement
    """
```

Usage example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It's known as the City of Light."
)

# Evaluate
metric.measure(test_case)

print(f"Score: {metric.score}")      # e.g., 0.95
print(f"Reason: {metric.reason}")    # Explanation
print(f"Success: {metric.success}")  # True if score >= 0.7
```

### Faithfulness Metric

Measures whether the answer is faithful to the context, detecting hallucinations by checking if all claims in the output are supported by the provided context.

```python { .api }
class FaithfulnessMetric:
    """
    Measures whether the answer is faithful to the context (no hallucinations).

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - truths_extraction_limit (int, optional): Limit number of truths extracted from context
    - penalize_ambiguous_claims (bool): Penalize ambiguous claims (default: False)
    - evaluation_template (Type[FaithfulnessTemplate], optional): Custom evaluation template

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - RETRIEVAL_CONTEXT or CONTEXT

    Attributes:
    - score (float): Faithfulness score (0-1)
    - reason (str): Explanation with unfaithful claims if any
    - success (bool): Whether score meets threshold
    - truths (List[str]): Extracted truths from context
    - claims (List[str]): Extracted claims from output
    - verdicts (List[FaithfulnessVerdict]): Verdicts for each claim
    """
```

Usage example:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = FaithfulnessMetric(threshold=0.8)

# Test case with retrieval context
test_case = LLMTestCase(
    input="What is the refund policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days."
    ]
)

# Evaluate faithfulness
metric.measure(test_case)

if metric.success:
    print("Output is faithful to context")
else:
    print(f"Hallucination detected: {metric.reason}")
```
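Internally, the score is the fraction of extracted claims that the retrieval context does not contradict. A minimal pure-Python sketch of that final aggregation step (the claim extraction and per-claim verdicts come from the evaluation LLM; the `"yes"`/`"no"`/`"idk"` verdict labels here are an assumption for illustration, not deepeval's exact internals):

```python
def aggregate_faithfulness(verdicts):
    """Fraction of claims not contradicted by the context.

    verdicts: one label per extracted claim -- "yes" (supported),
    "no" (contradicted), or "idk" (ambiguous). In this sketch only
    outright contradictions count against the score.
    """
    if not verdicts:
        return 1.0  # no claims means nothing to contradict
    contradicted = sum(1 for v in verdicts if v.lower() == "no")
    return 1 - contradicted / len(verdicts)

# Three of four claims are supported or ambiguous:
print(aggregate_faithfulness(["yes", "yes", "no", "idk"]))  # 0.75
```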

### Contextual Recall Metric

Measures whether the retrieved context contains all information needed to answer the question. Evaluates the completeness of the retrieval system.

```python { .api }
class ContextualRecallMetric:
    """
    Measures whether the retrieved context contains all information needed to answer.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Recall score (0-1)
    - reason (str): Explanation of what's missing if any
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRecallMetric(threshold=0.7)

# Test case with expected output
test_case = LLMTestCase(
    input="How do I reset my password?",
    expected_output="Click 'Forgot Password' on the login page and check your email for the reset link.",
    retrieval_context=[
        "Password reset: Click 'Forgot Password' on login page",
        "Reset link sent to registered email address"
    ]
)

# Evaluate recall
metric.measure(test_case)

if not metric.success:
    print(f"Missing information: {metric.reason}")
```

### Contextual Relevancy Metric

Measures whether the retrieved context is relevant to the input question. Evaluates the precision of the retrieval system by identifying irrelevant context.

```python { .api }
class ContextualRelevancyMetric:
    """
    Measures whether the retrieved context is relevant to the input.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - RETRIEVAL_CONTEXT

    Attributes:
    - score (float): Relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant context
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualRelevancyMetric(threshold=0.7)

# Test case
test_case = LLMTestCase(
    input="What are the shipping costs to California?",
    retrieval_context=[
        "Shipping to California: $5.99 for standard, $12.99 for express",
        "California has over 39 million residents",  # Irrelevant
        "Free shipping on orders over $50"
    ]
)

# Evaluate relevancy
metric.measure(test_case)

if not metric.success:
    print(f"Irrelevant context detected: {metric.reason}")
```

### Contextual Precision Metric

Measures whether relevant context nodes are ranked higher than irrelevant ones in the retrieval context. Evaluates the ranking quality of the retrieval system.

```python { .api }
class ContextualPrecisionMetric:
    """
    Measures whether relevant context nodes are ranked higher than irrelevant ones.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Required Test Case Parameters:
    - INPUT
    - EXPECTED_OUTPUT
    - RETRIEVAL_CONTEXT (order matters)

    Attributes:
    - score (float): Precision score (0-1)
    - reason (str): Explanation of ranking issues
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Create metric
metric = ContextualPrecisionMetric(threshold=0.7)

# Test case with ordered retrieval context
test_case = LLMTestCase(
    input="What is the return policy?",
    expected_output="30-day return policy with full refund",
    retrieval_context=[
        "California sales tax rate is 7.25%",           # Irrelevant (ranked too high)
        "All products have a 30-day return policy",     # Relevant (should be first)
        "Returns are processed within 5 business days"  # Relevant
    ]
)

# Evaluate precision
metric.measure(test_case)

if not metric.success:
    print(f"Ranking issue: {metric.reason}")
```
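Why ranking order matters can be illustrated with a small average-precision sketch over binary relevance judgments. This is a common weighting for rank-sensitive retrieval metrics, shown here as an illustration rather than deepeval's exact implementation:

```python
def average_precision(relevant):
    """Average precision over ranked retrieval nodes.

    relevant: booleans in retrieval order (True = relevant node).
    Each relevant node at rank k contributes precision@k, so rankings
    that place relevant nodes first score highest.
    """
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

# Same nodes, different ranking:
print(average_precision([True, True, False]))   # 1.0 -- relevant nodes first
print(average_precision([False, True, True]))   # ~0.58 -- irrelevant node ranked first
```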

## Combined RAG Evaluation

Evaluate all RAG aspects together:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric
)
from deepeval.test_case import LLMTestCase

# Create comprehensive RAG metrics
rag_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7)
]

# Test cases for RAG pipeline
test_cases = [
    LLMTestCase(
        input="What's the shipping policy?",
        actual_output=rag_pipeline("What's the shipping policy?"),
        expected_output="Free shipping on orders over $50, 3-5 business days",
        retrieval_context=get_retrieval_context("What's the shipping policy?")
    ),
    # ... more test cases
]

# Evaluate entire RAG pipeline
result = evaluate(test_cases, rag_metrics)

# Analyze results by metric type
for metric_name in ["Answer Relevancy", "Faithfulness", "Contextual Recall",
                    "Contextual Relevancy", "Contextual Precision"]:
    scores = [tr.metrics[metric_name].score for tr in result.test_results]
    avg_score = sum(scores) / len(scores)
    print(f"{metric_name}: {avg_score:.2f}")
```

## Metric Customization

Customize metrics with specific models and configurations:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

# Use a specific model for evaluation
custom_model = GPTModel(model="gpt-4-turbo")

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.75,
    model=custom_model,
    include_reason=True,
    strict_mode=True,   # More stringent evaluation
    verbose_mode=True   # Print detailed logs
)

faithfulness = FaithfulnessMetric(
    threshold=0.85,
    model=custom_model
)

# Use in evaluation
test_case = LLMTestCase(...)
answer_relevancy.measure(test_case)
faithfulness.measure(test_case)
```

## RAGAS Composite Score

While DeepEval provides individual RAG metrics, you can compute a RAGAS-style composite score:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric
)

# Evaluate with RAGAS component metrics
result = evaluate(test_cases, [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric()
])

# Compute RAGAS score (harmonic mean of component scores)
for test_result in result.test_results:
    scores = [m.score for m in test_result.metrics.values()]
    if any(s == 0 for s in scores):
        ragas_score = 0.0  # harmonic mean is 0 when any component is 0
    else:
        ragas_score = len(scores) / sum(1 / s for s in scores)
    print(f"RAGAS Score: {ragas_score:.3f}")
```
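The harmonic-mean step can be factored into a small dependency-free helper that makes the zero-score edge case explicit (the helper name is ours, not part of deepeval):

```python
def ragas_composite(scores):
    """Harmonic mean of component metric scores.

    The harmonic mean is dominated by the weakest component, so one
    poor score (e.g. low faithfulness) drags the composite down far
    more than an arithmetic mean would. Any zero yields 0.0.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)

# One weak component dominates the composite:
print(ragas_composite([0.9, 0.9, 0.9, 0.9]))  # ~0.9
print(ragas_composite([0.9, 0.9, 0.9, 0.5]))  # ~0.75, vs. arithmetic mean 0.8
```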
