
# Custom Metrics

Framework for creating custom evaluation metrics using G-Eval, DAG (Deep Acyclic Graph), or by extending base metric classes. Build metrics tailored to your specific evaluation needs.

## Imports

```python
from deepeval.metrics import GEval, DAGMetric, DeepAcyclicGraph
from deepeval.metrics import (
    BaseMetric,
    BaseConversationalMetric,
    BaseMultimodalMetric,
    BaseArenaMetric,
)
from deepeval.test_case import LLMTestCaseParams
```

## Capabilities

### G-Eval Metric

Customizable metric based on the G-Eval framework for LLM-based evaluation with custom criteria.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria description
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - threshold (float): Success threshold (default: 0.5)
    - top_logprobs (int): Number of log probabilities to consider (default: 20)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[GEvalTemplate]): Custom template (default: GEvalTemplate)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """

```
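The `top_logprobs` parameter reflects how the G-Eval framework (Liu et al., 2023) computes its final score: rather than taking the single rating token the judge emits, it weights each candidate rating by its token probability and takes the expectation. A stdlib-only sketch of that weighting (illustrative only; deepeval's internal rescaling into the 0-1 range is not shown):

```python
import math

def weighted_geval_score(rating_logprobs):
    """Expected rating under the judge model's token probabilities.

    rating_logprobs maps a candidate rating (e.g. 1-10) to the
    log-probability of its token, as recovered via top_logprobs.
    """
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    total = sum(probs.values())  # renormalise over the observed candidates
    return sum(r * p for r, p in probs.items()) / total

# Judge puts most mass on 8, some on 7 and 9: the expectation is 7.9
score = weighted_geval_score({7: math.log(0.2), 8: math.log(0.7), 9: math.log(0.1)})
```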


Usage example - Simple criteria:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create custom metric with simple criteria
coherence_metric = GEval(
    name="Coherence",
    criteria="Determine if the response is coherent and logically structured.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

test_case = LLMTestCase(
    input="Explain quantum computing",
    actual_output="Quantum computing uses quantum bits or qubits..."
)

coherence_metric.measure(test_case)
print(f"Coherence score: {coherence_metric.score:.2f}")
```

Usage example - With evaluation steps:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with detailed evaluation steps
completeness_metric = GEval(
    name="Answer Completeness",
    criteria="Evaluate if the answer completely addresses all parts of the question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Identify all parts of the question in the input",
        "Check if each part is addressed in the output",
        "Evaluate the depth and detail of each answer component",
        "Determine overall completeness score"
    ],
    threshold=0.8,
    model="gpt-4"
)

test_case = LLMTestCase(
    input="What is Python and what is it used for?",
    actual_output="Python is a high-level programming language. It's used for web development, data science, automation, and AI/ML applications."
)

completeness_metric.measure(test_case)
```

Usage example - With scoring rubric:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with detailed rubric
code_quality_metric = GEval(
    name="Code Quality",
    criteria="Evaluate the quality of the code solution.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    rubric={
        "Correctness": "Does the code solve the problem correctly?",
        "Efficiency": "Is the algorithm efficient?",
        "Readability": "Is the code well-structured and readable?",
        "Best Practices": "Does it follow Python best practices?"
    },
    threshold=0.8
)

test_case = LLMTestCase(
    input="Write a function to find the nth Fibonacci number",
    actual_output="""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
)

code_quality_metric.measure(test_case)
```

### DAG Metric

Deep Acyclic Graph metric for evaluating structured reasoning and multi-step processes.

```python { .api }
class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """

class DeepAcyclicGraph:
    """
    Helper class for DAG construction and validation.

    Methods:
    - add_node(id: str, description: str): Add a node to the DAG
    - add_edge(from_id: str, to_id: str): Add an edge between nodes
    - validate(): Validate DAG structure (no cycles)
    """

```
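The `validate()` check above amounts to cycle detection. A standard approach (illustrative only, not deepeval's actual implementation) is a three-colour depth-first search, where a back edge to a node still on the DFS stack signals a cycle:

```python
def has_cycle(edges):
    """edges maps a node id to the list of node ids it points to."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    colour = {}

    def dfs(node):
        colour[node] = GRAY
        for nxt in edges.get(node, []):
            c = colour.get(nxt, WHITE)
            if c == GRAY:
                return True  # back edge: nxt is still on the DFS stack
            if c == WHITE and dfs(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and dfs(n) for n in edges)

assert not has_cycle({"understand": ["analyze"], "analyze": ["plan"]})
assert has_cycle({"a": ["b"], "b": ["a"]})
```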


Usage example:

```python
from deepeval.metrics import DAGMetric, DeepAcyclicGraph
from deepeval.test_case import LLMTestCase

# Define reasoning DAG
reasoning_dag = DeepAcyclicGraph()

# Add nodes for reasoning steps
reasoning_dag.add_node("understand", "Understand the problem")
reasoning_dag.add_node("analyze", "Analyze requirements")
reasoning_dag.add_node("plan", "Create solution plan")
reasoning_dag.add_node("implement", "Implement solution")
reasoning_dag.add_node("verify", "Verify solution correctness")

# Define dependencies
reasoning_dag.add_edge("understand", "analyze")
reasoning_dag.add_edge("analyze", "plan")
reasoning_dag.add_edge("plan", "implement")
reasoning_dag.add_edge("implement", "verify")

# Create metric
dag_metric = DAGMetric(
    name="Problem Solving Process",
    dag=reasoning_dag,
    threshold=0.8
)

# Evaluate reasoning process
test_case = LLMTestCase(
    input="Solve: Find the maximum sum of a contiguous subarray",
    actual_output="""
First, I understand this is the maximum subarray problem.
Let me analyze: we need to find the subarray with largest sum.
I'll plan to use Kadane's algorithm for O(n) solution.
Here's the implementation: [code]
Verifying: tested with [-2,1,-3,4,-1,2,1,-5,4], got 6 (correct).
"""
)

dag_metric.measure(test_case)
print(f"Reasoning process score: {dag_metric.score:.2f}")
```

### Arena G-Eval

G-Eval for arena-style comparison between multiple outputs.

```python { .api }
class ArenaGEval:
    """
    Arena-style comparison using G-Eval methodology.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - winner (str): Name of winning contestant
    - reason (str): Explanation of why winner was chosen
    - success (bool): Always True after evaluation
    """
```

Usage example:

```python
from deepeval.metrics import ArenaGEval
from deepeval.test_case import ArenaTestCase, LLMTestCase

# Create arena metric
arena_metric = ArenaGEval(
    name="Response Quality",
    criteria="Determine which response is more helpful, accurate, and well-written"
)

# Compare multiple model outputs
arena_test = ArenaTestCase(
    contestants={
        "model_a": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are computational models inspired by biological brains..."
        ),
        "model_b": LLMTestCase(
            input="Explain neural networks",
            actual_output="A neural network is like... umm... it's a type of AI thing..."
        ),
        "model_c": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are ML models with interconnected layers..."
        )
    }
)

arena_metric.measure(arena_test)
print(f"Winner: {arena_metric.winner}")
print(f"Reason: {arena_metric.reason}")
```

### Base Metric Classes

Extend base classes to create fully custom metrics.

```python { .api }
class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Attributes:
    - threshold (float): Threshold for success
    - score (float, optional): Score from evaluation
    - reason (str, optional): Reason for the score
    - success (bool, optional): Whether the metric passed
    - strict_mode (bool): Whether to use strict mode
    - async_mode (bool): Whether to use async mode
    - verbose_mode (bool): Whether to use verbose mode

    Abstract Methods:
    - measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseConversationalMetric:
    """
    Base class for conversational metrics.

    Abstract Methods:
    - measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - a_measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseMultimodalMetric:
    """
    Base class for multimodal metrics.

    Abstract Methods:
    - measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseArenaMetric:
    """
    Base class for arena-style comparison metrics.

    Abstract Methods:
    - measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - a_measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - is_successful() -> bool
    """
```

Usage example - Custom metric:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class WordCountMetric(BaseMetric):
    """Custom metric to check if response meets word count requirements."""

    def __init__(self, min_words: int, max_words: int, threshold: float = 1.0):
        self.min_words = min_words
        self.max_words = max_words
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        """Measure if word count is within range."""
        words = len(test_case.actual_output.split())

        if self.min_words <= words <= self.max_words:
            self.score = 1.0
            self.reason = f"Word count {words} is within range [{self.min_words}, {self.max_words}]"
        else:
            self.score = 0.0
            self.reason = f"Word count {words} is outside range [{self.min_words}, {self.max_words}]"

        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version of measure."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        """Check if metric passed."""
        return self.success

# Use custom metric
word_count_metric = WordCountMetric(min_words=50, max_words=100)

test_case = LLMTestCase(
    input="Write a brief summary of quantum computing",
    actual_output="Quantum computing uses quantum mechanics... " * 15  # 75 words
)

word_count_metric.measure(test_case)
print(f"Success: {word_count_metric.success}")
```

Advanced custom metric with LLM:

```python
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    """Custom metric to evaluate tone of response."""

    def __init__(self, expected_tone: str, threshold: float = 0.7):
        self.expected_tone = expected_tone
        self.threshold = threshold
        self.model = GPTModel(model="gpt-4")

    def measure(self, test_case: LLMTestCase) -> float:
        """Evaluate tone using LLM."""
        prompt = f"""
Evaluate if the following text has a {self.expected_tone} tone.
Rate from 0.0 to 1.0 where 1.0 means perfect tone match.

Text: {test_case.actual_output}

Provide ONLY a number between 0.0 and 1.0.
"""

        response = self.model.generate(prompt)
        self.score = float(response.strip())
        self.success = self.score >= self.threshold
        self.reason = f"Tone match score: {self.score:.2f} for {self.expected_tone} tone"

        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

# Use custom tone metric
friendly_tone = CustomToneMetric(expected_tone="friendly and professional")

test_case = LLMTestCase(
    input="Respond to customer complaint",
    actual_output="I sincerely apologize for the inconvenience. Let me help resolve this right away!"
)

friendly_tone.measure(test_case)

```
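One caveat with the sketch above: `float(response.strip())` raises if the judge wraps the number in prose (e.g. "Score: 0.8"). A slightly more defensive parser (illustrative; the fallback default is a per-metric choice, not library behavior):

```python
import re

def parse_score(response, default=0.0):
    """Extract the first number from a judge response and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return default  # no number found: fall back rather than raise
    return max(0.0, min(1.0, float(match.group())))

assert parse_score("0.85") == 0.85
assert parse_score("Score: 0.7 (friendly)") == 0.7
assert parse_score("no number here") == 0.0
```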

### Non-LLM Metrics

Simple pattern-based metrics without LLM evaluation.

```python { .api }
class ExactMatchMetric:
    """
    Simple exact string matching metric.

    Parameters:
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - EXPECTED_OUTPUT
    """

class PatternMatchMetric:
    """
    Pattern matching using regular expressions.

    Parameters:
    - pattern (str): Regular expression pattern
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    """
```

Usage example:

```python
from deepeval.metrics import ExactMatchMetric, PatternMatchMetric
from deepeval.test_case import LLMTestCase

# Exact match
exact_metric = ExactMatchMetric()
test_case = LLMTestCase(
    input="What is 2+2?",
    actual_output="4",
    expected_output="4"
)
exact_metric.measure(test_case)

# Pattern match
email_pattern = PatternMatchMetric(pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
test_case = LLMTestCase(
    input="Extract email",
    actual_output="Contact us at support@example.com"
)
email_pattern.measure(test_case)
print(f"Email found: {email_pattern.success}")

```
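
Both metrics reduce to plain string and regex checks; a stdlib sketch of the scoring logic (illustrative, not the library's code):

```python
import re

def exact_match_score(actual, expected):
    """1.0 on an exact match after trimming surrounding whitespace, else 0.0."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def pattern_match_score(actual, pattern):
    """1.0 if the regex matches anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, actual) else 0.0

assert exact_match_score("4", "4") == 1.0
assert pattern_match_score(
    "Contact us at support@example.com",
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
) == 1.0
```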
