# Custom Metrics

Framework for creating custom evaluation metrics using G-Eval, DAG (Deep Acyclic Graph), or by extending base metric classes. Build metrics tailored to your specific evaluation needs.

## Imports

```python
from deepeval.metrics import GEval, DAGMetric, DeepAcyclicGraph
from deepeval.metrics import (
    BaseMetric,
    BaseConversationalMetric,
    BaseMultimodalMetric,
    BaseArenaMetric,
)
from deepeval.test_case import LLMTestCaseParams
```

## Capabilities

### G-Eval Metric

Customizable metric based on the G-Eval framework for LLM-based evaluation with custom criteria.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria description
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - threshold (float): Success threshold (default: 0.5)
    - top_logprobs (int): Number of log probabilities to consider (default: 20)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[GEvalTemplate]): Custom template (default: GEvalTemplate)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """
```
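
The `top_logprobs` parameter reflects how G-Eval computes scores: rather than taking the judge's single score token at face value, the numeric score tokens in the model's top log-probabilities are probability-weighted, as described in the G-Eval paper. The following is a standalone sketch of that weighting step (illustrative only, not deepeval's internal code):

```python
import math

def weighted_score(token_logprobs: dict[str, float]) -> float:
    """Probability-weighted average over candidate score tokens,
    as in the G-Eval paper: sum(p(s) * s) / sum(p(s))."""
    total_prob = 0.0
    weighted = 0.0
    for token, logprob in token_logprobs.items():
        try:
            score = float(token)
        except ValueError:
            continue  # skip non-numeric tokens in the top logprobs
        prob = math.exp(logprob)
        weighted += prob * score
        total_prob += prob
    return weighted / total_prob if total_prob else 0.0

# Judge puts 60% mass on "8", 30% on "7", 10% on "9" (log domain)
logprobs = {"8": math.log(0.6), "7": math.log(0.3), "9": math.log(0.1)}
print(weighted_score(logprobs))  # ≈ 7.8
```

This smooths out the judge's discreteness: two runs that would round to different integer scores still produce a stable continuous value.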

Usage example - Simple criteria:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create custom metric with simple criteria
coherence_metric = GEval(
    name="Coherence",
    criteria="Determine if the response is coherent and logically structured.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

test_case = LLMTestCase(
    input="Explain quantum computing",
    actual_output="Quantum computing uses quantum bits or qubits..."
)

coherence_metric.measure(test_case)
print(f"Coherence score: {coherence_metric.score:.2f}")
```

Usage example - With evaluation steps:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with detailed evaluation steps
completeness_metric = GEval(
    name="Answer Completeness",
    criteria="Evaluate if the answer completely addresses all parts of the question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Identify all parts of the question in the input",
        "Check if each part is addressed in the output",
        "Evaluate the depth and detail of each answer component",
        "Determine overall completeness score"
    ],
    threshold=0.8,
    model="gpt-4"
)

test_case = LLMTestCase(
    input="What is Python and what is it used for?",
    actual_output="Python is a high-level programming language. It's used for web development, data science, automation, and AI/ML applications."
)

completeness_metric.measure(test_case)
```

Usage example - With scoring rubric:

```python
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with a scoring rubric. Per the API above, rubric takes a
# List[Rubric]; each Rubric pairs a score range with its expected outcome.
code_quality_metric = GEval(
    name="Code Quality",
    criteria="Evaluate the quality of the code solution.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    rubric=[
        Rubric(score_range=(0, 3), expected_outcome="Code is incorrect or does not solve the problem."),
        Rubric(score_range=(4, 6), expected_outcome="Code is correct but inefficient or hard to read."),
        Rubric(score_range=(7, 8), expected_outcome="Code is correct, efficient, and readable."),
        Rubric(score_range=(9, 10), expected_outcome="Code is correct, readable, and follows Python best practices."),
    ],
    threshold=0.8
)

test_case = LLMTestCase(
    input="Write a function to find the nth Fibonacci number",
    actual_output="""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
)

code_quality_metric.measure(test_case)
```

### DAG Metric

Deep Acyclic Graph metric for evaluating structured reasoning and multi-step processes.

```python { .api }
class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """

class DeepAcyclicGraph:
    """
    Helper class for DAG construction and validation.

    Methods:
    - add_node(id: str, description: str): Add a node to the DAG
    - add_edge(from_id: str, to_id: str): Add an edge between nodes
    - validate(): Validate DAG structure (no cycles)
    """
```
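
Independently of deepeval, the acyclicity check that `validate()` performs can be sketched in pure Python using Kahn's algorithm (a hypothetical standalone helper, not the library's implementation):

```python
from collections import defaultdict, deque

def is_acyclic(edges: list[tuple[str, str]]) -> bool:
    """Kahn's algorithm: a graph is acyclic iff every node can be
    removed in topological order (i.e. all nodes reach in-degree 0)."""
    indegree: dict[str, int] = defaultdict(int)
    adjacency: dict[str, list[str]] = defaultdict(list)
    nodes: set[str] = set()
    for src, dst in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in adjacency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # If a cycle exists, its nodes never reach in-degree 0
    return visited == len(nodes)

chain = [("understand", "analyze"), ("analyze", "plan"), ("plan", "implement")]
print(is_acyclic(chain))                                  # True
print(is_acyclic(chain + [("implement", "understand")]))  # False
```

A linear reasoning chain passes; adding a back-edge creates a cycle and fails validation.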

Usage example:

```python
from deepeval.metrics import DAGMetric, DeepAcyclicGraph
from deepeval.test_case import LLMTestCase

# Define reasoning DAG
reasoning_dag = DeepAcyclicGraph()

# Add nodes for reasoning steps
reasoning_dag.add_node("understand", "Understand the problem")
reasoning_dag.add_node("analyze", "Analyze requirements")
reasoning_dag.add_node("plan", "Create solution plan")
reasoning_dag.add_node("implement", "Implement solution")
reasoning_dag.add_node("verify", "Verify solution correctness")

# Define dependencies
reasoning_dag.add_edge("understand", "analyze")
reasoning_dag.add_edge("analyze", "plan")
reasoning_dag.add_edge("plan", "implement")
reasoning_dag.add_edge("implement", "verify")

# Create metric
dag_metric = DAGMetric(
    name="Problem Solving Process",
    dag=reasoning_dag,
    threshold=0.8
)

# Evaluate reasoning process
test_case = LLMTestCase(
    input="Solve: Find the maximum sum of a contiguous subarray",
    actual_output="""
    First, I understand this is the maximum subarray problem.
    Let me analyze: we need to find the subarray with largest sum.
    I'll plan to use Kadane's algorithm for O(n) solution.
    Here's the implementation: [code]
    Verifying: tested with [-2,1,-3,4,-1,2,1,-5,4], got 6 (correct).
    """
)

dag_metric.measure(test_case)
print(f"Reasoning process score: {dag_metric.score:.2f}")
```

### Arena G-Eval

G-Eval for arena-style comparison between multiple outputs.

```python { .api }
class ArenaGEval:
    """
    Arena-style comparison using G-Eval methodology.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - winner (str): Name of winning contestant
    - reason (str): Explanation of why winner was chosen
    - success (bool): Always True after evaluation
    """
```

Usage example:

```python
from deepeval.metrics import ArenaGEval
from deepeval.test_case import ArenaTestCase, LLMTestCase

# Create arena metric
arena_metric = ArenaGEval(
    name="Response Quality",
    criteria="Determine which response is more helpful, accurate, and well-written"
)

# Compare multiple model outputs
arena_test = ArenaTestCase(
    contestants={
        "model_a": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are computational models inspired by biological brains..."
        ),
        "model_b": LLMTestCase(
            input="Explain neural networks",
            actual_output="A neural network is like... umm... it's a type of AI thing..."
        ),
        "model_c": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are ML models with interconnected layers..."
        )
    }
)

arena_metric.measure(arena_test)
print(f"Winner: {arena_metric.winner}")
print(f"Reason: {arena_metric.reason}")
```

### Base Metric Classes

Extend base classes to create fully custom metrics.

```python { .api }
class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Attributes:
    - threshold (float): Threshold for success
    - score (float, optional): Score from evaluation
    - reason (str, optional): Reason for the score
    - success (bool, optional): Whether the metric passed
    - strict_mode (bool): Whether to use strict mode
    - async_mode (bool): Whether to use async mode
    - verbose_mode (bool): Whether to use verbose mode

    Abstract Methods:
    - measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseConversationalMetric:
    """
    Base class for conversational metrics.

    Abstract Methods:
    - measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - a_measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseMultimodalMetric:
    """
    Base class for multimodal metrics.

    Abstract Methods:
    - measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseArenaMetric:
    """
    Base class for arena-style comparison metrics.

    Abstract Methods:
    - measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - a_measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - is_successful() -> bool
    """
```

Usage example - Custom metric:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class WordCountMetric(BaseMetric):
    """Custom metric to check if response meets word count requirements."""

    def __init__(self, min_words: int, max_words: int, threshold: float = 1.0):
        self.min_words = min_words
        self.max_words = max_words
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        """Measure if word count is within range."""
        words = len(test_case.actual_output.split())

        if self.min_words <= words <= self.max_words:
            self.score = 1.0
            self.reason = f"Word count {words} is within range [{self.min_words}, {self.max_words}]"
        else:
            self.score = 0.0
            self.reason = f"Word count {words} is outside range [{self.min_words}, {self.max_words}]"

        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version of measure."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        """Check if metric passed."""
        return self.success

# Use custom metric
word_count_metric = WordCountMetric(min_words=50, max_words=100)

test_case = LLMTestCase(
    input="Write a brief summary of quantum computing",
    actual_output="Quantum computing uses quantum mechanics... " * 15  # 75 words
)

word_count_metric.measure(test_case)
print(f"Success: {word_count_metric.success}")
```

Advanced custom metric with LLM:

```python
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    """Custom metric to evaluate tone of response."""

    def __init__(self, expected_tone: str, threshold: float = 0.7):
        self.expected_tone = expected_tone
        self.threshold = threshold
        self.model = GPTModel(model="gpt-4")

    def measure(self, test_case: LLMTestCase) -> float:
        """Evaluate tone using LLM."""
        prompt = f"""
        Evaluate if the following text has a {self.expected_tone} tone.
        Rate from 0.0 to 1.0 where 1.0 means perfect tone match.

        Text: {test_case.actual_output}

        Provide ONLY a number between 0.0 and 1.0.
        """

        response = self.model.generate(prompt)
        self.score = float(response.strip())
        self.success = self.score >= self.threshold
        self.reason = f"Tone match score: {self.score:.2f} for {self.expected_tone} tone"

        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

# Use custom tone metric
friendly_tone = CustomToneMetric(expected_tone="friendly and professional")

test_case = LLMTestCase(
    input="Respond to customer complaint",
    actual_output="I sincerely apologize for the inconvenience. Let me help resolve this right away!"
)

friendly_tone.measure(test_case)
```
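
One caveat with the example above: `float(response.strip())` raises `ValueError` if the judge wraps the number in extra text ("Score: 0.7"). A defensive parser that extracts the first numeric token and clamps it to [0, 1] makes such metrics more robust (a hypothetical helper, sketched in pure Python):

```python
import re

def parse_score(response: str, default: float = 0.0) -> float:
    """Extract the first numeric token from an LLM judge's response
    and clamp it to [0.0, 1.0]; fall back to `default` if none found."""
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))

print(parse_score("0.85"))                     # 0.85
print(parse_score("Score: 0.7 (good match)"))  # 0.7
print(parse_score("no number here"))           # 0.0
```

Inside `measure`, replacing `float(response.strip())` with `parse_score(response)` keeps the metric from crashing on chatty judge output.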

### Non-LLM Metrics

Simple pattern-based metrics without LLM evaluation.

```python { .api }
class ExactMatchMetric:
    """
    Simple exact string matching metric.

    Parameters:
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - EXPECTED_OUTPUT
    """

class PatternMatchMetric:
    """
    Pattern matching using regular expressions.

    Parameters:
    - pattern (str): Regular expression pattern
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    """
```

Usage example:

```python
from deepeval.metrics import ExactMatchMetric, PatternMatchMetric
from deepeval.test_case import LLMTestCase

# Exact match
exact_metric = ExactMatchMetric()
test_case = LLMTestCase(
    input="What is 2+2?",
    actual_output="4",
    expected_output="4"
)
exact_metric.measure(test_case)

# Pattern match
email_pattern = PatternMatchMetric(pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
test_case = LLMTestCase(
    input="Extract email",
    actual_output="Contact us at support@example.com"
)
email_pattern.measure(test_case)
print(f"Email found: {email_pattern.success}")
```
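
Under the hood, these checks reduce to plain string and regex operations; a minimal standalone sketch of the scoring logic (not the deepeval classes themselves):

```python
import re

def exact_match_score(actual: str, expected: str) -> float:
    """1.0 on exact string equality, else 0.0."""
    return 1.0 if actual == expected else 0.0

def pattern_match_score(actual: str, pattern: str) -> float:
    """1.0 if the regex matches anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, actual) else 0.0

print(exact_match_score("4", "4"))  # 1.0
print(pattern_match_score("Contact us at support@example.com",
                          r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"))  # 1.0
```

Because they never call an LLM, these metrics are deterministic, free, and fast, which makes them good guardrails to run before the more expensive G-Eval and DAG metrics.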