
# Custom Metrics

Framework for creating custom evaluation metrics using G-Eval, DAG (Deep Acyclic Graph), or by extending base metric classes. Build metrics tailored to your specific evaluation needs.

## Imports

```python
from deepeval.metrics import GEval, DAGMetric, DeepAcyclicGraph
from deepeval.metrics import (
    BaseMetric,
    BaseConversationalMetric,
    BaseMultimodalMetric,
    BaseArenaMetric,
)
from deepeval.test_case import LLMTestCaseParams
```

## Capabilities

### G-Eval Metric

Customizable metric based on the G-Eval framework for LLM-based evaluation with custom criteria.

```python { .api }
class GEval:
    """
    Customizable metric based on the G-Eval framework for LLM evaluation.

    Parameters:
    - name (str): Name of the metric
    - evaluation_params (List[LLMTestCaseParams]): Parameters to evaluate
    - criteria (str, optional): Evaluation criteria description
    - evaluation_steps (List[str], optional): Steps for evaluation
    - rubric (List[Rubric], optional): Scoring rubric
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - threshold (float): Success threshold (default: 0.5)
    - top_logprobs (int): Number of log probabilities to consider (default: 20)
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)
    - evaluation_template (Type[GEvalTemplate]): Custom template (default: GEvalTemplate)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """

```
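The `top_logprobs` parameter reflects how the G-Eval framework (Liu et al., 2023) computes its final score: rather than taking the single rating token the judge emits, it weights each candidate rating by its token probability and takes the expectation. A stdlib-only sketch of that weighting (illustrative only; deepeval's internal rescaling into the 0-1 range is not shown):

```python
import math

def weighted_geval_score(rating_logprobs):
    """Expected rating under the judge model's token probabilities.

    rating_logprobs maps a candidate rating (e.g. 1-10) to the
    log-probability of its token, as recovered via top_logprobs.
    """
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    total = sum(probs.values())  # renormalise over the observed candidates
    return sum(r * p for r, p in probs.items()) / total

# Judge puts most mass on 8, some on 7 and 9: the expectation is 7.9
score = weighted_geval_score({7: math.log(0.2), 8: math.log(0.7), 9: math.log(0.1)})
```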


Usage example - Simple criteria:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create custom metric with simple criteria
coherence_metric = GEval(
    name="Coherence",
    criteria="Determine if the response is coherent and logically structured.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

test_case = LLMTestCase(
    input="Explain quantum computing",
    actual_output="Quantum computing uses quantum bits or qubits..."
)

coherence_metric.measure(test_case)
print(f"Coherence score: {coherence_metric.score:.2f}")
```

Usage example - With evaluation steps:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with detailed evaluation steps
completeness_metric = GEval(
    name="Answer Completeness",
    criteria="Evaluate if the answer completely addresses all parts of the question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Identify all parts of the question in the input",
        "Check if each part is addressed in the output",
        "Evaluate the depth and detail of each answer component",
        "Determine overall completeness score"
    ],
    threshold=0.8,
    model="gpt-4"
)

test_case = LLMTestCase(
    input="What is Python and what is it used for?",
    actual_output="Python is a high-level programming language. It's used for web development, data science, automation, and AI/ML applications."
)

completeness_metric.measure(test_case)
```

Usage example - With scoring rubric:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create metric with detailed rubric
code_quality_metric = GEval(
    name="Code Quality",
    criteria="Evaluate the quality of the code solution.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    rubric={
        "Correctness": "Does the code solve the problem correctly?",
        "Efficiency": "Is the algorithm efficient?",
        "Readability": "Is the code well-structured and readable?",
        "Best Practices": "Does it follow Python best practices?"
    },
    threshold=0.8
)

test_case = LLMTestCase(
    input="Write a function to find the nth Fibonacci number",
    actual_output="""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
)

code_quality_metric.measure(test_case)
```

### DAG Metric

Deep Acyclic Graph metric for evaluating structured reasoning and multi-step processes.

```python { .api }
class DAGMetric:
    """
    Deep Acyclic Graph metric for evaluating structured reasoning.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """

class DeepAcyclicGraph:
    """
    Helper class for DAG construction and validation.

    Methods:
    - add_node(id: str, description: str): Add a node to the DAG
    - add_edge(from_id: str, to_id: str): Add an edge between nodes
    - validate(): Validate DAG structure (no cycles)
    """

```
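The `validate()` check above amounts to cycle detection. A standard approach (illustrative only, not deepeval's actual implementation) is a three-colour depth-first search, where a back edge to a node still on the DFS stack signals a cycle:

```python
def has_cycle(edges):
    """edges maps a node id to the list of node ids it points to."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    colour = {}

    def dfs(node):
        colour[node] = GRAY
        for nxt in edges.get(node, []):
            c = colour.get(nxt, WHITE)
            if c == GRAY:
                return True  # back edge: nxt is still on the DFS stack
            if c == WHITE and dfs(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and dfs(n) for n in edges)

assert not has_cycle({"understand": ["analyze"], "analyze": ["plan"]})
assert has_cycle({"a": ["b"], "b": ["a"]})
```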


Usage example:

```python
from deepeval.metrics import DAGMetric, DeepAcyclicGraph
from deepeval.test_case import LLMTestCase

# Define reasoning DAG
reasoning_dag = DeepAcyclicGraph()

# Add nodes for reasoning steps
reasoning_dag.add_node("understand", "Understand the problem")
reasoning_dag.add_node("analyze", "Analyze requirements")
reasoning_dag.add_node("plan", "Create solution plan")
reasoning_dag.add_node("implement", "Implement solution")
reasoning_dag.add_node("verify", "Verify solution correctness")

# Define dependencies
reasoning_dag.add_edge("understand", "analyze")
reasoning_dag.add_edge("analyze", "plan")
reasoning_dag.add_edge("plan", "implement")
reasoning_dag.add_edge("implement", "verify")

# Create metric
dag_metric = DAGMetric(
    name="Problem Solving Process",
    dag=reasoning_dag,
    threshold=0.8
)

# Evaluate reasoning process
test_case = LLMTestCase(
    input="Solve: Find the maximum sum of a contiguous subarray",
    actual_output="""
First, I understand this is the maximum subarray problem.
Let me analyze: we need to find the subarray with largest sum.
I'll plan to use Kadane's algorithm for O(n) solution.
Here's the implementation: [code]
Verifying: tested with [-2,1,-3,4,-1,2,1,-5,4], got 6 (correct).
"""
)

dag_metric.measure(test_case)
print(f"Reasoning process score: {dag_metric.score:.2f}")
```

### Arena G-Eval

G-Eval for arena-style comparison between multiple outputs.

```python { .api }
class ArenaGEval:
    """
    Arena-style comparison using G-Eval methodology.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Attributes:
    - winner (str): Name of winning contestant
    - reason (str): Explanation of why winner was chosen
    - success (bool): Always True after evaluation
    """
```

Usage example:

```python
from deepeval.metrics import ArenaGEval
from deepeval.test_case import ArenaTestCase, LLMTestCase

# Create arena metric
arena_metric = ArenaGEval(
    name="Response Quality",
    criteria="Determine which response is more helpful, accurate, and well-written"
)

# Compare multiple model outputs
arena_test = ArenaTestCase(
    contestants={
        "model_a": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are computational models inspired by biological brains..."
        ),
        "model_b": LLMTestCase(
            input="Explain neural networks",
            actual_output="A neural network is like... umm... it's a type of AI thing..."
        ),
        "model_c": LLMTestCase(
            input="Explain neural networks",
            actual_output="Neural networks are ML models with interconnected layers..."
        )
    }
)

arena_metric.measure(arena_test)
print(f"Winner: {arena_metric.winner}")
print(f"Reason: {arena_metric.reason}")
```

### Base Metric Classes

Extend base classes to create fully custom metrics.

```python { .api }
class BaseMetric:
    """
    Base class for all LLM test case metrics.

    Attributes:
    - threshold (float): Threshold for success
    - score (float, optional): Score from evaluation
    - reason (str, optional): Reason for the score
    - success (bool, optional): Whether the metric passed
    - strict_mode (bool): Whether to use strict mode
    - async_mode (bool): Whether to use async mode
    - verbose_mode (bool): Whether to use verbose mode

    Abstract Methods:
    - measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: LLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseConversationalMetric:
    """
    Base class for conversational metrics.

    Abstract Methods:
    - measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - a_measure(test_case: ConversationalTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseMultimodalMetric:
    """
    Base class for multimodal metrics.

    Abstract Methods:
    - measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - a_measure(test_case: MLLMTestCase, *args, **kwargs) -> float
    - is_successful() -> bool
    """

class BaseArenaMetric:
    """
    Base class for arena-style comparison metrics.

    Abstract Methods:
    - measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - a_measure(test_case: ArenaTestCase, *args, **kwargs) -> str
    - is_successful() -> bool
    """
```

Usage example - Custom metric:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class WordCountMetric(BaseMetric):
    """Custom metric to check if response meets word count requirements."""

    def __init__(self, min_words: int, max_words: int, threshold: float = 1.0):
        self.min_words = min_words
        self.max_words = max_words
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        """Measure if word count is within range."""
        words = len(test_case.actual_output.split())

        if self.min_words <= words <= self.max_words:
            self.score = 1.0
            self.reason = f"Word count {words} is within range [{self.min_words}, {self.max_words}]"
        else:
            self.score = 0.0
            self.reason = f"Word count {words} is outside range [{self.min_words}, {self.max_words}]"

        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version of measure."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        """Check if metric passed."""
        return self.success

# Use custom metric
word_count_metric = WordCountMetric(min_words=50, max_words=100)

test_case = LLMTestCase(
    input="Write a brief summary of quantum computing",
    actual_output="Quantum computing uses quantum mechanics... " * 15  # 75 words
)

word_count_metric.measure(test_case)
print(f"Success: {word_count_metric.success}")
```

Advanced custom metric with LLM:

```python
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    """Custom metric to evaluate tone of response."""

    def __init__(self, expected_tone: str, threshold: float = 0.7):
        self.expected_tone = expected_tone
        self.threshold = threshold
        self.model = GPTModel(model="gpt-4")

    def measure(self, test_case: LLMTestCase) -> float:
        """Evaluate tone using LLM."""
        prompt = f"""
Evaluate if the following text has a {self.expected_tone} tone.
Rate from 0.0 to 1.0 where 1.0 means perfect tone match.

Text: {test_case.actual_output}

Provide ONLY a number between 0.0 and 1.0.
"""

        response = self.model.generate(prompt)
        self.score = float(response.strip())
        self.success = self.score >= self.threshold
        self.reason = f"Tone match score: {self.score:.2f} for {self.expected_tone} tone"

        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        """Async version."""
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

# Use custom tone metric
friendly_tone = CustomToneMetric(expected_tone="friendly and professional")

test_case = LLMTestCase(
    input="Respond to customer complaint",
    actual_output="I sincerely apologize for the inconvenience. Let me help resolve this right away!"
)

friendly_tone.measure(test_case)

```
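One caveat with the sketch above: `float(response.strip())` raises if the judge wraps the number in prose (e.g. "Score: 0.8"). A slightly more defensive parser (illustrative; the fallback default is a per-metric choice, not library behavior):

```python
import re

def parse_score(response, default=0.0):
    """Extract the first number from a judge response and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return default  # no number found: fall back rather than raise
    return max(0.0, min(1.0, float(match.group())))

assert parse_score("0.85") == 0.85
assert parse_score("Score: 0.7 (friendly)") == 0.7
assert parse_score("no number here") == 0.0
```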

### Non-LLM Metrics

Simple pattern-based metrics without LLM evaluation.

```python { .api }
class ExactMatchMetric:
    """
    Simple exact string matching metric.

    Parameters:
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    - EXPECTED_OUTPUT
    """

class PatternMatchMetric:
    """
    Pattern matching using regular expressions.

    Parameters:
    - pattern (str): Regular expression pattern
    - threshold (float): Success threshold (default: 1.0)

    Required Test Case Parameters:
    - ACTUAL_OUTPUT
    """
```

Usage example:

```python
from deepeval.metrics import ExactMatchMetric, PatternMatchMetric
from deepeval.test_case import LLMTestCase

# Exact match
exact_metric = ExactMatchMetric()
test_case = LLMTestCase(
    input="What is 2+2?",
    actual_output="4",
    expected_output="4"
)
exact_metric.measure(test_case)

# Pattern match
email_pattern = PatternMatchMetric(pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
test_case = LLMTestCase(
    input="Extract email",
    actual_output="Contact us at support@example.com"
)
email_pattern.measure(test_case)
print(f"Email found: {email_pattern.success}")

```
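
Both metrics reduce to plain string and regex checks; a stdlib sketch of the scoring logic (illustrative, not the library's code):

```python
import re

def exact_match_score(actual, expected):
    """1.0 on an exact match after trimming surrounding whitespace, else 0.0."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def pattern_match_score(actual, pattern):
    """1.0 if the regex matches anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, actual) else 0.0

assert exact_match_score("4", "4") == 1.0
assert pattern_match_score(
    "Contact us at support@example.com",
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
) == 1.0
```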
