# Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement. These metrics assess how well agents perform complex, multi-step tasks.

## Imports

```python
from deepeval.metrics import (
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    ToolUseMetric,
    PlanQualityMetric,
    PlanAdherenceMetric,
    StepEfficiencyMetric,
    GoalAccuracyMetric,
    MCPTaskCompletionMetric,
    MCPUseMetric,
)
```
20
## Capabilities
21
22
### Tool Correctness Metric
23
24
Evaluates whether the correct tools were called with correct parameters.
25
26
```python { .api }
27
class ToolCorrectnessMetric:
28
"""
29
Evaluates whether the correct tools were called with correct parameters.
30
31
Parameters:
32
- threshold (float): Success threshold (0-1, default: 0.5)
33
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
34
- include_reason (bool): Include reason in output (default: True)
35
- async_mode (bool): Async mode (default: True)
36
37
Required Test Case Parameters:
38
- TOOLS_CALLED
39
- EXPECTED_TOOLS
40
41
Attributes:
42
- score (float): Tool correctness score (0-1)
43
- reason (str): Explanation of tool usage issues
44
- success (bool): Whether score meets threshold
45
"""
46
```
47
48
Usage example:
49
50
```python
51
from deepeval.metrics import ToolCorrectnessMetric
52
from deepeval.test_case import LLMTestCase, ToolCall
53
54
metric = ToolCorrectnessMetric(threshold=0.8)
55
56
test_case = LLMTestCase(
57
input="What's the weather in New York and London?",
58
actual_output="New York: 72°F, sunny. London: 18°C, cloudy.",
59
tools_called=[
60
ToolCall(
61
name="get_weather",
62
input_parameters={"city": "New York", "unit": "fahrenheit"},
63
output={"temp": 72, "condition": "sunny"}
64
),
65
ToolCall(
66
name="get_weather",
67
input_parameters={"city": "London", "unit": "celsius"},
68
output={"temp": 18, "condition": "cloudy"}
69
)
70
],
71
expected_tools=[
72
ToolCall(name="get_weather", input_parameters={"city": "New York"}),
73
ToolCall(name="get_weather", input_parameters={"city": "London"})
74
]
75
)
76
77
metric.measure(test_case)
78
79
if metric.success:
80
print("Tools used correctly")
81
else:
82
print(f"Tool usage issues: {metric.reason}")
83
```
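To build intuition for what this metric measures, the core comparison can be approximated as the fraction of expected tool calls that were actually made. The sketch below is illustrative only — it is not deepeval's internal algorithm (which also weighs input parameters via an evaluation model) — and uses plain `(name, params)` tuples in place of `ToolCall` objects:

```python
# Simplified sketch of tool-correctness scoring: the fraction of expected
# tool names found among the tools actually called. Illustrative only --
# deepeval's real metric also compares input parameters.

def tool_correctness_score(tools_called, expected_tools):
    """Return the fraction of expected tool names present in tools_called."""
    called_names = [name for name, _params in tools_called]
    matched = sum(1 for name, _params in expected_tools if name in called_names)
    return matched / len(expected_tools) if expected_tools else 1.0

called = [("get_weather", {"city": "New York"}), ("get_weather", {"city": "London"})]
expected = [("get_weather", {"city": "New York"}), ("get_weather", {"city": "London"})]
print(tool_correctness_score(called, expected))  # → 1.0
```

A missing expected tool lowers the score proportionally, which mirrors how a sub-threshold `score` flips `success` to `False` above.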
84
85
### Task Completion Metric
86
87
Evaluates whether the task was completed successfully based on the goal.
88
89
```python { .api }
90
class TaskCompletionMetric:
91
"""
92
Evaluates whether the task was completed successfully.
93
94
Parameters:
95
- threshold (float): Success threshold (0-1, default: 0.5)
96
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
97
- include_reason (bool): Include reason in output (default: True)
98
99
Required Test Case Parameters:
100
- INPUT (task description)
101
- ACTUAL_OUTPUT (task result)
102
103
Attributes:
104
- score (float): Task completion score (0-1)
105
- reason (str): Explanation of completion status
106
- success (bool): Whether score meets threshold
107
"""
108
```
109
110
Usage example:
111
112
```python
113
from deepeval.metrics import TaskCompletionMetric
114
from deepeval.test_case import LLMTestCase
115
116
metric = TaskCompletionMetric(threshold=0.8)
117
118
test_case = LLMTestCase(
119
input="Book a flight from NYC to LAX for next Monday, find a hotel near LAX, and create an itinerary.",
120
actual_output="I've booked flight UA123 departing Monday 8am, reserved Hilton LAX for 3 nights, and created a 3-day itinerary including beach visits and city tours.",
121
tools_called=[
122
ToolCall(name="book_flight", input_parameters={"from": "NYC", "to": "LAX"}),
123
ToolCall(name="book_hotel", input_parameters={"location": "LAX"}),
124
ToolCall(name="create_itinerary", input_parameters={"days": 3})
125
]
126
)
127
128
metric.measure(test_case)
129
print(f"Task completion: {metric.score:.2f}")
130
```
131
132
### Tool Use Metric
133
134
Evaluates appropriate use of available tools.
135
136
```python { .api }
137
class ToolUseMetric:
138
"""
139
Evaluates appropriate use of available tools.
140
141
Parameters:
142
- threshold (float): Success threshold (0-1, default: 0.5)
143
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
144
- include_reason (bool): Include reason in output (default: True)
145
146
Required Test Case Parameters:
147
- INPUT
148
- ACTUAL_OUTPUT
149
- TOOLS_CALLED
150
151
Attributes:
152
- score (float): Tool use appropriateness score (0-1)
153
- reason (str): Explanation of tool usage
154
- success (bool): Whether score meets threshold
155
"""
156
```
157
158
Usage example:
159
160
```python
161
from deepeval.metrics import ToolUseMetric
162
from deepeval.test_case import LLMTestCase, ToolCall
163
164
metric = ToolUseMetric(threshold=0.7)
165
166
test_case = LLMTestCase(
167
input="Find restaurants near me and check the weather",
168
actual_output="Here are 5 restaurants near you. The weather is sunny, 75°F.",
169
tools_called=[
170
ToolCall(
171
name="search_restaurants",
172
input_parameters={"location": "current"}
173
),
174
ToolCall(
175
name="get_weather",
176
input_parameters={"location": "current"}
177
)
178
]
179
)
180
181
metric.measure(test_case)
182
```
183
184
### Plan Quality Metric
185
186
Evaluates the quality of generated plans.
187
188
```python { .api }
189
class PlanQualityMetric:
190
"""
191
Evaluates the quality of generated plans.
192
193
Parameters:
194
- threshold (float): Success threshold (0-1, default: 0.5)
195
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
196
- include_reason (bool): Include reason in output (default: True)
197
198
Required Test Case Parameters:
199
- INPUT (task/goal)
200
- ACTUAL_OUTPUT (plan)
201
202
Attributes:
203
- score (float): Plan quality score (0-1)
204
- reason (str): Explanation of plan quality
205
- success (bool): Whether score meets threshold
206
"""
207
```
208
209
Usage example:
210
211
```python
212
from deepeval.metrics import PlanQualityMetric
213
from deepeval.test_case import LLMTestCase
214
215
metric = PlanQualityMetric(threshold=0.7)
216
217
test_case = LLMTestCase(
218
input="Create a plan to migrate our database from MySQL to PostgreSQL",
219
actual_output="""
220
Plan:
221
1. Audit current MySQL schema and data
222
2. Set up PostgreSQL instance
223
3. Create migration scripts for schema
224
4. Test migration on staging environment
225
5. Execute migration during maintenance window
226
6. Validate data integrity post-migration
227
7. Monitor performance for 48 hours
228
"""
229
)
230
231
metric.measure(test_case)
232
print(f"Plan quality: {metric.score:.2f}")
233
```
234
235
### Plan Adherence Metric
236
237
Measures adherence to a predefined plan.
238
239
```python { .api }
240
class PlanAdherenceMetric:
241
"""
242
Measures adherence to a predefined plan.
243
244
Parameters:
245
- threshold (float): Success threshold (0-1, default: 0.5)
246
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
247
- include_reason (bool): Include reason in output (default: True)
248
249
Required Test Case Parameters:
250
- EXPECTED_OUTPUT (plan)
251
- ACTUAL_OUTPUT (execution)
252
253
Attributes:
254
- score (float): Plan adherence score (0-1)
255
- reason (str): Explanation of deviations from plan
256
- success (bool): Whether score meets threshold
257
"""
258
```
259
260
Usage example:
261
262
```python
263
from deepeval.metrics import PlanAdherenceMetric
264
from deepeval.test_case import LLMTestCase
265
266
metric = PlanAdherenceMetric(threshold=0.8)
267
268
test_case = LLMTestCase(
269
input="Execute the database migration plan",
270
expected_output="Plan: 1) Backup data, 2) Stop services, 3) Migrate, 4) Validate, 5) Restart",
271
actual_output="Executed: Backed up data, stopped services, ran migration, validated results, restarted services",
272
tools_called=[
273
ToolCall(name="backup_database"),
274
ToolCall(name="stop_services"),
275
ToolCall(name="migrate_database"),
276
ToolCall(name="validate_data"),
277
ToolCall(name="start_services")
278
]
279
)
280
281
metric.measure(test_case)
282
```
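Conceptually, adherence can be approximated as an ordered-match check: did the executed tool calls follow the planned sequence? The sketch below illustrates that idea with plain strings; it is an assumption for illustration, not deepeval's implementation, which judges adherence with an evaluation model:

```python
# Sketch: plan adherence as an in-order match between planned step names
# and executed tool names. Illustrative only -- deepeval's metric is
# LLM-judged rather than string-matched.

def plan_adherence_score(planned_steps, executed_tools):
    """Return the fraction of planned steps that were executed in order."""
    remaining = iter(executed_tools)  # consumed as steps are matched, enforcing order
    followed = sum(1 for step in planned_steps if step in remaining)
    return followed / len(planned_steps) if planned_steps else 1.0

plan = ["backup_database", "stop_services", "migrate_database",
        "validate_data", "start_services"]
executed = ["backup_database", "stop_services", "migrate_database",
            "validate_data", "start_services"]
print(plan_adherence_score(plan, executed))  # → 1.0
```

Out-of-order execution lowers the score because the iterator is consumed past a step that appears too early.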
283
284
### Step Efficiency Metric
285
286
Evaluates efficiency of steps taken to complete a task.
287
288
```python { .api }
289
class StepEfficiencyMetric:
290
"""
291
Evaluates efficiency of steps taken to complete a task.
292
293
Parameters:
294
- threshold (float): Success threshold (0-1, default: 0.5)
295
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
296
- include_reason (bool): Include reason in output (default: True)
297
298
Required Test Case Parameters:
299
- INPUT (task)
300
- TOOLS_CALLED or ACTUAL_OUTPUT
301
302
Attributes:
303
- score (float): Efficiency score (0-1)
304
- reason (str): Explanation of efficiency issues
305
- success (bool): Whether score meets threshold
306
"""
307
```
308
309
Usage example:
310
311
```python
312
from deepeval.metrics import StepEfficiencyMetric
313
from deepeval.test_case import LLMTestCase, ToolCall
314
315
metric = StepEfficiencyMetric(threshold=0.7)
316
317
# Inefficient agent
318
test_case_inefficient = LLMTestCase(
319
input="What's 2 + 2?",
320
actual_output="4",
321
tools_called=[
322
ToolCall(name="search_web", input_parameters={"query": "2+2"}),
323
ToolCall(name="calculator", input_parameters={"expression": "2+2"}),
324
ToolCall(name="verify_answer", input_parameters={"answer": 4})
325
]
326
)
327
328
metric.measure(test_case_inefficient)
329
# Low score: Too many steps for simple calculation
330
331
# Efficient agent
332
test_case_efficient = LLMTestCase(
333
input="What's 2 + 2?",
334
actual_output="4",
335
tools_called=[
336
ToolCall(name="calculator", input_parameters={"expression": "2+2"})
337
]
338
)
339
340
metric.measure(test_case_efficient)
341
# High score: Direct and efficient
342
```
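The contrast above can be reasoned about with a simple step-count heuristic: the score decays as the number of tool calls exceeds the minimum a task needs. This is a rough sketch of the intuition only; deepeval's actual metric is judged by an evaluation model, not a formula:

```python
# Heuristic sketch of step efficiency: the minimum number of steps a task
# needs, divided by the steps actually taken, capped at 1.0.
# Illustrative only -- not deepeval's scoring rule.

def step_efficiency_score(steps_taken, minimal_steps):
    """Penalize steps beyond the minimum needed for the task."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, minimal_steps / steps_taken)

print(step_efficiency_score(3, 1))  # three tool calls for a one-step task: low
print(step_efficiency_score(1, 1))  # → 1.0: one calculator call, fully efficient
```

Under this heuristic the three-call agent above scores about a third of the single-call agent, matching the low/high annotations in the example.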
343
344
### Goal Accuracy Metric
345
346
Measures accuracy in achieving specified goals.
347
348
```python { .api }
349
class GoalAccuracyMetric:
350
"""
351
Measures accuracy in achieving specified goals.
352
353
Parameters:
354
- threshold (float): Success threshold (0-1, default: 0.5)
355
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
356
- include_reason (bool): Include reason in output (default: True)
357
358
Required Test Case Parameters:
359
- INPUT (goal)
360
- ACTUAL_OUTPUT (result)
361
- EXPECTED_OUTPUT (optional, expected result)
362
363
Attributes:
364
- score (float): Goal accuracy score (0-1)
365
- reason (str): Explanation of goal achievement
366
- success (bool): Whether score meets threshold
367
"""
368
```
369
370
Usage example:
371
372
```python
373
from deepeval.metrics import GoalAccuracyMetric
374
from deepeval.test_case import LLMTestCase
375
376
metric = GoalAccuracyMetric(threshold=0.8)
377
378
test_case = LLMTestCase(
379
input="Goal: Increase website traffic by 20% through SEO optimization",
380
actual_output="Implemented SEO changes: optimized meta tags, improved page speed, created quality backlinks. Result: 23% increase in traffic.",
381
expected_output="20% increase in website traffic"
382
)
383
384
metric.measure(test_case)
385
print(f"Goal achievement: {metric.score:.2f}")
386
```
387
388
### MCP Task Completion Metric
389
390
Evaluates task completion using Model Context Protocol (MCP) tools.
391
392
```python { .api }
393
class MCPTaskCompletionMetric:
394
"""
395
Evaluates task completion using MCP.
396
397
Parameters:
398
- threshold (float): Success threshold (0-1, default: 0.5)
399
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
400
401
Required Test Case Parameters:
402
- INPUT
403
- ACTUAL_OUTPUT
404
- MCP_TOOLS_CALLED
405
406
Attributes:
407
- score (float): MCP task completion score (0-1)
408
- reason (str): Explanation of task completion with MCP
409
- success (bool): Whether score meets threshold
410
"""
411
```
412
413
Usage example:
414
415
```python
416
from deepeval.metrics import MCPTaskCompletionMetric
417
from deepeval.test_case import LLMTestCase, MCPToolCall, MCPServer
418
419
metric = MCPTaskCompletionMetric(threshold=0.8)
420
421
test_case = LLMTestCase(
422
input="Search for Python tutorials and save the top 5 results",
423
actual_output="Found and saved 5 Python tutorials",
424
mcp_servers=[
425
MCPServer(
426
server_name="search-server",
427
available_tools=["web_search", "save_results"]
428
)
429
],
430
mcp_tools_called=[
431
MCPToolCall(
432
server_name="search-server",
433
tool_name="web_search",
434
arguments={"query": "Python tutorials", "limit": 5}
435
),
436
MCPToolCall(
437
server_name="search-server",
438
tool_name="save_results",
439
arguments={"results": [...]}
440
)
441
]
442
)
443
444
metric.measure(test_case)
445
```
446
447
### MCP Use Metric
448
449
Evaluates proper MCP usage.
450
451
```python { .api }
452
class MCPUseMetric:
453
"""
454
Evaluates proper MCP usage.
455
456
Parameters:
457
- threshold (float): Success threshold (0-1, default: 0.5)
458
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
459
460
Required Test Case Parameters:
461
- INPUT
462
- MCP_TOOLS_CALLED
463
- MCP_SERVERS
464
465
Attributes:
466
- score (float): MCP usage score (0-1)
467
- reason (str): Explanation of MCP usage
468
- success (bool): Whether score meets threshold
469
"""
470
```
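At its core, evaluating MCP use involves checking that every tool call targets a tool the named server actually exposes. The sketch below illustrates that validity check with plain dicts and tuples; it is an assumption for illustration, not deepeval's implementation, which scores usage with an evaluation model:

```python
# Sketch of the core check behind MCP-use evaluation: each call must name
# a tool exposed by its server. Illustrative only -- not deepeval's code.

def mcp_use_score(servers, calls):
    """servers: {server_name: [tool names]}; calls: [(server_name, tool_name)].
    Return the fraction of calls targeting an available tool."""
    if not calls:
        return 1.0
    valid = sum(1 for server, tool in calls if tool in servers.get(server, []))
    return valid / len(calls)

servers = {"search-server": ["web_search", "save_results"]}
calls = [("search-server", "web_search"), ("search-server", "save_results")]
print(mcp_use_score(servers, calls))  # → 1.0
```

A call to an undeclared server or tool drops the score proportionally, which is the kind of misuse this metric is designed to surface.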
471
472
## Comprehensive Agentic Evaluation
473
474
Evaluate all agentic capabilities:
475
476
```python
477
from deepeval import evaluate
478
from deepeval.metrics import (
479
ToolCorrectnessMetric,
480
TaskCompletionMetric,
481
ToolUseMetric,
482
PlanQualityMetric,
483
StepEfficiencyMetric,
484
GoalAccuracyMetric
485
)
486
from deepeval.test_case import LLMTestCase
487
488
# Create agentic metrics suite
489
agentic_metrics = [
490
ToolCorrectnessMetric(threshold=0.8),
491
TaskCompletionMetric(threshold=0.8),
492
ToolUseMetric(threshold=0.7),
493
PlanQualityMetric(threshold=0.7),
494
StepEfficiencyMetric(threshold=0.7),
495
GoalAccuracyMetric(threshold=0.8)
496
]
497
498
# Test agent on complex task
499
test_cases = [
500
LLMTestCase(
501
input="Plan and execute a customer onboarding workflow",
502
actual_output=agent_execute("Plan and execute a customer onboarding workflow"),
503
tools_called=get_agent_tools_called(),
504
expected_tools=[...]
505
)
506
]
507
508
# Evaluate agent
509
result = evaluate(test_cases, agentic_metrics)
510
511
# Analyze agent performance
512
for test_result in result.test_results:
513
print(f"Agent Performance Report:")
514
for metric_name, metric_result in test_result.metrics.items():
515
status = "✓" if metric_result.success else "✗"
516
print(f" {status} {metric_name}: {metric_result.score:.2f}")
517
if not metric_result.success:
518
print(f" Issue: {metric_result.reason}")
519
```
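The reporting loop above reduces to a small aggregation helper. The sketch below uses plain dicts of `(score, threshold)` pairs standing in for deepeval's result objects, whose exact shape is assumed here for illustration:

```python
# Sketch of a metric-report aggregator. The {name: (score, threshold)}
# input shape is a stand-in for deepeval's result objects, assumed here
# for illustration.

def summarize(results):
    """Turn {metric_name: (score, threshold)} into pass/fail report lines."""
    lines = []
    for name, (score, threshold) in results.items():
        status = "✓" if score >= threshold else "✗"
        lines.append(f"{status} {name}: {score:.2f}")
    return lines

report = summarize({"TaskCompletion": (0.91, 0.8), "StepEfficiency": (0.55, 0.7)})
print("\n".join(report))
```

Keeping the report logic separate from the evaluation call makes it easy to reuse across test runs or pipe into CI output.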
520
521
## Multi-Step Agent Evaluation
522
523
Evaluate agents on complex multi-step tasks:
524
525
```python
526
from deepeval.test_case import LLMTestCase, ToolCall
527
from deepeval.metrics import (
528
TaskCompletionMetric,
529
ToolCorrectnessMetric,
530
StepEfficiencyMetric
531
)
532
533
# Complex multi-step task
534
test_case = LLMTestCase(
535
input="""
536
Research the top 3 competitors in the AI LLM space,
537
create a comparison table of their features,
538
and draft an executive summary
539
""",
540
actual_output="""
541
Researched OpenAI, Anthropic, and Google.
542
Created comparison table with pricing, capabilities, and APIs.
543
Executive summary: [detailed summary]
544
""",
545
tools_called=[
546
ToolCall(name="web_search", input_parameters={"query": "AI LLM providers"}),
547
ToolCall(name="web_search", input_parameters={"query": "OpenAI features"}),
548
ToolCall(name="web_search", input_parameters={"query": "Anthropic features"}),
549
ToolCall(name="web_search", input_parameters={"query": "Google AI features"}),
550
ToolCall(name="create_table", input_parameters={"data": [...]}),
551
ToolCall(name="generate_summary", input_parameters={"content": [...]})
552
]
553
)
554
555
# Evaluate with multiple metrics
556
metrics = [
557
TaskCompletionMetric(threshold=0.8),
558
StepEfficiencyMetric(threshold=0.7),
559
ToolCorrectnessMetric(threshold=0.8)
560
]
561
562
for metric in metrics:
563
metric.measure(test_case)
564
print(f"{metric.__class__.__name__}: {metric.score:.2f} - {metric.reason}")
565
```
566