# Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement. These metrics assess how well agents perform complex, multi-step tasks.

## Imports

```python
from deepeval.metrics import (
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    ToolUseMetric,
    PlanQualityMetric,
    PlanAdherenceMetric,
    StepEfficiencyMetric,
    GoalAccuracyMetric,
    MCPTaskCompletionMetric,
    MCPUseMetric,
)
```
20
## Capabilities
21
22
### Tool Correctness Metric
23
24
Evaluates whether the correct tools were called with correct parameters.
25
26
```python { .api }
27
class ToolCorrectnessMetric:
28
"""
29
Evaluates whether the correct tools were called with correct parameters.
30
31
Parameters:
32
- threshold (float): Success threshold (0-1, default: 0.5)
33
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
34
- include_reason (bool): Include reason in output (default: True)
35
- async_mode (bool): Async mode (default: True)
36
37
Required Test Case Parameters:
38
- TOOLS_CALLED
39
- EXPECTED_TOOLS
40
41
Attributes:
42
- score (float): Tool correctness score (0-1)
43
- reason (str): Explanation of tool usage issues
44
- success (bool): Whether score meets threshold
45
"""
46
```
47
48
Usage example:
49
50
```python
51
from deepeval.metrics import ToolCorrectnessMetric
52
from deepeval.test_case import LLMTestCase, ToolCall
53
54
metric = ToolCorrectnessMetric(threshold=0.8)
55
56
test_case = LLMTestCase(
57
input="What's the weather in New York and London?",
58
actual_output="New York: 72°F, sunny. London: 18°C, cloudy.",
59
tools_called=[
60
ToolCall(
61
name="get_weather",
62
input_parameters={"city": "New York", "unit": "fahrenheit"},
63
output={"temp": 72, "condition": "sunny"}
64
),
65
ToolCall(
66
name="get_weather",
67
input_parameters={"city": "London", "unit": "celsius"},
68
output={"temp": 18, "condition": "cloudy"}
69
)
70
],
71
expected_tools=[
72
ToolCall(name="get_weather", input_parameters={"city": "New York"}),
73
ToolCall(name="get_weather", input_parameters={"city": "London"})
74
]
75
)
76
77
metric.measure(test_case)
78
79
if metric.success:
80
print("Tools used correctly")
81
else:
82
print(f"Tool usage issues: {metric.reason}")
83
```
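To build intuition for what this metric measures, the core comparison can be approximated as the fraction of expected tool calls that were actually made. The sketch below is illustrative only — it is not deepeval's internal algorithm (which also weighs input parameters via an evaluation model) — and uses plain `(name, params)` tuples in place of `ToolCall` objects:

```python
# Simplified sketch of tool-correctness scoring: the fraction of expected
# tool names found among the tools actually called. Illustrative only --
# deepeval's real metric also compares input parameters.

def tool_correctness_score(tools_called, expected_tools):
    """Return the fraction of expected tool names present in tools_called."""
    called_names = [name for name, _params in tools_called]
    matched = sum(1 for name, _params in expected_tools if name in called_names)
    return matched / len(expected_tools) if expected_tools else 1.0

called = [("get_weather", {"city": "New York"}), ("get_weather", {"city": "London"})]
expected = [("get_weather", {"city": "New York"}), ("get_weather", {"city": "London"})]
print(tool_correctness_score(called, expected))  # → 1.0
```

A missing expected tool lowers the score proportionally, which mirrors how a sub-threshold `score` flips `success` to `False` above.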
84
85
### Task Completion Metric
86
87
Evaluates whether the task was completed successfully based on the goal.
88
89
```python { .api }
90
class TaskCompletionMetric:
91
"""
92
Evaluates whether the task was completed successfully.
93
94
Parameters:
95
- threshold (float): Success threshold (0-1, default: 0.5)
96
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
97
- include_reason (bool): Include reason in output (default: True)
98
99
Required Test Case Parameters:
100
- INPUT (task description)
101
- ACTUAL_OUTPUT (task result)
102
103
Attributes:
104
- score (float): Task completion score (0-1)
105
- reason (str): Explanation of completion status
106
- success (bool): Whether score meets threshold
107
"""
108
```
109
110
Usage example:
111
112
```python
113
from deepeval.metrics import TaskCompletionMetric
114
from deepeval.test_case import LLMTestCase
115
116
metric = TaskCompletionMetric(threshold=0.8)
117
118
test_case = LLMTestCase(
119
input="Book a flight from NYC to LAX for next Monday, find a hotel near LAX, and create an itinerary.",
120
actual_output="I've booked flight UA123 departing Monday 8am, reserved Hilton LAX for 3 nights, and created a 3-day itinerary including beach visits and city tours.",
121
tools_called=[
122
ToolCall(name="book_flight", input_parameters={"from": "NYC", "to": "LAX"}),
123
ToolCall(name="book_hotel", input_parameters={"location": "LAX"}),
124
ToolCall(name="create_itinerary", input_parameters={"days": 3})
125
]
126
)
127
128
metric.measure(test_case)
129
print(f"Task completion: {metric.score:.2f}")
130
```
131
132
### Tool Use Metric
133
134
Evaluates appropriate use of available tools.
135
136
```python { .api }
137
class ToolUseMetric:
138
"""
139
Evaluates appropriate use of available tools.
140
141
Parameters:
142
- threshold (float): Success threshold (0-1, default: 0.5)
143
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
144
- include_reason (bool): Include reason in output (default: True)
145
146
Required Test Case Parameters:
147
- INPUT
148
- ACTUAL_OUTPUT
149
- TOOLS_CALLED
150
151
Attributes:
152
- score (float): Tool use appropriateness score (0-1)
153
- reason (str): Explanation of tool usage
154
- success (bool): Whether score meets threshold
155
"""
156
```
157
158
Usage example:
159
160
```python
161
from deepeval.metrics import ToolUseMetric
162
from deepeval.test_case import LLMTestCase, ToolCall
163
164
metric = ToolUseMetric(threshold=0.7)
165
166
test_case = LLMTestCase(
167
input="Find restaurants near me and check the weather",
168
actual_output="Here are 5 restaurants near you. The weather is sunny, 75°F.",
169
tools_called=[
170
ToolCall(
171
name="search_restaurants",
172
input_parameters={"location": "current"}
173
),
174
ToolCall(
175
name="get_weather",
176
input_parameters={"location": "current"}
177
)
178
]
179
)
180
181
metric.measure(test_case)
182
```
183
184
### Plan Quality Metric
185
186
Evaluates the quality of generated plans.
187
188
```python { .api }
189
class PlanQualityMetric:
190
"""
191
Evaluates the quality of generated plans.
192
193
Parameters:
194
- threshold (float): Success threshold (0-1, default: 0.5)
195
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
196
- include_reason (bool): Include reason in output (default: True)
197
198
Required Test Case Parameters:
199
- INPUT (task/goal)
200
- ACTUAL_OUTPUT (plan)
201
202
Attributes:
203
- score (float): Plan quality score (0-1)
204
- reason (str): Explanation of plan quality
205
- success (bool): Whether score meets threshold
206
"""
207
```
208
209
Usage example:
210
211
```python
212
from deepeval.metrics import PlanQualityMetric
213
from deepeval.test_case import LLMTestCase
214
215
metric = PlanQualityMetric(threshold=0.7)
216
217
test_case = LLMTestCase(
218
input="Create a plan to migrate our database from MySQL to PostgreSQL",
219
actual_output="""
220
Plan:
221
1. Audit current MySQL schema and data
222
2. Set up PostgreSQL instance
223
3. Create migration scripts for schema
224
4. Test migration on staging environment
225
5. Execute migration during maintenance window
226
6. Validate data integrity post-migration
227
7. Monitor performance for 48 hours
228
"""
229
)
230
231
metric.measure(test_case)
232
print(f"Plan quality: {metric.score:.2f}")
233
```
234
235
### Plan Adherence Metric
236
237
Measures adherence to a predefined plan.
238
239
```python { .api }
240
class PlanAdherenceMetric:
241
"""
242
Measures adherence to a predefined plan.
243
244
Parameters:
245
- threshold (float): Success threshold (0-1, default: 0.5)
246
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
247
- include_reason (bool): Include reason in output (default: True)
248
249
Required Test Case Parameters:
250
- EXPECTED_OUTPUT (plan)
251
- ACTUAL_OUTPUT (execution)
252
253
Attributes:
254
- score (float): Plan adherence score (0-1)
255
- reason (str): Explanation of deviations from plan
256
- success (bool): Whether score meets threshold
257
"""
258
```
259
260
Usage example:
261
262
```python
263
from deepeval.metrics import PlanAdherenceMetric
264
from deepeval.test_case import LLMTestCase
265
266
metric = PlanAdherenceMetric(threshold=0.8)
267
268
test_case = LLMTestCase(
269
input="Execute the database migration plan",
270
expected_output="Plan: 1) Backup data, 2) Stop services, 3) Migrate, 4) Validate, 5) Restart",
271
actual_output="Executed: Backed up data, stopped services, ran migration, validated results, restarted services",
272
tools_called=[
273
ToolCall(name="backup_database"),
274
ToolCall(name="stop_services"),
275
ToolCall(name="migrate_database"),
276
ToolCall(name="validate_data"),
277
ToolCall(name="start_services")
278
]
279
)
280
281
metric.measure(test_case)
282
```
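Conceptually, adherence can be approximated as an ordered-match check: did the executed tool calls follow the planned sequence? The sketch below illustrates that idea with plain strings; it is an assumption for illustration, not deepeval's implementation, which judges adherence with an evaluation model:

```python
# Sketch: plan adherence as an in-order match between planned step names
# and executed tool names. Illustrative only -- deepeval's metric is
# LLM-judged rather than string-matched.

def plan_adherence_score(planned_steps, executed_tools):
    """Return the fraction of planned steps that were executed in order."""
    remaining = iter(executed_tools)  # consumed as steps are matched, enforcing order
    followed = sum(1 for step in planned_steps if step in remaining)
    return followed / len(planned_steps) if planned_steps else 1.0

plan = ["backup_database", "stop_services", "migrate_database",
        "validate_data", "start_services"]
executed = ["backup_database", "stop_services", "migrate_database",
            "validate_data", "start_services"]
print(plan_adherence_score(plan, executed))  # → 1.0
```

Out-of-order execution lowers the score because the iterator is consumed past a step that appears too early.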
283
284
### Step Efficiency Metric
285
286
Evaluates efficiency of steps taken to complete a task.
287
288
```python { .api }
289
class StepEfficiencyMetric:
290
"""
291
Evaluates efficiency of steps taken to complete a task.
292
293
Parameters:
294
- threshold (float): Success threshold (0-1, default: 0.5)
295
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
296
- include_reason (bool): Include reason in output (default: True)
297
298
Required Test Case Parameters:
299
- INPUT (task)
300
- TOOLS_CALLED or ACTUAL_OUTPUT
301
302
Attributes:
303
- score (float): Efficiency score (0-1)
304
- reason (str): Explanation of efficiency issues
305
- success (bool): Whether score meets threshold
306
"""
307
```
308
309
Usage example:
310
311
```python
312
from deepeval.metrics import StepEfficiencyMetric
313
from deepeval.test_case import LLMTestCase, ToolCall
314
315
metric = StepEfficiencyMetric(threshold=0.7)
316
317
# Inefficient agent
318
test_case_inefficient = LLMTestCase(
319
input="What's 2 + 2?",
320
actual_output="4",
321
tools_called=[
322
ToolCall(name="search_web", input_parameters={"query": "2+2"}),
323
ToolCall(name="calculator", input_parameters={"expression": "2+2"}),
324
ToolCall(name="verify_answer", input_parameters={"answer": 4})
325
]
326
)
327
328
metric.measure(test_case_inefficient)
329
# Low score: Too many steps for simple calculation
330
331
# Efficient agent
332
test_case_efficient = LLMTestCase(
333
input="What's 2 + 2?",
334
actual_output="4",
335
tools_called=[
336
ToolCall(name="calculator", input_parameters={"expression": "2+2"})
337
]
338
)
339
340
metric.measure(test_case_efficient)
341
# High score: Direct and efficient
342
```
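The contrast above can be reasoned about with a simple step-count heuristic: the score decays as the number of tool calls exceeds the minimum a task needs. This is a rough sketch of the intuition only; deepeval's actual metric is judged by an evaluation model, not a formula:

```python
# Heuristic sketch of step efficiency: the minimum number of steps a task
# needs, divided by the steps actually taken, capped at 1.0.
# Illustrative only -- not deepeval's scoring rule.

def step_efficiency_score(steps_taken, minimal_steps):
    """Penalize steps beyond the minimum needed for the task."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, minimal_steps / steps_taken)

print(step_efficiency_score(3, 1))  # three tool calls for a one-step task: low
print(step_efficiency_score(1, 1))  # → 1.0: one calculator call, fully efficient
```

Under this heuristic the three-call agent above scores about a third of the single-call agent, matching the low/high annotations in the example.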
343
344
### Goal Accuracy Metric
345
346
Measures accuracy in achieving specified goals.
347
348
```python { .api }
349
class GoalAccuracyMetric:
350
"""
351
Measures accuracy in achieving specified goals.
352
353
Parameters:
354
- threshold (float): Success threshold (0-1, default: 0.5)
355
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
356
- include_reason (bool): Include reason in output (default: True)
357
358
Required Test Case Parameters:
359
- INPUT (goal)
360
- ACTUAL_OUTPUT (result)
361
- EXPECTED_OUTPUT (optional, expected result)
362
363
Attributes:
364
- score (float): Goal accuracy score (0-1)
365
- reason (str): Explanation of goal achievement
366
- success (bool): Whether score meets threshold
367
"""
368
```
369
370
Usage example:
371
372
```python
373
from deepeval.metrics import GoalAccuracyMetric
374
from deepeval.test_case import LLMTestCase
375
376
metric = GoalAccuracyMetric(threshold=0.8)
377
378
test_case = LLMTestCase(
379
input="Goal: Increase website traffic by 20% through SEO optimization",
380
actual_output="Implemented SEO changes: optimized meta tags, improved page speed, created quality backlinks. Result: 23% increase in traffic.",
381
expected_output="20% increase in website traffic"
382
)
383
384
metric.measure(test_case)
385
print(f"Goal achievement: {metric.score:.2f}")
386
```
387
388
### MCP Task Completion Metric
389
390
Evaluates task completion using Model Context Protocol (MCP) tools.
391
392
```python { .api }
393
class MCPTaskCompletionMetric:
394
"""
395
Evaluates task completion using MCP.
396
397
Parameters:
398
- threshold (float): Success threshold (0-1, default: 0.5)
399
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
400
401
Required Test Case Parameters:
402
- INPUT
403
- ACTUAL_OUTPUT
404
- MCP_TOOLS_CALLED
405
406
Attributes:
407
- score (float): MCP task completion score (0-1)
408
- reason (str): Explanation of task completion with MCP
409
- success (bool): Whether score meets threshold
410
"""
411
```
412
413
Usage example:
414
415
```python
416
from deepeval.metrics import MCPTaskCompletionMetric
417
from deepeval.test_case import LLMTestCase, MCPToolCall, MCPServer
418
419
metric = MCPTaskCompletionMetric(threshold=0.8)
420
421
test_case = LLMTestCase(
422
input="Search for Python tutorials and save the top 5 results",
423
actual_output="Found and saved 5 Python tutorials",
424
mcp_servers=[
425
MCPServer(
426
server_name="search-server",
427
available_tools=["web_search", "save_results"]
428
)
429
],
430
mcp_tools_called=[
431
MCPToolCall(
432
server_name="search-server",
433
tool_name="web_search",
434
arguments={"query": "Python tutorials", "limit": 5}
435
),
436
MCPToolCall(
437
server_name="search-server",
438
tool_name="save_results",
439
arguments={"results": [...]}
440
)
441
]
442
)
443
444
metric.measure(test_case)
445
```
446
447
### MCP Use Metric
448
449
Evaluates proper MCP usage.
450
451
```python { .api }
452
class MCPUseMetric:
453
"""
454
Evaluates proper MCP usage.
455
456
Parameters:
457
- threshold (float): Success threshold (0-1, default: 0.5)
458
- model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
459
460
Required Test Case Parameters:
461
- INPUT
462
- MCP_TOOLS_CALLED
463
- MCP_SERVERS
464
465
Attributes:
466
- score (float): MCP usage score (0-1)
467
- reason (str): Explanation of MCP usage
468
- success (bool): Whether score meets threshold
469
"""
470
```
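At its core, evaluating MCP use involves checking that every tool call targets a tool the named server actually exposes. The sketch below illustrates that validity check with plain dicts and tuples; it is an assumption for illustration, not deepeval's implementation, which scores usage with an evaluation model:

```python
# Sketch of the core check behind MCP-use evaluation: each call must name
# a tool exposed by its server. Illustrative only -- not deepeval's code.

def mcp_use_score(servers, calls):
    """servers: {server_name: [tool names]}; calls: [(server_name, tool_name)].
    Return the fraction of calls targeting an available tool."""
    if not calls:
        return 1.0
    valid = sum(1 for server, tool in calls if tool in servers.get(server, []))
    return valid / len(calls)

servers = {"search-server": ["web_search", "save_results"]}
calls = [("search-server", "web_search"), ("search-server", "save_results")]
print(mcp_use_score(servers, calls))  # → 1.0
```

A call to an undeclared server or tool drops the score proportionally, which is the kind of misuse this metric is designed to surface.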
471
472
## Comprehensive Agentic Evaluation
473
474
Evaluate all agentic capabilities:
475
476
```python
477
from deepeval import evaluate
478
from deepeval.metrics import (
479
ToolCorrectnessMetric,
480
TaskCompletionMetric,
481
ToolUseMetric,
482
PlanQualityMetric,
483
StepEfficiencyMetric,
484
GoalAccuracyMetric
485
)
486
from deepeval.test_case import LLMTestCase
487
488
# Create agentic metrics suite
489
agentic_metrics = [
490
ToolCorrectnessMetric(threshold=0.8),
491
TaskCompletionMetric(threshold=0.8),
492
ToolUseMetric(threshold=0.7),
493
PlanQualityMetric(threshold=0.7),
494
StepEfficiencyMetric(threshold=0.7),
495
GoalAccuracyMetric(threshold=0.8)
496
]
497
498
# Test agent on complex task
499
test_cases = [
500
LLMTestCase(
501
input="Plan and execute a customer onboarding workflow",
502
actual_output=agent_execute("Plan and execute a customer onboarding workflow"),
503
tools_called=get_agent_tools_called(),
504
expected_tools=[...]
505
)
506
]
507
508
# Evaluate agent
509
result = evaluate(test_cases, agentic_metrics)
510
511
# Analyze agent performance
512
for test_result in result.test_results:
513
print(f"Agent Performance Report:")
514
for metric_name, metric_result in test_result.metrics.items():
515
status = "✓" if metric_result.success else "✗"
516
print(f" {status} {metric_name}: {metric_result.score:.2f}")
517
if not metric_result.success:
518
print(f" Issue: {metric_result.reason}")
519
```
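The reporting loop above reduces to a small aggregation helper. The sketch below uses plain dicts of `(score, threshold)` pairs standing in for deepeval's result objects, whose exact shape is assumed here for illustration:

```python
# Sketch of a metric-report aggregator. The {name: (score, threshold)}
# input shape is a stand-in for deepeval's result objects, assumed here
# for illustration.

def summarize(results):
    """Turn {metric_name: (score, threshold)} into pass/fail report lines."""
    lines = []
    for name, (score, threshold) in results.items():
        status = "✓" if score >= threshold else "✗"
        lines.append(f"{status} {name}: {score:.2f}")
    return lines

report = summarize({"TaskCompletion": (0.91, 0.8), "StepEfficiency": (0.55, 0.7)})
print("\n".join(report))
```

Keeping the report logic separate from the evaluation call makes it easy to reuse across test runs or pipe into CI output.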
520
521
## Multi-Step Agent Evaluation
522
523
Evaluate agents on complex multi-step tasks:
524
525
```python
526
from deepeval.test_case import LLMTestCase, ToolCall
527
from deepeval.metrics import (
528
TaskCompletionMetric,
529
ToolCorrectnessMetric,
530
StepEfficiencyMetric
531
)
532
533
# Complex multi-step task
534
test_case = LLMTestCase(
535
input="""
536
Research the top 3 competitors in the AI LLM space,
537
create a comparison table of their features,
538
and draft an executive summary
539
""",
540
actual_output="""
541
Researched OpenAI, Anthropic, and Google.
542
Created comparison table with pricing, capabilities, and APIs.
543
Executive summary: [detailed summary]
544
""",
545
tools_called=[
546
ToolCall(name="web_search", input_parameters={"query": "AI LLM providers"}),
547
ToolCall(name="web_search", input_parameters={"query": "OpenAI features"}),
548
ToolCall(name="web_search", input_parameters={"query": "Anthropic features"}),
549
ToolCall(name="web_search", input_parameters={"query": "Google AI features"}),
550
ToolCall(name="create_table", input_parameters={"data": [...]}),
551
ToolCall(name="generate_summary", input_parameters={"content": [...]})
552
]
553
)
554
555
# Evaluate with multiple metrics
556
metrics = [
557
TaskCompletionMetric(threshold=0.8),
558
StepEfficiencyMetric(threshold=0.7),
559
ToolCorrectnessMetric(threshold=0.8)
560
]
561
562
for metric in metrics:
563
metric.measure(test_case)
564
print(f"{metric.__class__.__name__}: {metric.score:.2f} - {metric.reason}")
565
```
566