# Agentic Metrics

Metrics for evaluating AI agents, including tool usage, task completion, plan quality, and goal achievement. These metrics assess how well agents perform complex, multi-step tasks.

## Imports

```python
from deepeval.metrics import (
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    ToolUseMetric,
    PlanQualityMetric,
    PlanAdherenceMetric,
    StepEfficiencyMetric,
    GoalAccuracyMetric,
    MCPTaskCompletionMetric,
    MCPUseMetric
)
```

## Capabilities

### Tool Correctness Metric

Evaluates whether the correct tools were called with correct parameters.

```python { .api }
class ToolCorrectnessMetric:
    """
    Evaluates whether the correct tools were called with correct parameters.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TOOLS_CALLED
    - EXPECTED_TOOLS

    Attributes:
    - score (float): Tool correctness score (0-1)
    - reason (str): Explanation of tool usage issues
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = ToolCorrectnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What's the weather in New York and London?",
    actual_output="New York: 72°F, sunny. London: 18°C, cloudy.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"city": "New York", "unit": "fahrenheit"},
            output={"temp": 72, "condition": "sunny"}
        ),
        ToolCall(
            name="get_weather",
            input_parameters={"city": "London", "unit": "celsius"},
            output={"temp": 18, "condition": "cloudy"}
        )
    ],
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"city": "New York"}),
        ToolCall(name="get_weather", input_parameters={"city": "London"})
    ]
)

metric.measure(test_case)

if metric.success:
    print("Tools used correctly")
else:
    print(f"Tool usage issues: {metric.reason}")
```

### Task Completion Metric

Evaluates whether the task was completed successfully based on the goal.

```python { .api }
class TaskCompletionMetric:
    """
    Evaluates whether the task was completed successfully.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - INPUT (task description)
    - ACTUAL_OUTPUT (task result)

    Attributes:
    - score (float): Task completion score (0-1)
    - reason (str): Explanation of completion status
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = TaskCompletionMetric(threshold=0.8)

test_case = LLMTestCase(
    input="Book a flight from NYC to LAX for next Monday, find a hotel near LAX, and create an itinerary.",
    actual_output="I've booked flight UA123 departing Monday 8am, reserved Hilton LAX for 3 nights, and created a 3-day itinerary including beach visits and city tours.",
    tools_called=[
        ToolCall(name="book_flight", input_parameters={"from": "NYC", "to": "LAX"}),
        ToolCall(name="book_hotel", input_parameters={"location": "LAX"}),
        ToolCall(name="create_itinerary", input_parameters={"days": 3})
    ]
)

metric.measure(test_case)
print(f"Task completion: {metric.score:.2f}")
```

### Tool Use Metric

Evaluates appropriate use of available tools.

```python { .api }
class ToolUseMetric:
    """
    Evaluates appropriate use of available tools.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT
    - TOOLS_CALLED

    Attributes:
    - score (float): Tool use appropriateness score (0-1)
    - reason (str): Explanation of tool usage
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = ToolUseMetric(threshold=0.7)

test_case = LLMTestCase(
    input="Find restaurants near me and check the weather",
    actual_output="Here are 5 restaurants near you. The weather is sunny, 75°F.",
    tools_called=[
        ToolCall(
            name="search_restaurants",
            input_parameters={"location": "current"}
        ),
        ToolCall(
            name="get_weather",
            input_parameters={"location": "current"}
        )
    ]
)

metric.measure(test_case)
```

### Plan Quality Metric

Evaluates the quality of generated plans.

```python { .api }
class PlanQualityMetric:
    """
    Evaluates the quality of generated plans.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - INPUT (task/goal)
    - ACTUAL_OUTPUT (plan)

    Attributes:
    - score (float): Plan quality score (0-1)
    - reason (str): Explanation of plan quality
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import PlanQualityMetric
from deepeval.test_case import LLMTestCase

metric = PlanQualityMetric(threshold=0.7)

test_case = LLMTestCase(
    input="Create a plan to migrate our database from MySQL to PostgreSQL",
    actual_output="""
    Plan:
    1. Audit current MySQL schema and data
    2. Set up PostgreSQL instance
    3. Create migration scripts for schema
    4. Test migration on staging environment
    5. Execute migration during maintenance window
    6. Validate data integrity post-migration
    7. Monitor performance for 48 hours
    """
)

metric.measure(test_case)
print(f"Plan quality: {metric.score:.2f}")
```

### Plan Adherence Metric

Measures adherence to a predefined plan.

```python { .api }
class PlanAdherenceMetric:
    """
    Measures adherence to a predefined plan.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - EXPECTED_OUTPUT (plan)
    - ACTUAL_OUTPUT (execution)

    Attributes:
    - score (float): Plan adherence score (0-1)
    - reason (str): Explanation of deviations from plan
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import PlanAdherenceMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = PlanAdherenceMetric(threshold=0.8)

test_case = LLMTestCase(
    input="Execute the database migration plan",
    expected_output="Plan: 1) Backup data, 2) Stop services, 3) Migrate, 4) Validate, 5) Restart",
    actual_output="Executed: Backed up data, stopped services, ran migration, validated results, restarted services",
    tools_called=[
        ToolCall(name="backup_database"),
        ToolCall(name="stop_services"),
        ToolCall(name="migrate_database"),
        ToolCall(name="validate_data"),
        ToolCall(name="start_services")
    ]
)

metric.measure(test_case)
```

### Step Efficiency Metric

Evaluates efficiency of steps taken to complete a task.

```python { .api }
class StepEfficiencyMetric:
    """
    Evaluates efficiency of steps taken to complete a task.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - INPUT (task)
    - TOOLS_CALLED or ACTUAL_OUTPUT

    Attributes:
    - score (float): Efficiency score (0-1)
    - reason (str): Explanation of efficiency issues
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import StepEfficiencyMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = StepEfficiencyMetric(threshold=0.7)

# Inefficient agent
test_case_inefficient = LLMTestCase(
    input="What's 2 + 2?",
    actual_output="4",
    tools_called=[
        ToolCall(name="search_web", input_parameters={"query": "2+2"}),
        ToolCall(name="calculator", input_parameters={"expression": "2+2"}),
        ToolCall(name="verify_answer", input_parameters={"answer": 4})
    ]
)

metric.measure(test_case_inefficient)
# Low score: too many steps for a simple calculation

# Efficient agent
test_case_efficient = LLMTestCase(
    input="What's 2 + 2?",
    actual_output="4",
    tools_called=[
        ToolCall(name="calculator", input_parameters={"expression": "2+2"})
    ]
)

metric.measure(test_case_efficient)
# High score: direct and efficient
```

### Goal Accuracy Metric

Measures accuracy in achieving specified goals.

```python { .api }
class GoalAccuracyMetric:
    """
    Measures accuracy in achieving specified goals.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - INPUT (goal)
    - ACTUAL_OUTPUT (result)
    - EXPECTED_OUTPUT (optional, expected result)

    Attributes:
    - score (float): Goal accuracy score (0-1)
    - reason (str): Explanation of goal achievement
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import GoalAccuracyMetric
from deepeval.test_case import LLMTestCase

metric = GoalAccuracyMetric(threshold=0.8)

test_case = LLMTestCase(
    input="Goal: Increase website traffic by 20% through SEO optimization",
    actual_output="Implemented SEO changes: optimized meta tags, improved page speed, created quality backlinks. Result: 23% increase in traffic.",
    expected_output="20% increase in website traffic"
)

metric.measure(test_case)
print(f"Goal achievement: {metric.score:.2f}")
```

### MCP Task Completion Metric

Evaluates task completion using Model Context Protocol (MCP) tools.

```python { .api }
class MCPTaskCompletionMetric:
    """
    Evaluates task completion using MCP.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT
    - MCP_TOOLS_CALLED

    Attributes:
    - score (float): MCP task completion score (0-1)
    - reason (str): Explanation of task completion with MCP
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import MCPTaskCompletionMetric
from deepeval.test_case import LLMTestCase, MCPToolCall, MCPServer

metric = MCPTaskCompletionMetric(threshold=0.8)

test_case = LLMTestCase(
    input="Search for Python tutorials and save the top 5 results",
    actual_output="Found and saved 5 Python tutorials",
    mcp_servers=[
        MCPServer(
            server_name="search-server",
            available_tools=["web_search", "save_results"]
        )
    ],
    mcp_tools_called=[
        MCPToolCall(
            server_name="search-server",
            tool_name="web_search",
            arguments={"query": "Python tutorials", "limit": 5}
        ),
        MCPToolCall(
            server_name="search-server",
            tool_name="save_results",
            arguments={"results": [...]}
        )
    ]
)

metric.measure(test_case)
```

### MCP Use Metric

Evaluates proper MCP usage.

```python { .api }
class MCPUseMetric:
    """
    Evaluates proper MCP usage.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - MCP_TOOLS_CALLED
    - MCP_SERVERS

    Attributes:
    - score (float): MCP usage score (0-1)
    - reason (str): Explanation of MCP usage
    - success (bool): Whether score meets threshold
    """
```

## Comprehensive Agentic Evaluation

Evaluate all agentic capabilities:

```python
from deepeval import evaluate
from deepeval.metrics import (
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    ToolUseMetric,
    PlanQualityMetric,
    StepEfficiencyMetric,
    GoalAccuracyMetric
)
from deepeval.test_case import LLMTestCase

# Create agentic metrics suite
agentic_metrics = [
    ToolCorrectnessMetric(threshold=0.8),
    TaskCompletionMetric(threshold=0.8),
    ToolUseMetric(threshold=0.7),
    PlanQualityMetric(threshold=0.7),
    StepEfficiencyMetric(threshold=0.7),
    GoalAccuracyMetric(threshold=0.8)
]

# Test agent on a complex task.
# agent_execute and get_agent_tools_called are placeholders for your own
# agent invocation and tool-trace collection.
test_cases = [
    LLMTestCase(
        input="Plan and execute a customer onboarding workflow",
        actual_output=agent_execute("Plan and execute a customer onboarding workflow"),
        tools_called=get_agent_tools_called(),
        expected_tools=[...]
    )
]

# Evaluate agent
result = evaluate(test_cases, agentic_metrics)

# Analyze agent performance
for test_result in result.test_results:
    print("Agent Performance Report:")
    for metric_name, metric_result in test_result.metrics.items():
        status = "✓" if metric_result.success else "✗"
        print(f"  {status} {metric_name}: {metric_result.score:.2f}")
        if not metric_result.success:
            print(f"    Issue: {metric_result.reason}")
```

## Multi-Step Agent Evaluation

Evaluate agents on complex multi-step tasks:

```python
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolCorrectnessMetric,
    StepEfficiencyMetric
)

# Complex multi-step task
test_case = LLMTestCase(
    input="""
    Research the top 3 competitors in the AI LLM space,
    create a comparison table of their features,
    and draft an executive summary
    """,
    actual_output="""
    Researched OpenAI, Anthropic, and Google.
    Created comparison table with pricing, capabilities, and APIs.
    Executive summary: [detailed summary]
    """,
    tools_called=[
        ToolCall(name="web_search", input_parameters={"query": "AI LLM providers"}),
        ToolCall(name="web_search", input_parameters={"query": "OpenAI features"}),
        ToolCall(name="web_search", input_parameters={"query": "Anthropic features"}),
        ToolCall(name="web_search", input_parameters={"query": "Google AI features"}),
        ToolCall(name="create_table", input_parameters={"data": [...]}),
        ToolCall(name="generate_summary", input_parameters={"content": [...]})
    ]
)

# Evaluate with multiple metrics
metrics = [
    TaskCompletionMetric(threshold=0.8),
    StepEfficiencyMetric(threshold=0.7),
    ToolCorrectnessMetric(threshold=0.8)
]

for metric in metrics:
    metric.measure(test_case)
    print(f"{metric.__class__.__name__}: {metric.score:.2f} - {metric.reason}")
```
