# Test Cases

Test cases are structured containers representing LLM interactions to be evaluated. DeepEval provides specialized test case classes for different evaluation scenarios: standard LLM tests, multi-turn conversations, multimodal inputs, and arena-style comparisons.

## Imports

```python
from deepeval.test_case import (
    LLMTestCase,
    LLMTestCaseParams,
    ConversationalTestCase,
    Turn,
    TurnParams,
    MLLMTestCase,
    MLLMImage,
    MLLMTestCaseParams,
    ArenaTestCase,
    Arena,
    ToolCall,
    ToolCallParams,
    MCPServer,
    MCPToolCall,
    MCPPromptCall,
    MCPResourceCall
)
```

## Capabilities

### LLM Test Case

Standard test case for evaluating single LLM interactions, supporting inputs, outputs, context, and tool usage.

```python { .api }
class LLMTestCase:
    """
    Represents a test case for evaluating LLM outputs.

    Parameters:
    - input (str): Input prompt to the LLM
    - actual_output (str, optional): Actual output from the LLM
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context information
    - retrieval_context (List[str], optional): Retrieved context for RAG applications
    - additional_metadata (Dict, optional): Additional metadata
    - tools_called (List[ToolCall], optional): Tools called by the LLM
    - expected_tools (List[ToolCall], optional): Expected tools to be called
    - comments (str, optional): Comments about the test case
    - token_cost (float, optional): Cost in tokens
    - completion_time (float, optional): Time to complete in seconds
    - name (str, optional): Name of the test case
    - tags (List[str], optional): Tags for organization
    - mcp_servers (List[MCPServer], optional): MCP servers configuration
    - mcp_tools_called (List[MCPToolCall], optional): MCP tools called
    - mcp_resources_called (List[MCPResourceCall], optional): MCP resources called
    - mcp_prompts_called (List[MCPPromptCall], optional): MCP prompts called
    """
```

Usage example:

```python
from deepeval.test_case import LLMTestCase, ToolCall

# Basic test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris"
)

# RAG test case with retrieval context
rag_test_case = LLMTestCase(
    input="What's our refund policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="30-day full refund policy",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days."
    ],
    context=["Customer support FAQ"]
)

# Agentic test case with tool calls
agentic_test_case = LLMTestCase(
    input="What's the weather in New York?",
    actual_output="The current weather in New York is 72°F and sunny.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"location": "New York", "unit": "fahrenheit"},
            output={"temperature": 72, "condition": "sunny"}
        )
    ],
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"location": "New York"})
    ]
)
```

### LLM Test Case Parameters

Enumeration of test case parameters for use with metrics.

```python { .api }
class LLMTestCaseParams:
    """
    Enumeration of test case parameters.

    Values:
    - INPUT: "input"
    - ACTUAL_OUTPUT: "actual_output"
    - EXPECTED_OUTPUT: "expected_output"
    - CONTEXT: "context"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - EXPECTED_TOOLS: "expected_tools"
    - MCP_SERVERS: "mcp_servers"
    - MCP_TOOLS_CALLED: "mcp_tools_called"
    - MCP_RESOURCES_CALLED: "mcp_resources_called"
    - MCP_PROMPTS_CALLED: "mcp_prompts_called"
    """
```

Usage example:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Use params to specify what to evaluate
metric = GEval(
    name="Answer Relevancy",
    criteria="Determine if the actual output is relevant to the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
```

137

138

### Tool Call

139

140

Represents a tool call made by an LLM or expected to be called.

141

142

```python { .api }

143

class ToolCall:

144

"""

145

Represents a tool call made by an LLM.

146

147

Parameters:

148

- name (str): Name of the tool

149

- description (str, optional): Description of the tool

150

- reasoning (str, optional): Reasoning for calling the tool

151

- output (Any, optional): Output from the tool

152

- input_parameters (Dict[str, Any], optional): Input parameters to the tool

153

"""

154

```

155

156

Usage example:

157

158

```python

159

from deepeval.test_case import ToolCall

160

161

# Define a tool call

162

tool_call = ToolCall(

163

name="search_database",

164

description="Searches the product database",

165

reasoning="Need to find product information",

166

input_parameters={

167

"query": "wireless headphones",

168

"max_results": 10

169

},

170

output=[

171

{"id": 1, "name": "Premium Wireless Headphones"},

172

{"id": 2, "name": "Budget Wireless Headphones"}

173

]

174

)

175

```

### Tool Call Parameters

Enumeration of tool call parameters.

```python { .api }
class ToolCallParams:
    """
    Enumeration of tool call parameters.

    Values:
    - INPUT_PARAMETERS: "input_parameters"
    - OUTPUT: "output"
    """
```
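To illustrate how string-valued parameter enums like this can drive a field-by-field comparison between an actual and an expected tool call, here is a standalone sketch. It defines its own minimal `ToolCall` dataclass, a `ToolCallParams` enum mirroring the documented values, and a hypothetical `fields_match` helper — none of these are DeepEval's own implementations, just stand-ins for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional

# Standalone stand-ins mirroring the documented values; not DeepEval's classes.
class ToolCallParams(Enum):
    INPUT_PARAMETERS = "input_parameters"
    OUTPUT = "output"

@dataclass
class ToolCall:
    name: str
    input_parameters: Optional[Dict[str, Any]] = None
    output: Any = None

def fields_match(actual: ToolCall, expected: ToolCall,
                 params: List[ToolCallParams]) -> bool:
    """Hypothetical helper: compare only the fields named by `params`.

    Each enum member's string value doubles as the attribute name,
    so getattr() can fetch the corresponding field from each call.
    """
    return all(
        getattr(actual, p.value) == getattr(expected, p.value) for p in params
    )

actual = ToolCall(name="get_weather",
                  input_parameters={"location": "New York"},
                  output={"temp": 72})
expected = ToolCall(name="get_weather",
                    input_parameters={"location": "New York"})

print(fields_match(actual, expected, [ToolCallParams.INPUT_PARAMETERS]))  # True
print(fields_match(actual, expected, [ToolCallParams.OUTPUT]))            # False
```

Because the enum values equal the attribute names, adding a new comparable field only requires adding one enum member.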

### Conversational Test Case

Test case for evaluating multi-turn conversational interactions.

```python { .api }
class ConversationalTestCase:
    """
    Represents a multi-turn conversational test case.

    Parameters:
    - turns (List[Turn]): List of conversation turns
    - scenario (str, optional): Scenario description
    - context (List[str], optional): Context information
    - name (str, optional): Name of the test case
    - user_description (str, optional): Description of the user
    - expected_outcome (str, optional): Expected outcome of the conversation
    - chatbot_role (str, optional): Role of the chatbot
    - additional_metadata (Dict, optional): Additional metadata
    - comments (str, optional): Comments
    - tags (List[str], optional): Tags for organization
    - mcp_servers (List[MCPServer], optional): MCP servers configuration
    """
```

Usage example:

```python
from deepeval.test_case import ConversationalTestCase, Turn

# Multi-turn customer support conversation
conversation = ConversationalTestCase(
    scenario="Customer inquiring about product return",
    chatbot_role="Customer support agent",
    user_description="Customer who wants to return a product",
    expected_outcome="Customer understands return process and is satisfied",
    context=["30-day return policy", "Free return shipping"],
    turns=[
        Turn(
            role="user",
            content="I want to return my purchase"
        ),
        Turn(
            role="assistant",
            content="I'd be happy to help with your return. Can you provide your order number?"
        ),
        Turn(
            role="user",
            content="My order number is #12345"
        ),
        Turn(
            role="assistant",
            content="Thank you. I've initiated your return. You'll receive a prepaid return label via email within 24 hours.",
            retrieval_context=["Order #12345 placed on 2024-01-15"]
        )
    ]
)
```

### Turn

Represents a single turn in a conversation.

```python { .api }
class Turn:
    """
    Represents a single turn in a conversation.

    Parameters:
    - role (Literal["user", "assistant"]): Role of the speaker
    - content (str): Content of the turn
    - user_id (str, optional): User identifier
    - retrieval_context (List[str], optional): Retrieved context for this turn
    - tools_called (List[ToolCall], optional): Tools called during this turn
    - mcp_tools_called (List[MCPToolCall], optional): MCP tools called
    - mcp_resources_called (List[MCPResourceCall], optional): MCP resources called
    - mcp_prompts_called (List[MCPPromptCall], optional): MCP prompts called
    - additional_metadata (Dict, optional): Additional metadata
    """
```

Usage example:

```python
from deepeval.test_case import Turn, ToolCall

# Assistant turn with tool usage
turn = Turn(
    role="assistant",
    content="I've checked the weather for you. It's currently 72°F and sunny in New York.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"city": "New York"},
            output={"temp": 72, "condition": "sunny"}
        )
    ],
    retrieval_context=["User prefers Fahrenheit for temperature"]
)
```

### Turn Parameters

Enumeration of turn parameters for use with conversational metrics.

```python { .api }
class TurnParams:
    """
    Enumeration of turn parameters.

    Values:
    - ROLE: "role"
    - CONTENT: "content"
    - SCENARIO: "scenario"
    - EXPECTED_OUTCOME: "expected_outcome"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - MCP_TOOLS: "mcp_tools_called"
    - MCP_RESOURCES: "mcp_resources_called"
    - MCP_PROMPTS: "mcp_prompts_called"
    """
```
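As with the other parameter enums, each member's string value names a field on a turn. The standalone sketch below (a minimal `Turn` dataclass, a `TurnParams` enum covering a few of the documented values, and a hypothetical `collect` helper — none of them DeepEval's own code) shows how a metric could pull one field from every turn in a conversation by enum value.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

# Standalone stand-ins mirroring a subset of the documented values.
class TurnParams(Enum):
    ROLE = "role"
    CONTENT = "content"
    RETRIEVAL_CONTEXT = "retrieval_context"

@dataclass
class Turn:
    role: str
    content: str
    retrieval_context: List[str] = field(default_factory=list)

def collect(turns: List[Turn], param: TurnParams) -> list:
    """Hypothetical helper: gather one field from every turn,
    using the enum's string value as the attribute name."""
    return [getattr(t, param.value) for t in turns]

turns = [
    Turn(role="user", content="I want to return my purchase"),
    Turn(role="assistant", content="Sure, what's your order number?",
         retrieval_context=["30-day return policy"]),
]

print(collect(turns, TurnParams.ROLE))  # ['user', 'assistant']
```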

### Multimodal LLM Test Case

Test case for evaluating multimodal LLM interactions involving text and images.

```python { .api }
class MLLMTestCase:
    """
    Represents a test case for multimodal LLMs (text + images).

    Parameters:
    - input (List[Union[str, MLLMImage]]): Input with text and images
    - actual_output (List[Union[str, MLLMImage]]): Actual output
    - expected_output (List[Union[str, MLLMImage]], optional): Expected output
    - context (List[Union[str, MLLMImage]], optional): Context
    - retrieval_context (List[Union[str, MLLMImage]], optional): Retrieved context
    - additional_metadata (Dict, optional): Additional metadata
    - comments (str, optional): Comments
    - tools_called (List[ToolCall], optional): Tools called
    - expected_tools (List[ToolCall], optional): Expected tools
    - token_cost (float, optional): Token cost
    - completion_time (float, optional): Completion time in seconds
    - name (str, optional): Name
    """
```

Usage example:

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Image description test case
mllm_test_case = MLLMTestCase(
    input=[
        "Describe what you see in this image:",
        MLLMImage(url="path/to/image.jpg", local=True)
    ],
    actual_output=["A golden retriever playing in a park with a red ball."],
    expected_output=["A dog playing with a ball in a park."]
)

# Visual question answering
vqa_test_case = MLLMTestCase(
    input=[
        "What color is the car in the image?",
        MLLMImage(url="https://example.com/car.jpg")
    ],
    actual_output=["The car is red."],
    expected_output=["Red"]
)
```

### MLLM Image

Represents an image in a multimodal test case.

```python { .api }
class MLLMImage:
    """
    Represents an image in a multimodal test case.

    Parameters:
    - url (str): URL or file path to the image
    - local (bool, optional): Whether the image is local (default: False)

    Computed Attributes (only populated for local images):
    - filename (Optional[str]): Filename extracted from URL
    - mimeType (Optional[str]): MIME type of the image
    - dataBase64 (Optional[str]): Base64 encoded image data

    Static Methods:
    - process_url(url: str) -> str: Processes a URL and returns the processed path
    - is_local_path(url: str) -> bool: Determines if a URL is a local file path
    """
```

Usage example:

```python
from deepeval.test_case import MLLMImage

# Local image
local_image = MLLMImage(
    url="/path/to/local/image.png",
    local=True
)

# Remote image
remote_image = MLLMImage(
    url="https://example.com/image.jpg"
)
```
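To give an intuition for what an `is_local_path`-style check does, here is a rough standalone sketch; the helper name `looks_like_local_path` and its scheme-based heuristic are assumptions for illustration, not DeepEval's actual implementation.

```python
from urllib.parse import urlparse

def looks_like_local_path(url: str) -> bool:
    """Rough illustrative heuristic (not DeepEval's implementation):
    treat anything without an http/https scheme as a local file path."""
    return urlparse(url).scheme not in ("http", "https")

print(looks_like_local_path("/path/to/local/image.png"))       # True
print(looks_like_local_path("https://example.com/image.jpg"))  # False
```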

### MLLM Test Case Parameters

Enumeration of multimodal test case parameters.

```python { .api }
class MLLMTestCaseParams:
    """
    Enumeration of multimodal test case parameters.

    Values:
    - INPUT: "input"
    - ACTUAL_OUTPUT: "actual_output"
    - EXPECTED_OUTPUT: "expected_output"
    - CONTEXT: "context"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - EXPECTED_TOOLS: "expected_tools"
    """
```

### Arena Test Case

Test case for comparing multiple LLM outputs in arena-style evaluation.

```python { .api }
class ArenaTestCase:
    """
    Represents a test case for comparing multiple LLM outputs (arena-style).

    Parameters:
    - contestants (Dict[str, LLMTestCase]): Dictionary mapping contestant names to test cases
    """
```

Usage example:

```python
from deepeval.test_case import ArenaTestCase, LLMTestCase
from deepeval.metrics import ArenaGEval

# Compare outputs from different models
arena_test = ArenaTestCase(
    contestants={
        "gpt-4": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Lines of code flow\nBugs emerge, then disappear\nSoftware takes its form"
        ),
        "claude-3": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Keys click through the night\nAlgorithms come alive\nCode compiles at dawn"
        ),
        "gemini-pro": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Functions nested deep\nVariables dance in loops\nPrograms bloom to life"
        )
    }
)

# Evaluate which is best
arena_metric = ArenaGEval(
    name="Haiku Quality",
    criteria="Determine which haiku best captures the essence of coding"
)
arena_metric.measure(arena_test)
print(f"Winner: {arena_metric.winner}")  # Name of the winning contestant
```

### Arena

Container for multiple arena test cases.

```python { .api }
class Arena:
    """
    Container for managing multiple arena test cases.

    Parameters:
    - test_cases (List[ArenaTestCase]): List of arena test cases to manage
    """
```

Usage example:

```python
from deepeval.test_case import Arena, ArenaTestCase, LLMTestCase

# Create multiple arena test cases
arena = Arena(test_cases=[
    ArenaTestCase(contestants={
        "model-a": LLMTestCase(input="Question 1", actual_output="Answer A1"),
        "model-b": LLMTestCase(input="Question 1", actual_output="Answer B1")
    }),
    ArenaTestCase(contestants={
        "model-a": LLMTestCase(input="Question 2", actual_output="Answer A2"),
        "model-b": LLMTestCase(input="Question 2", actual_output="Answer B2")
    })
])
```

### MCP Types

Model Context Protocol (MCP) support for advanced tool and resource management.

```python { .api }
class MCPServer:
    """
    Represents an MCP (Model Context Protocol) server configuration.

    Parameters:
    - server_name (str): Name of the server
    - transport (Literal["stdio", "sse", "streamable-http"], optional): Transport protocol
    - available_tools (List, optional): Available tools
    - available_resources (List, optional): Available resources
    - available_prompts (List, optional): Available prompts
    """

class MCPToolCall(BaseModel):
    """
    Represents an MCP tool call.

    Parameters:
    - name (str): Name of the tool
    - args (Dict): Tool arguments
    - result (object): Tool execution result
    """

class MCPResourceCall(BaseModel):
    """
    Represents an MCP resource call.

    Parameters:
    - uri (AnyUrl): URI of the resource (pydantic AnyUrl type)
    - result (object): Resource retrieval result
    """

class MCPPromptCall(BaseModel):
    """
    Represents an MCP prompt call.

    Parameters:
    - name (str): Name of the prompt
    - result (object): Prompt execution result
    """
```

Usage example:

```python
from deepeval.test_case import LLMTestCase, MCPServer, MCPToolCall

# Test case with MCP server usage
mcp_test_case = LLMTestCase(
    input="Search for Python tutorials",
    actual_output="Here are the top Python tutorials I found...",
    mcp_servers=[
        MCPServer(
            server_name="search-server",
            transport="stdio",
            available_tools=["web_search", "database_query"]
        )
    ],
    mcp_tools_called=[
        MCPToolCall(
            name="web_search",
            args={"query": "Python tutorials", "limit": 10},
            result={"count": 10, "results": [...]}
        )
    ]
)
```