# Test Cases

Test cases are structured containers representing LLM interactions to be evaluated. DeepEval provides specialized test case classes for different evaluation scenarios: standard LLM tests, multi-turn conversations, multimodal inputs, and arena-style comparisons.

## Imports

```python
from deepeval.test_case import (
    LLMTestCase,
    LLMTestCaseParams,
    ConversationalTestCase,
    Turn,
    TurnParams,
    MLLMTestCase,
    MLLMImage,
    MLLMTestCaseParams,
    ArenaTestCase,
    Arena,
    ToolCall,
    ToolCallParams,
    MCPServer,
    MCPToolCall,
    MCPPromptCall,
    MCPResourceCall
)
```

## Capabilities

### LLM Test Case

Standard test case for evaluating single LLM interactions, supporting inputs, outputs, context, and tool usage.

```python { .api }
class LLMTestCase:
    """
    Represents a test case for evaluating LLM outputs.

    Parameters:
    - input (str): Input prompt to the LLM
    - actual_output (str, optional): Actual output from the LLM
    - expected_output (str, optional): Expected output
    - context (List[str], optional): Context information
    - retrieval_context (List[str], optional): Retrieved context for RAG applications
    - additional_metadata (Dict, optional): Additional metadata
    - tools_called (List[ToolCall], optional): Tools called by the LLM
    - expected_tools (List[ToolCall], optional): Expected tools to be called
    - comments (str, optional): Comments about the test case
    - token_cost (float, optional): Token cost of the interaction
    - completion_time (float, optional): Time to complete in seconds
    - name (str, optional): Name of the test case
    - tags (List[str], optional): Tags for organization
    - mcp_servers (List[MCPServer], optional): MCP servers configuration
    - mcp_tools_called (List[MCPToolCall], optional): MCP tools called
    - mcp_resources_called (List[MCPResourceCall], optional): MCP resources called
    - mcp_prompts_called (List[MCPPromptCall], optional): MCP prompts called
    """
```

Usage example:

```python
from deepeval.test_case import LLMTestCase, ToolCall

# Basic test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris"
)

# RAG test case with retrieval context
rag_test_case = LLMTestCase(
    input="What's our refund policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="30-day full refund policy",
    retrieval_context=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Refunds are processed within 5-7 business days."
    ],
    context=["Customer support FAQ"]
)

# Agentic test case with tool calls
agentic_test_case = LLMTestCase(
    input="What's the weather in New York?",
    actual_output="The current weather in New York is 72°F and sunny.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"location": "New York", "unit": "fahrenheit"},
            output={"temperature": 72, "condition": "sunny"}
        )
    ],
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"location": "New York"})
    ]
)
```

### LLM Test Case Parameters

Enumeration of test case parameters for use with metrics.

```python { .api }
class LLMTestCaseParams:
    """
    Enumeration of test case parameters.

    Values:
    - INPUT: "input"
    - ACTUAL_OUTPUT: "actual_output"
    - EXPECTED_OUTPUT: "expected_output"
    - CONTEXT: "context"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - EXPECTED_TOOLS: "expected_tools"
    - MCP_SERVERS: "mcp_servers"
    - MCP_TOOLS_CALLED: "mcp_tools_called"
    - MCP_RESOURCES_CALLED: "mcp_resources_called"
    - MCP_PROMPTS_CALLED: "mcp_prompts_called"
    """
```

Usage example:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Use params to specify what to evaluate
metric = GEval(
    name="Answer Relevancy",
    criteria="Determine if the actual output is relevant to the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
```

### Tool Call

Represents a tool call made by an LLM or expected to be called.

```python { .api }
class ToolCall:
    """
    Represents a tool call made by an LLM.

    Parameters:
    - name (str): Name of the tool
    - description (str, optional): Description of the tool
    - reasoning (str, optional): Reasoning for calling the tool
    - output (Any, optional): Output from the tool
    - input_parameters (Dict[str, Any], optional): Input parameters to the tool
    """
```

Usage example:

```python
from deepeval.test_case import ToolCall

# Define a tool call
tool_call = ToolCall(
    name="search_database",
    description="Searches the product database",
    reasoning="Need to find product information",
    input_parameters={
        "query": "wireless headphones",
        "max_results": 10
    },
    output=[
        {"id": 1, "name": "Premium Wireless Headphones"},
        {"id": 2, "name": "Budget Wireless Headphones"}
    ]
)
```

### Tool Call Parameters

Enumeration of tool call parameters.

```python { .api }
class ToolCallParams:
    """
    Enumeration of tool call parameters.

    Values:
    - INPUT_PARAMETERS: "input_parameters"
    - OUTPUT: "output"
    """
```

### Conversational Test Case

Test case for evaluating multi-turn conversational interactions.

```python { .api }
class ConversationalTestCase:
    """
    Represents a multi-turn conversational test case.

    Parameters:
    - turns (List[Turn]): List of conversation turns
    - scenario (str, optional): Scenario description
    - context (List[str], optional): Context information
    - name (str, optional): Name of the test case
    - user_description (str, optional): Description of the user
    - expected_outcome (str, optional): Expected outcome of the conversation
    - chatbot_role (str, optional): Role of the chatbot
    - additional_metadata (Dict, optional): Additional metadata
    - comments (str, optional): Comments
    - tags (List[str], optional): Tags for organization
    - mcp_servers (List[MCPServer], optional): MCP servers configuration
    """
```

Usage example:

```python
from deepeval.test_case import ConversationalTestCase, Turn

# Multi-turn customer support conversation
conversation = ConversationalTestCase(
    scenario="Customer inquiring about product return",
    chatbot_role="Customer support agent",
    user_description="Customer who wants to return a product",
    expected_outcome="Customer understands return process and is satisfied",
    context=["30-day return policy", "Free return shipping"],
    turns=[
        Turn(role="user", content="I want to return my purchase"),
        Turn(
            role="assistant",
            content="I'd be happy to help with your return. Can you provide your order number?"
        ),
        Turn(role="user", content="My order number is #12345"),
        Turn(
            role="assistant",
            content="Thank you. I've initiated your return. You'll receive a prepaid return label via email within 24 hours.",
            retrieval_context=["Order #12345 placed on 2024-01-15"]
        )
    ]
)
```

### Turn

Represents a single turn in a conversation.

```python { .api }
class Turn:
    """
    Represents a single turn in a conversation.

    Parameters:
    - role (Literal["user", "assistant"]): Role of the speaker
    - content (str): Content of the turn
    - user_id (str, optional): User identifier
    - retrieval_context (List[str], optional): Retrieved context for this turn
    - tools_called (List[ToolCall], optional): Tools called during this turn
    - mcp_tools_called (List[MCPToolCall], optional): MCP tools called
    - mcp_resources_called (List[MCPResourceCall], optional): MCP resources called
    - mcp_prompts_called (List[MCPPromptCall], optional): MCP prompts called
    - additional_metadata (Dict, optional): Additional metadata
    """
```

Usage example:

```python
from deepeval.test_case import Turn, ToolCall

# Assistant turn with tool usage
turn = Turn(
    role="assistant",
    content="I've checked the weather for you. It's currently 72°F and sunny in New York.",
    tools_called=[
        ToolCall(
            name="get_weather",
            input_parameters={"city": "New York"},
            output={"temp": 72, "condition": "sunny"}
        )
    ],
    retrieval_context=["User prefers Fahrenheit for temperature"]
)
```

### Turn Parameters

Enumeration of turn parameters for use with conversational metrics.

```python { .api }
class TurnParams:
    """
    Enumeration of turn parameters.

    Values:
    - ROLE: "role"
    - CONTENT: "content"
    - SCENARIO: "scenario"
    - EXPECTED_OUTCOME: "expected_outcome"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - MCP_TOOLS: "mcp_tools_called"
    - MCP_RESOURCES: "mcp_resources_called"
    - MCP_PROMPTS: "mcp_prompts_called"
    """
```

### Multimodal LLM Test Case

Test case for evaluating multimodal LLM interactions involving text and images.

```python { .api }
class MLLMTestCase:
    """
    Represents a test case for multimodal LLMs (text + images).

    Parameters:
    - input (List[Union[str, MLLMImage]]): Input with text and images
    - actual_output (List[Union[str, MLLMImage]]): Actual output
    - expected_output (List[Union[str, MLLMImage]], optional): Expected output
    - context (List[Union[str, MLLMImage]], optional): Context
    - retrieval_context (List[Union[str, MLLMImage]], optional): Retrieved context
    - additional_metadata (Dict, optional): Additional metadata
    - comments (str, optional): Comments
    - tools_called (List[ToolCall], optional): Tools called
    - expected_tools (List[ToolCall], optional): Expected tools
    - token_cost (float, optional): Token cost
    - completion_time (float, optional): Completion time in seconds
    - name (str, optional): Name
    """
```

Usage example:

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Image description test case
mllm_test_case = MLLMTestCase(
    input=[
        "Describe what you see in this image:",
        MLLMImage(url="path/to/image.jpg", local=True)
    ],
    actual_output=["A golden retriever playing in a park with a red ball."],
    expected_output=["A dog playing with a ball in a park."]
)

# Visual question answering
vqa_test_case = MLLMTestCase(
    input=[
        "What color is the car in the image?",
        MLLMImage(url="https://example.com/car.jpg")
    ],
    actual_output=["The car is red."],
    expected_output=["Red"]
)
```

### MLLM Image

Represents an image in a multimodal test case.

```python { .api }
class MLLMImage:
    """
    Represents an image in a multimodal test case.

    Parameters:
    - url (str): URL or file path to the image
    - local (bool, optional): Whether the image is local (default: False)

    Computed Attributes (only populated for local images):
    - filename (Optional[str]): Filename extracted from URL
    - mimeType (Optional[str]): MIME type of the image
    - dataBase64 (Optional[str]): Base64 encoded image data

    Static Methods:
    - process_url(url: str) -> str: Processes a URL and returns the processed path
    - is_local_path(url: str) -> bool: Determines if a URL is a local file path
    """
```

Usage example:

```python
from deepeval.test_case import MLLMImage

# Local image
local_image = MLLMImage(
    url="/path/to/local/image.png",
    local=True
)

# Remote image
remote_image = MLLMImage(
    url="https://example.com/image.jpg"
)
```

### MLLM Test Case Parameters

Enumeration of multimodal test case parameters.

```python { .api }
class MLLMTestCaseParams:
    """
    Enumeration of multimodal test case parameters.

    Values:
    - INPUT: "input"
    - ACTUAL_OUTPUT: "actual_output"
    - EXPECTED_OUTPUT: "expected_output"
    - CONTEXT: "context"
    - RETRIEVAL_CONTEXT: "retrieval_context"
    - TOOLS_CALLED: "tools_called"
    - EXPECTED_TOOLS: "expected_tools"
    """
```

### Arena Test Case

Test case for comparing multiple LLM outputs in arena-style evaluation.

```python { .api }
class ArenaTestCase:
    """
    Represents a test case for comparing multiple LLM outputs (arena-style).

    Parameters:
    - contestants (Dict[str, LLMTestCase]): Dictionary mapping contestant names to test cases
    """
```

Usage example:

```python
from deepeval.test_case import ArenaTestCase, LLMTestCase
from deepeval.metrics import ArenaGEval

# Compare outputs from different models
arena_test = ArenaTestCase(
    contestants={
        "gpt-4": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Lines of code flow\nBugs emerge, then disappear\nSoftware takes its form"
        ),
        "claude-3": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Keys click through the night\nAlgorithms come alive\nCode compiles at dawn"
        ),
        "gemini-pro": LLMTestCase(
            input="Write a haiku about coding",
            actual_output="Functions nested deep\nVariables dance in loops\nPrograms bloom to life"
        )
    }
)

# Evaluate which is best
arena_metric = ArenaGEval(
    name="Haiku Quality",
    criteria="Determine which haiku best captures the essence of coding"
)
arena_metric.measure(arena_test)
print(f"Winner: {arena_metric.winner}")  # Name of the winning contestant
```

### Arena

Container for multiple arena test cases.

```python { .api }
class Arena:
    """
    Container for managing multiple arena test cases.

    Parameters:
    - test_cases (List[ArenaTestCase]): List of arena test cases to manage
    """
```

Usage example:

```python
from deepeval.test_case import Arena, ArenaTestCase, LLMTestCase

# Create multiple arena test cases
arena = Arena(test_cases=[
    ArenaTestCase(contestants={
        "model-a": LLMTestCase(input="Question 1", actual_output="Answer A1"),
        "model-b": LLMTestCase(input="Question 1", actual_output="Answer B1")
    }),
    ArenaTestCase(contestants={
        "model-a": LLMTestCase(input="Question 2", actual_output="Answer A2"),
        "model-b": LLMTestCase(input="Question 2", actual_output="Answer B2")
    })
])
```

### MCP Types

Model Context Protocol (MCP) support for advanced tool and resource management.

```python { .api }
class MCPServer:
    """
    Represents an MCP (Model Context Protocol) server configuration.

    Parameters:
    - server_name (str): Name of the server
    - transport (Literal["stdio", "sse", "streamable-http"], optional): Transport protocol
    - available_tools (List, optional): Available tools
    - available_resources (List, optional): Available resources
    - available_prompts (List, optional): Available prompts
    """

class MCPToolCall(BaseModel):
    """
    Represents an MCP tool call.

    Parameters:
    - name (str): Name of the tool
    - args (Dict): Tool arguments
    - result (object): Tool execution result
    """

class MCPResourceCall(BaseModel):
    """
    Represents an MCP resource call.

    Parameters:
    - uri (AnyUrl): URI of the resource (pydantic AnyUrl type)
    - result (object): Resource retrieval result
    """

class MCPPromptCall(BaseModel):
    """
    Represents an MCP prompt call.

    Parameters:
    - name (str): Name of the prompt
    - result (object): Prompt execution result
    """
```

Usage example:

```python
from deepeval.test_case import LLMTestCase, MCPServer, MCPToolCall

# Test case with MCP server usage
mcp_test_case = LLMTestCase(
    input="Search for Python tutorials",
    actual_output="Here are the top Python tutorials I found...",
    mcp_servers=[
        MCPServer(
            server_name="search-server",
            transport="stdio",
            available_tools=["web_search", "database_query"]
        )
    ],
    mcp_tools_called=[
        MCPToolCall(
            name="web_search",
            args={"query": "Python tutorials", "limit": 10},
            result={"count": 10, "results": [...]}
        )
    ]
)
```