# Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, role adherence, and conversational quality. These metrics work with `ConversationalTestCase` objects.

## Imports
```python
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
    MultiTurnMCPUseMetric,
    ConversationalDAGMetric,
)
```

## Capabilities

### Conversational G-Eval

G-Eval for conversational test cases, allowing custom evaluation criteria for multi-turn conversations.
```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - evaluation_steps (List[str], optional): Steps for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create custom conversational metric
metric = ConversationalGEval(
    name="Customer Satisfaction",
    criteria="Evaluate if the conversation leads to customer satisfaction",
    evaluation_params=[
        TurnParams.CONTENT,
        TurnParams.SCENARIO,
        TurnParams.EXPECTED_OUTCOME
    ],
    evaluation_steps=[
        "Analyze if agent addressed customer concerns",
        "Check if agent was polite and professional",
        "Evaluate if the expected outcome was achieved"
    ],
    threshold=0.7
)

# Create conversational test case
conversation = ConversationalTestCase(
    scenario="Customer wants to return a defective product",
    expected_outcome="Customer receives return label and is satisfied",
    turns=[
        Turn(role="user", content="My product arrived broken"),
        Turn(role="assistant", content="I'm sorry to hear that. Can you provide your order number?"),
        Turn(role="user", content="Order #12345"),
        Turn(role="assistant", content="I've initiated a return. You'll receive a prepaid label via email.")
    ]
)

# Evaluate
metric.measure(conversation)
print(f"Customer satisfaction score: {metric.score:.2f}")
print(f"Reason: {metric.reason}")
```

### Turn Relevancy Metric

Measures relevancy of conversation turns to the overall scenario and context.
```python { .api }
class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO

    Attributes:
    - score (float): Turn relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant turns
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import TurnRelevancyMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = TurnRelevancyMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Customer inquiring about shipping status",
    turns=[
        Turn(role="user", content="Where is my order?"),
        Turn(role="assistant", content="Let me check. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="By the way, did you know we have a new product line?"),  # Irrelevant
        Turn(role="assistant", content="Your order is out for delivery today")
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Irrelevant turns detected: {metric.reason}")
```
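Conceptually, the relevancy score behaves like a ratio: the fraction of assistant turns an LLM judge deems relevant to the scenario. The sketch below is illustrative only; it substitutes a hypothetical keyword-based judge (`is_relevant`, `scenario_keywords` are made up for this example) for the LLM and is not deepeval's implementation:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str
    content: str

def is_relevant(turn: Turn, scenario_keywords: set[str]) -> bool:
    # Hypothetical stand-in for the LLM judge: a turn counts as "relevant"
    # if it mentions any scenario keyword.
    words = {w.strip("?#.,!").lower() for w in turn.content.split()}
    return bool(words & scenario_keywords)

def turn_relevancy_score(turns: list[Turn], scenario_keywords: set[str]) -> float:
    # Conceptual score: relevant assistant turns / total assistant turns.
    assistant_turns = [t for t in turns if t.role == "assistant"]
    if not assistant_turns:
        return 1.0
    relevant = sum(is_relevant(t, scenario_keywords) for t in assistant_turns)
    return relevant / len(assistant_turns)

turns = [
    Turn("user", "Where is my order?"),
    Turn("assistant", "Let me check. What's your order number?"),
    Turn("user", "#12345"),
    Turn("assistant", "Did you know we have a new product line?"),  # irrelevant
    Turn("assistant", "Your order is out for delivery today"),
]
# 2 of 3 assistant turns touch the scenario -> score of 2/3
score = turn_relevancy_score(turns, {"order", "delivery", "shipping", "number"})
```

The real metric uses an evaluation model rather than keywords, but the ratio intuition carries over: a score below threshold means some turns drifted off-scenario.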

### Conversation Completeness Metric

Evaluates completeness of conversations based on expected outcomes and scenario requirements.
```python { .api }
class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO
    - EXPECTED_OUTCOME

    Attributes:
    - score (float): Completeness score (0-1)
    - reason (str): Explanation of what's incomplete
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import ConversationCompletenessMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = ConversationCompletenessMetric(threshold=0.8)

# Incomplete conversation
incomplete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345")
        # Conversation ends without address change
    ]
)

metric.measure(incomplete_conversation)

if not metric.success:
    print(f"Incomplete: {metric.reason}")
    # Example: "Expected outcome 'address is updated' was not achieved"

# Complete conversation
complete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="What's the new address?"),
        Turn(role="user", content="123 Main St, New York, NY 10001"),
        Turn(role="assistant", content="Updated! Your order will ship to 123 Main St, New York, NY 10001")
    ]
)

metric.measure(complete_conversation)
print(f"Completeness: {metric.score:.2f}")
```
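The intuition behind completeness can be reduced to a ratio: how many of the user's intentions were satisfied, out of all intentions raised in the conversation. A minimal sketch, with the intentions and satisfaction judgments supplied by hand rather than extracted by an LLM as the real metric does:

```python
def completeness_score(user_intentions: list[str], satisfied: set[str]) -> float:
    # Conceptual score: satisfied user intentions / total user intentions.
    if not user_intentions:
        return 1.0
    met = sum(intent in satisfied for intent in user_intentions)
    return met / len(user_intentions)

# In the incomplete conversation above, the customer's single intention
# ("change shipping address") is never fulfilled:
incomplete = completeness_score(["change shipping address"], satisfied=set())

# In the complete conversation, it is:
complete = completeness_score(
    ["change shipping address"],
    satisfied={"change shipping address"},
)
```

A conversation that raises several intentions but resolves only some of them would land between 0 and 1, which is why a threshold like 0.8 tolerates minor loose ends but fails largely unresolved conversations.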

### Role Adherence Metric

Measures adherence to assigned role in conversations.
```python { .api }
class RoleAdherenceMetric:
    """
    Measures adherence to assigned role in conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - TURNS
    - CHATBOT_ROLE (the `chatbot_role` defined on the test case)

    Attributes:
    - score (float): Role adherence score (0-1)
    - reason (str): Explanation of role violations
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = RoleAdherenceMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Technical support for printer issue",
    chatbot_role="Technical support specialist for printers",
    turns=[
        Turn(role="user", content="My printer won't print"),
        Turn(role="assistant", content="Let me help you troubleshoot. Is the printer powered on?"),
        Turn(role="user", content="Yes, it's on"),
        Turn(role="assistant", content="Check the paper tray and ink levels"),
        Turn(role="user", content="How's the weather today?"),
        Turn(role="assistant", content="The weather is sunny, 75°F.")  # Role violation
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Role violation: {metric.reason}")
```
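The score here can likewise be thought of as the fraction of assistant turns that stay in character. A toy sketch, using a hypothetical topic check (`PRINTER_TOPICS` is invented for this example) in place of the LLM judge that the real metric uses:

```python
PRINTER_TOPICS = {"printer", "print", "paper", "ink", "troubleshoot", "powered", "tray"}

def in_character(content: str) -> bool:
    # Hypothetical judge: an assistant turn is in character for a
    # printer-support specialist if it touches a printer-related topic.
    words = {w.strip(".,?!").lower() for w in content.split()}
    return bool(words & PRINTER_TOPICS)

replies = [
    "Let me help you troubleshoot. Is the printer powered on?",
    "Check the paper tray and ink levels",
    "The weather is sunny, 75°F.",  # out of character
]
# 2 of 3 replies stay in character -> score of 2/3, below a 0.8 threshold
score = sum(in_character(r) for r in replies) / len(replies)
```

This also illustrates why the conversation above fails at `threshold=0.8`: a single off-role turn in a short conversation drags the ratio well below the bar.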

### Multi-Turn MCP Use Metric

Evaluates MCP (Model Context Protocol) usage across multiple conversation turns.
```python { .api }
class MultiTurnMCPUseMetric:
    """
    Evaluates MCP usage across multiple turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Required Test Case Parameters:
    - TURNS (with MCP tools/resources/prompts)
    - MCP_SERVERS

    Attributes:
    - score (float): MCP usage score (0-1)
    - reason (str): Explanation of MCP usage quality
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import MultiTurnMCPUseMetric
from deepeval.test_case import (
    ConversationalTestCase,
    Turn,
    MCPServer,
    MCPToolCall
)

metric = MultiTurnMCPUseMetric(threshold=0.7)

conversation = ConversationalTestCase(
    scenario="Research assistant helping with data analysis",
    mcp_servers=[
        MCPServer(
            server_name="data-server",
            available_tools=["query_database", "generate_chart"]
        )
    ],
    turns=[
        Turn(
            role="user",
            content="Show me sales data for Q1"
        ),
        Turn(
            role="assistant",
            content="Here's the Q1 sales data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="query_database",
                    arguments={"query": "SELECT * FROM sales WHERE quarter='Q1'"}
                )
            ]
        ),
        Turn(
            role="user",
            content="Can you create a chart?"
        ),
        Turn(
            role="assistant",
            content="Here's a chart of the data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="generate_chart",
                    arguments={"data": [...], "type": "bar"}
                )
            ]
        )
    ]
)

metric.measure(conversation)
```

### Conversational DAG Metric

DAG (Deep Acyclic Graph) metric for conversational flows.
```python { .api }
class ConversationalDAGMetric:
    """
    DAG metric for conversational flows.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for conversation evaluation
    - threshold (float): Success threshold (default: 0.5)

    Required Test Case Parameters:
    - TURNS

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """
```

Usage example:
```python
from deepeval.metrics import ConversationalDAGMetric, DeepAcyclicGraph
from deepeval.test_case import ConversationalTestCase, Turn

# Define conversation flow DAG
conversation_dag = DeepAcyclicGraph()
conversation_dag.add_node("greeting", "Agent greets customer")
conversation_dag.add_node("identify_issue", "Identify customer issue")
conversation_dag.add_node("resolve_issue", "Resolve the issue")
conversation_dag.add_node("confirm_resolution", "Confirm issue is resolved")

conversation_dag.add_edge("greeting", "identify_issue")
conversation_dag.add_edge("identify_issue", "resolve_issue")
conversation_dag.add_edge("resolve_issue", "confirm_resolution")

# Create metric
metric = ConversationalDAGMetric(
    name="Support Flow",
    dag=conversation_dag,
    threshold=0.8
)

# Evaluate conversation against DAG
conversation = ConversationalTestCase(
    scenario="Customer support interaction",
    turns=[
        Turn(role="assistant", content="Hello! How can I help you today?"),  # greeting
        Turn(role="user", content="My order hasn't arrived"),
        Turn(role="assistant", content="Let me look that up for you"),  # identify_issue
        Turn(role="assistant", content="I've located your order and will expedite it"),  # resolve_issue
        Turn(role="assistant", content="Is there anything else I can help with?")  # confirm_resolution
    ]
)

metric.measure(conversation)
```
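The core idea can be sketched without deepeval: label each turn with the flow stage it represents, then verify that consecutive stages respect the graph's edges. The stage labels below are hypothetical; in practice the metric's evaluation model infers them from the turn contents:

```python
# Edges of the support flow: each stage may only follow its predecessor.
edges = {
    ("greeting", "identify_issue"),
    ("identify_issue", "resolve_issue"),
    ("resolve_issue", "confirm_resolution"),
}

def follows_dag(stages: list[str]) -> bool:
    # Every consecutive pair of distinct stages must be an edge in the DAG;
    # repeating a stage (e.g. two greeting turns) is allowed.
    return all(
        a == b or (a, b) in edges
        for a, b in zip(stages, stages[1:])
    )

observed = ["greeting", "identify_issue", "resolve_issue", "confirm_resolution"]
out_of_order = ["greeting", "resolve_issue", "identify_issue"]
```

Here `follows_dag(observed)` holds while `follows_dag(out_of_order)` does not, mirroring how a conversation that resolves an issue before identifying it would score poorly against the DAG.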

## Comprehensive Conversational Evaluation

Evaluate all conversational aspects together:
```python
from deepeval import evaluate
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
)
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create comprehensive conversational metrics
conv_metrics = [
    ConversationalGEval(
        name="Overall Quality",
        criteria="Evaluate conversation quality and helpfulness",
        evaluation_params=[TurnParams.CONTENT, TurnParams.SCENARIO],
        threshold=0.7
    ),
    TurnRelevancyMetric(threshold=0.8),
    ConversationCompletenessMetric(threshold=0.8),
    RoleAdherenceMetric(threshold=0.8)
]

# Test conversations
conversations = [
    ConversationalTestCase(
        scenario="Product inquiry",
        chatbot_role="Sales assistant",
        expected_outcome="Customer receives product information",
        turns=[...]
    ),
    # ... more conversations
]

# Evaluate
result = evaluate(conversations, conv_metrics)

# Analyze results
for test_result in result.test_results:
    print(f"\nConversation: {test_result.name}")
    for metric_data in test_result.metrics_data:
        status = "✓" if metric_data.success else "✗"
        print(f"  {status} {metric_data.name}: {metric_data.score:.2f}")
```

## Evaluating Chatbot Personality

Use `ConversationalGEval` to evaluate personality traits:
```python
from deepeval import evaluate
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, TurnParams

# Evaluate empathy
empathy_metric = ConversationalGEval(
    name="Empathy",
    criteria="Evaluate if the chatbot shows empathy and understanding of user emotions",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate professionalism
professionalism_metric = ConversationalGEval(
    name="Professionalism",
    criteria="Evaluate if the chatbot maintains professional tone and language",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate helpfulness
helpfulness_metric = ConversationalGEval(
    name="Helpfulness",
    criteria="Evaluate if the chatbot provides helpful and actionable information",
    evaluation_params=[TurnParams.CONTENT, TurnParams.EXPECTED_OUTCOME],
    threshold=0.8
)

personality_metrics = [empathy_metric, professionalism_metric, helpfulness_metric]

# Evaluate chatbot personality (`conversations` as defined in the previous section)
result = evaluate(conversations, personality_metrics)
```