
# Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, role adherence, and conversational quality. These metrics work with `ConversationalTestCase` objects.

## Imports

```python
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
    MultiTurnMCPUseMetric,
    ConversationalDAGMetric
)
```

## Capabilities

### Conversational G-Eval

G-Eval for conversational test cases, allowing custom evaluation criteria for multi-turn conversations.

```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - evaluation_steps (List[str], optional): Steps for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create custom conversational metric
metric = ConversationalGEval(
    name="Customer Satisfaction",
    criteria="Evaluate if the conversation leads to customer satisfaction",
    evaluation_params=[
        TurnParams.CONTENT,
        TurnParams.SCENARIO,
        TurnParams.EXPECTED_OUTCOME
    ],
    evaluation_steps=[
        "Analyze if agent addressed customer concerns",
        "Check if agent was polite and professional",
        "Evaluate if the expected outcome was achieved"
    ],
    threshold=0.7
)

# Create conversational test case
conversation = ConversationalTestCase(
    scenario="Customer wants to return a defective product",
    expected_outcome="Customer receives return label and is satisfied",
    turns=[
        Turn(role="user", content="My product arrived broken"),
        Turn(role="assistant", content="I'm sorry to hear that. Can you provide your order number?"),
        Turn(role="user", content="Order #12345"),
        Turn(role="assistant", content="I've initiated a return. You'll receive a prepaid label via email.")
    ]
)

# Evaluate
metric.measure(conversation)
print(f"Customer satisfaction score: {metric.score:.2f}")
print(f"Reason: {metric.reason}")
```
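
The `threshold` and `strict_mode` parameters determine `success` differently: with `strict_mode` enabled the metric is binary, passing only on a perfect score. A toy, library-free sketch of that gating logic (illustrative only, not DeepEval's internals):

```python
# Toy illustration (not DeepEval's internals): how `threshold` and
# `strict_mode` could translate a raw score into a success flag.
def passes(score: float, threshold: float = 0.5, strict_mode: bool = False) -> bool:
    """strict_mode enforces a binary outcome: only a perfect score passes."""
    if strict_mode:
        return score == 1.0
    return score >= threshold

print(passes(0.72, threshold=0.7))                    # True
print(passes(0.72, threshold=0.7, strict_mode=True))  # False
```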

### Turn Relevancy Metric

Measures relevancy of conversation turns to the overall scenario and context.

```python { .api }
class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO

    Attributes:
    - score (float): Turn relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant turns
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import TurnRelevancyMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = TurnRelevancyMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Customer inquiring about shipping status",
    turns=[
        Turn(role="user", content="Where is my order?"),
        Turn(role="assistant", content="Let me check. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="By the way, did you know we have a new product line?"),  # Irrelevant
        Turn(role="assistant", content="Your order is out for delivery today")
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Irrelevant turns detected: {metric.reason}")
```

### Conversation Completeness Metric

Evaluates completeness of conversations based on expected outcomes and scenario requirements.

```python { .api }
class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO
    - EXPECTED_OUTCOME

    Attributes:
    - score (float): Completeness score (0-1)
    - reason (str): Explanation of what's incomplete
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationCompletenessMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = ConversationCompletenessMetric(threshold=0.8)

# Incomplete conversation
incomplete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345")
        # Conversation ends without address change
    ]
)

metric.measure(incomplete_conversation)

if not metric.success:
    print(f"Incomplete: {metric.reason}")
    # Example: "Expected outcome 'address is updated' was not achieved"

# Complete conversation
complete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="What's the new address?"),
        Turn(role="user", content="123 Main St, New York, NY 10001"),
        Turn(role="assistant", content="Updated! Your order will ship to 123 Main St, New York, NY 10001")
    ]
)

metric.measure(complete_conversation)
print(f"Completeness: {metric.score:.2f}")
```

### Role Adherence Metric

Measures adherence to assigned role in conversations.

```python { .api }
class RoleAdherenceMetric:
    """
    Measures adherence to assigned role in conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - TURNS
    - CHATBOT_ROLE or role defined in test case

    Attributes:
    - score (float): Role adherence score (0-1)
    - reason (str): Explanation of role violations
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = RoleAdherenceMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Technical support for printer issue",
    chatbot_role="Technical support specialist for printers",
    turns=[
        Turn(role="user", content="My printer won't print"),
        Turn(role="assistant", content="Let me help you troubleshoot. Is the printer powered on?"),
        Turn(role="user", content="Yes, it's on"),
        Turn(role="assistant", content="Check the paper tray and ink levels"),
        Turn(role="user", content="How's the weather today?"),
        Turn(role="assistant", content="The weather is sunny, 75°F.")  # Role violation
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Role violation: {metric.reason}")
```

### Multi-Turn MCP Use Metric

Evaluates MCP (Model Context Protocol) usage across multiple conversation turns.

```python { .api }
class MultiTurnMCPUseMetric:
    """
    Evaluates MCP usage across multiple turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Required Test Case Parameters:
    - TURNS (with MCP tools/resources/prompts)
    - MCP_SERVERS

    Attributes:
    - score (float): MCP usage score (0-1)
    - reason (str): Explanation of MCP usage quality
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import MultiTurnMCPUseMetric
from deepeval.test_case import (
    ConversationalTestCase,
    Turn,
    MCPServer,
    MCPToolCall
)

metric = MultiTurnMCPUseMetric(threshold=0.7)

conversation = ConversationalTestCase(
    scenario="Research assistant helping with data analysis",
    mcp_servers=[
        MCPServer(
            server_name="data-server",
            available_tools=["query_database", "generate_chart"]
        )
    ],
    turns=[
        Turn(
            role="user",
            content="Show me sales data for Q1"
        ),
        Turn(
            role="assistant",
            content="Here's the Q1 sales data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="query_database",
                    arguments={"query": "SELECT * FROM sales WHERE quarter='Q1'"}
                )
            ]
        ),
        Turn(
            role="user",
            content="Can you create a chart?"
        ),
        Turn(
            role="assistant",
            content="Here's a chart of the data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="generate_chart",
                    arguments={"data": [...], "type": "bar"}
                )
            ]
        )
    ]
)

metric.measure(conversation)
```

### Conversational DAG Metric

DAG (Deep Acyclic Graph) metric for conversational flows.

```python { .api }
class ConversationalDAGMetric:
    """
    DAG metric for conversational flows.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for conversation evaluation
    - threshold (float): Success threshold (default: 0.5)

    Required Test Case Parameters:
    - TURNS

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationalDAGMetric, DeepAcyclicGraph
from deepeval.test_case import ConversationalTestCase, Turn

# Define conversation flow DAG
conversation_dag = DeepAcyclicGraph()
conversation_dag.add_node("greeting", "Agent greets customer")
conversation_dag.add_node("identify_issue", "Identify customer issue")
conversation_dag.add_node("resolve_issue", "Resolve the issue")
conversation_dag.add_node("confirm_resolution", "Confirm issue is resolved")

conversation_dag.add_edge("greeting", "identify_issue")
conversation_dag.add_edge("identify_issue", "resolve_issue")
conversation_dag.add_edge("resolve_issue", "confirm_resolution")

# Create metric
metric = ConversationalDAGMetric(
    name="Support Flow",
    dag=conversation_dag,
    threshold=0.8
)

# Evaluate conversation against DAG
conversation = ConversationalTestCase(
    scenario="Customer support interaction",
    turns=[
        Turn(role="assistant", content="Hello! How can I help you today?"),  # greeting
        Turn(role="user", content="My order hasn't arrived"),
        Turn(role="assistant", content="Let me look that up for you"),  # identify_issue
        Turn(role="assistant", content="I've located your order and will expedite it"),  # resolve_issue
        Turn(role="assistant", content="Is there anything else I can help with?")  # confirm_resolution
    ]
)

metric.measure(conversation)
```
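
Conceptually, the DAG metric checks that the conversation's stages respect the graph's ordering. A minimal, library-free sketch of that ordering constraint (the node names match the example above; the checker itself is illustrative, not DeepEval's LLM-driven implementation):

```python
# Illustrative only: a toy check that a sequence of conversation stages
# follows the edges of a directed acyclic graph.
EDGES = {
    "greeting": {"identify_issue"},
    "identify_issue": {"resolve_issue"},
    "resolve_issue": {"confirm_resolution"},
    "confirm_resolution": set(),
}

def follows_dag(stages: list) -> bool:
    """Return True if every consecutive pair of stages is a DAG edge."""
    return all(b in EDGES.get(a, set()) for a, b in zip(stages, stages[1:]))

print(follows_dag(["greeting", "identify_issue", "resolve_issue", "confirm_resolution"]))  # True
print(follows_dag(["greeting", "resolve_issue"]))  # False: skips identify_issue
```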

## Comprehensive Conversational Evaluation

Evaluate all conversational aspects together:

```python
from deepeval import evaluate
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric
)
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create comprehensive conversational metrics
conv_metrics = [
    ConversationalGEval(
        name="Overall Quality",
        criteria="Evaluate conversation quality and helpfulness",
        evaluation_params=[TurnParams.CONTENT, TurnParams.SCENARIO],
        threshold=0.7
    ),
    TurnRelevancyMetric(threshold=0.8),
    ConversationCompletenessMetric(threshold=0.8),
    RoleAdherenceMetric(threshold=0.8)
]

# Test conversations
conversations = [
    ConversationalTestCase(
        scenario="Product inquiry",
        chatbot_role="Sales assistant",
        expected_outcome="Customer receives product information",
        turns=[...]
    ),
    # ... more conversations
]

# Evaluate
result = evaluate(conversations, conv_metrics)

# Analyze results
for test_result in result.test_results:
    print(f"\nConversation: {test_result.name}")
    for metric_name, metric_result in test_result.metrics.items():
        status = "✓" if metric_result.success else "✗"
        print(f"  {status} {metric_name}: {metric_result.score:.2f}")
```
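
When comparing evaluation runs, it can help to reduce the per-metric results to a single pass rate. A small, self-contained sketch of that aggregation (the nested-dict result shape below is hypothetical, not DeepEval's result API):

```python
# Hypothetical result shape: {conversation_name: {metric_name: (score, success)}}
results = {
    "Product inquiry": {
        "Overall Quality": (0.82, True),
        "Turn Relevancy": (0.91, True),
        "Conversation Completeness": (0.64, False),
        "Role Adherence": (0.88, True),
    },
}

def pass_rate(results: dict) -> float:
    """Fraction of (conversation, metric) pairs that met their threshold."""
    outcomes = [ok for metrics in results.values() for _, ok in metrics.values()]
    return sum(outcomes) / len(outcomes)

print(f"Pass rate: {pass_rate(results):.0%}")  # Pass rate: 75%
```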

## Evaluating Chatbot Personality

Use ConversationalGEval to evaluate personality traits:

```python
from deepeval import evaluate
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, TurnParams

# Evaluate empathy
empathy_metric = ConversationalGEval(
    name="Empathy",
    criteria="Evaluate if the chatbot shows empathy and understanding of user emotions",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate professionalism
professionalism_metric = ConversationalGEval(
    name="Professionalism",
    criteria="Evaluate if the chatbot maintains professional tone and language",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate helpfulness
helpfulness_metric = ConversationalGEval(
    name="Helpfulness",
    criteria="Evaluate if the chatbot provides helpful and actionable information",
    evaluation_params=[TurnParams.CONTENT, TurnParams.EXPECTED_OUTCOME],
    threshold=0.8
)

personality_metrics = [empathy_metric, professionalism_metric, helpfulness_metric]

# Evaluate chatbot personality (`conversations` as defined in the previous section)
result = evaluate(conversations, personality_metrics)
```
