
# Conversational Metrics

Metrics designed for evaluating multi-turn conversations, measuring relevancy, completeness, role adherence, and conversational quality. These metrics work with `ConversationalTestCase` objects.

## Imports

```python
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
    MultiTurnMCPUseMetric,
    ConversationalDAGMetric
)
```

## Capabilities

### Conversational G-Eval

G-Eval for conversational test cases, allowing custom evaluation criteria for multi-turn conversations.

```python { .api }
class ConversationalGEval:
    """
    G-Eval for conversational test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[TurnParams]): Parameters to evaluate
    - evaluation_steps (List[str], optional): Steps for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - async_mode (bool): Async mode (default: True)
    - strict_mode (bool): Strict mode (default: False)
    - verbose_mode (bool): Verbose mode (default: False)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation of the score
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create custom conversational metric
metric = ConversationalGEval(
    name="Customer Satisfaction",
    criteria="Evaluate if the conversation leads to customer satisfaction",
    evaluation_params=[
        TurnParams.CONTENT,
        TurnParams.SCENARIO,
        TurnParams.EXPECTED_OUTCOME
    ],
    evaluation_steps=[
        "Analyze if agent addressed customer concerns",
        "Check if agent was polite and professional",
        "Evaluate if the expected outcome was achieved"
    ],
    threshold=0.7
)

# Create conversational test case
conversation = ConversationalTestCase(
    scenario="Customer wants to return a defective product",
    expected_outcome="Customer receives return label and is satisfied",
    turns=[
        Turn(role="user", content="My product arrived broken"),
        Turn(role="assistant", content="I'm sorry to hear that. Can you provide your order number?"),
        Turn(role="user", content="Order #12345"),
        Turn(role="assistant", content="I've initiated a return. You'll receive a prepaid label via email.")
    ]
)

# Evaluate
metric.measure(conversation)
print(f"Customer satisfaction score: {metric.score:.2f}")
print(f"Reason: {metric.reason}")
```
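
The `threshold` and `strict_mode` parameters determine `success` differently: with `strict_mode` enabled the metric is binary, passing only on a perfect score. A toy, library-free sketch of that gating logic (illustrative only, not DeepEval's internals):

```python
# Toy illustration (not DeepEval's internals): how `threshold` and
# `strict_mode` could translate a raw score into a success flag.
def passes(score: float, threshold: float = 0.5, strict_mode: bool = False) -> bool:
    """strict_mode enforces a binary outcome: only a perfect score passes."""
    if strict_mode:
        return score == 1.0
    return score >= threshold

print(passes(0.72, threshold=0.7))                    # True
print(passes(0.72, threshold=0.7, strict_mode=True))  # False
```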

### Turn Relevancy Metric

Measures relevancy of conversation turns to the overall scenario and context.

```python { .api }
class TurnRelevancyMetric:
    """
    Measures relevancy of conversation turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO

    Attributes:
    - score (float): Turn relevancy score (0-1)
    - reason (str): Explanation identifying irrelevant turns
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import TurnRelevancyMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = TurnRelevancyMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Customer inquiring about shipping status",
    turns=[
        Turn(role="user", content="Where is my order?"),
        Turn(role="assistant", content="Let me check. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="By the way, did you know we have a new product line?"),  # Irrelevant
        Turn(role="assistant", content="Your order is out for delivery today")
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Irrelevant turns detected: {metric.reason}")
```

### Conversation Completeness Metric

Evaluates completeness of conversations based on expected outcomes and scenario requirements.

```python { .api }
class ConversationCompletenessMetric:
    """
    Evaluates completeness of conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)
    - async_mode (bool): Async mode (default: True)

    Required Test Case Parameters:
    - TURNS
    - SCENARIO
    - EXPECTED_OUTCOME

    Attributes:
    - score (float): Completeness score (0-1)
    - reason (str): Explanation of what's incomplete
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationCompletenessMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = ConversationCompletenessMetric(threshold=0.8)

# Incomplete conversation
incomplete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345")
        # Conversation ends without address change
    ]
)

metric.measure(incomplete_conversation)

if not metric.success:
    print(f"Incomplete: {metric.reason}")
    # Example: "Expected outcome 'address is updated' was not achieved"

# Complete conversation
complete_conversation = ConversationalTestCase(
    scenario="Customer wants to change shipping address",
    expected_outcome="Shipping address is updated and confirmed",
    turns=[
        Turn(role="user", content="I need to change my shipping address"),
        Turn(role="assistant", content="I can help with that. What's your order number?"),
        Turn(role="user", content="#12345"),
        Turn(role="assistant", content="What's the new address?"),
        Turn(role="user", content="123 Main St, New York, NY 10001"),
        Turn(role="assistant", content="Updated! Your order will ship to 123 Main St, New York, NY 10001")
    ]
)

metric.measure(complete_conversation)
print(f"Completeness: {metric.score:.2f}")
```

### Role Adherence Metric

Measures adherence to assigned role in conversations.

```python { .api }
class RoleAdherenceMetric:
    """
    Measures adherence to assigned role in conversations.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model
    - include_reason (bool): Include reason in output (default: True)

    Required Test Case Parameters:
    - TURNS
    - CHATBOT_ROLE or role defined in test case

    Attributes:
    - score (float): Role adherence score (0-1)
    - reason (str): Explanation of role violations
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

metric = RoleAdherenceMetric(threshold=0.8)

conversation = ConversationalTestCase(
    scenario="Technical support for printer issue",
    chatbot_role="Technical support specialist for printers",
    turns=[
        Turn(role="user", content="My printer won't print"),
        Turn(role="assistant", content="Let me help you troubleshoot. Is the printer powered on?"),
        Turn(role="user", content="Yes, it's on"),
        Turn(role="assistant", content="Check the paper tray and ink levels"),
        Turn(role="user", content="How's the weather today?"),
        Turn(role="assistant", content="The weather is sunny, 75°F.")  # Role violation
    ]
)

metric.measure(conversation)

if not metric.success:
    print(f"Role violation: {metric.reason}")
```

### Multi-Turn MCP Use Metric

Evaluates MCP (Model Context Protocol) usage across multiple conversation turns.

```python { .api }
class MultiTurnMCPUseMetric:
    """
    Evaluates MCP usage across multiple turns.

    Parameters:
    - threshold (float): Success threshold (0-1, default: 0.5)
    - model (Union[str, DeepEvalBaseLLM], optional): Evaluation model

    Required Test Case Parameters:
    - TURNS (with MCP tools/resources/prompts)
    - MCP_SERVERS

    Attributes:
    - score (float): MCP usage score (0-1)
    - reason (str): Explanation of MCP usage quality
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import MultiTurnMCPUseMetric
from deepeval.test_case import (
    ConversationalTestCase,
    Turn,
    MCPServer,
    MCPToolCall
)

metric = MultiTurnMCPUseMetric(threshold=0.7)

conversation = ConversationalTestCase(
    scenario="Research assistant helping with data analysis",
    mcp_servers=[
        MCPServer(
            server_name="data-server",
            available_tools=["query_database", "generate_chart"]
        )
    ],
    turns=[
        Turn(
            role="user",
            content="Show me sales data for Q1"
        ),
        Turn(
            role="assistant",
            content="Here's the Q1 sales data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="query_database",
                    arguments={"query": "SELECT * FROM sales WHERE quarter='Q1'"}
                )
            ]
        ),
        Turn(
            role="user",
            content="Can you create a chart?"
        ),
        Turn(
            role="assistant",
            content="Here's a chart of the data...",
            mcp_tools_called=[
                MCPToolCall(
                    server_name="data-server",
                    tool_name="generate_chart",
                    arguments={"data": [...], "type": "bar"}
                )
            ]
        )
    ]
)

metric.measure(conversation)
```

### Conversational DAG Metric

DAG (Deep Acyclic Graph) metric for conversational flows.

```python { .api }
class ConversationalDAGMetric:
    """
    DAG metric for conversational flows.

    Parameters:
    - name (str): Name of the metric
    - dag (DeepAcyclicGraph): DAG structure for conversation evaluation
    - threshold (float): Success threshold (default: 0.5)

    Required Test Case Parameters:
    - TURNS

    Attributes:
    - score (float): DAG compliance score (0-1)
    - reason (str): Explanation of DAG evaluation
    - success (bool): Whether score meets threshold
    """
```

Usage example:

```python
from deepeval.metrics import ConversationalDAGMetric, DeepAcyclicGraph
from deepeval.test_case import ConversationalTestCase, Turn

# Define conversation flow DAG
conversation_dag = DeepAcyclicGraph()
conversation_dag.add_node("greeting", "Agent greets customer")
conversation_dag.add_node("identify_issue", "Identify customer issue")
conversation_dag.add_node("resolve_issue", "Resolve the issue")
conversation_dag.add_node("confirm_resolution", "Confirm issue is resolved")

conversation_dag.add_edge("greeting", "identify_issue")
conversation_dag.add_edge("identify_issue", "resolve_issue")
conversation_dag.add_edge("resolve_issue", "confirm_resolution")

# Create metric
metric = ConversationalDAGMetric(
    name="Support Flow",
    dag=conversation_dag,
    threshold=0.8
)

# Evaluate conversation against DAG
conversation = ConversationalTestCase(
    scenario="Customer support interaction",
    turns=[
        Turn(role="assistant", content="Hello! How can I help you today?"),  # greeting
        Turn(role="user", content="My order hasn't arrived"),
        Turn(role="assistant", content="Let me look that up for you"),  # identify_issue
        Turn(role="assistant", content="I've located your order and will expedite it"),  # resolve_issue
        Turn(role="assistant", content="Is there anything else I can help with?")  # confirm_resolution
    ]
)

metric.measure(conversation)
```
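
Conceptually, the DAG metric checks that the conversation's stages respect the graph's ordering. A minimal, library-free sketch of that ordering constraint (the node names match the example above; the checker itself is illustrative, not DeepEval's LLM-driven implementation):

```python
# Illustrative only: a toy check that a sequence of conversation stages
# follows the edges of a directed acyclic graph.
EDGES = {
    "greeting": {"identify_issue"},
    "identify_issue": {"resolve_issue"},
    "resolve_issue": {"confirm_resolution"},
    "confirm_resolution": set(),
}

def follows_dag(stages: list) -> bool:
    """Return True if every consecutive pair of stages is a DAG edge."""
    return all(b in EDGES.get(a, set()) for a, b in zip(stages, stages[1:]))

print(follows_dag(["greeting", "identify_issue", "resolve_issue", "confirm_resolution"]))  # True
print(follows_dag(["greeting", "resolve_issue"]))  # False: skips identify_issue
```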

## Comprehensive Conversational Evaluation

Evaluate all conversational aspects together:

```python
from deepeval import evaluate
from deepeval.metrics import (
    ConversationalGEval,
    TurnRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric
)
from deepeval.test_case import ConversationalTestCase, Turn, TurnParams

# Create comprehensive conversational metrics
conv_metrics = [
    ConversationalGEval(
        name="Overall Quality",
        criteria="Evaluate conversation quality and helpfulness",
        evaluation_params=[TurnParams.CONTENT, TurnParams.SCENARIO],
        threshold=0.7
    ),
    TurnRelevancyMetric(threshold=0.8),
    ConversationCompletenessMetric(threshold=0.8),
    RoleAdherenceMetric(threshold=0.8)
]

# Test conversations
conversations = [
    ConversationalTestCase(
        scenario="Product inquiry",
        chatbot_role="Sales assistant",
        expected_outcome="Customer receives product information",
        turns=[...]
    ),
    # ... more conversations
]

# Evaluate
result = evaluate(conversations, conv_metrics)

# Analyze results
for test_result in result.test_results:
    print(f"\nConversation: {test_result.name}")
    for metric_name, metric_result in test_result.metrics.items():
        status = "✓" if metric_result.success else "✗"
        print(f"  {status} {metric_name}: {metric_result.score:.2f}")
```
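
When comparing evaluation runs, it can help to reduce the per-metric results to a single pass rate. A small, self-contained sketch of that aggregation (the nested-dict result shape below is hypothetical, not DeepEval's result API):

```python
# Hypothetical result shape: {conversation_name: {metric_name: (score, success)}}
results = {
    "Product inquiry": {
        "Overall Quality": (0.82, True),
        "Turn Relevancy": (0.91, True),
        "Conversation Completeness": (0.64, False),
        "Role Adherence": (0.88, True),
    },
}

def pass_rate(results: dict) -> float:
    """Fraction of (conversation, metric) pairs that met their threshold."""
    outcomes = [ok for metrics in results.values() for _, ok in metrics.values()]
    return sum(outcomes) / len(outcomes)

print(f"Pass rate: {pass_rate(results):.0%}")  # Pass rate: 75%
```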

## Evaluating Chatbot Personality

Use ConversationalGEval to evaluate personality traits:

```python
from deepeval import evaluate
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, TurnParams

# Evaluate empathy
empathy_metric = ConversationalGEval(
    name="Empathy",
    criteria="Evaluate if the chatbot shows empathy and understanding of user emotions",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate professionalism
professionalism_metric = ConversationalGEval(
    name="Professionalism",
    criteria="Evaluate if the chatbot maintains professional tone and language",
    evaluation_params=[TurnParams.CONTENT],
    threshold=0.8
)

# Evaluate helpfulness
helpfulness_metric = ConversationalGEval(
    name="Helpfulness",
    criteria="Evaluate if the chatbot provides helpful and actionable information",
    evaluation_params=[TurnParams.CONTENT, TurnParams.EXPECTED_OUTCOME],
    threshold=0.8
)

personality_metrics = [empathy_metric, professionalism_metric, helpfulness_metric]

# Evaluate chatbot personality (`conversations` as defined in the previous section)
result = evaluate(conversations, personality_metrics)
```
