# Multimodal Metrics

Metrics for evaluating multimodal LLM outputs involving text and images. These metrics assess image generation quality, visual question answering, image coherence, and multimodal RAG systems.

## Imports

```python
from deepeval.metrics import (
    MultimodalGEval,
    TextToImageMetric,
    ImageEditingMetric,
    ImageCoherenceMetric,
    ImageHelpfulnessMetric,
    ImageReferenceMetric,
    MultimodalContextualRecallMetric,
    MultimodalContextualRelevancyMetric,
    MultimodalContextualPrecisionMetric,
    MultimodalAnswerRelevancyMetric,
    MultimodalFaithfulnessMetric,
    MultimodalToolCorrectnessMetric,
)
```
## Capabilities

### Multimodal G-Eval

Applies G-Eval's criteria-driven LLM evaluation to multimodal test cases, using custom evaluation criteria or explicit evaluation steps.

```python { .api }
class MultimodalGEval:
    """
    G-Eval for multimodal test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[MLLMTestCaseParams]): Parameters to evaluate
    - evaluation_steps (List[str], optional): Steps for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Multimodal evaluation model
    - async_mode (bool): Async mode (default: True)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```
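
A minimal usage sketch; the criteria text, threshold, and `flowchart.png` file are illustrative:

```python
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCase, MLLMImage, MLLMTestCaseParams

# Custom criteria judged by the multimodal evaluation model
diagram_accuracy = MultimodalGEval(
    name="Diagram Accuracy",
    criteria="Determine whether the image in the actual output accurately depicts the process described in the input.",
    evaluation_params=[
        MLLMTestCaseParams.INPUT,
        MLLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.6,
)

# MLLM test cases hold lists of strings and MLLMImage objects
test_case = MLLMTestCase(
    input=["Draw a flowchart of the password-reset process."],
    actual_output=[MLLMImage(url="flowchart.png", local=True)],  # hypothetical local file
)

diagram_accuracy.measure(test_case)
print(diagram_accuracy.score, diagram_accuracy.reason)
```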
### Text-to-Image Metric

Evaluates text-to-image generation quality, scoring how well the generated image matches the input prompt.

```python { .api }
class TextToImageMetric:
    """
    Evaluates text-to-image generation quality.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    - include_reason (bool): Include reason (default: True)

    Required Test Case Parameters:
    - INPUT (text prompt)
    - ACTUAL_OUTPUT (generated image)

    Attributes:
    - score (float): Image quality score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```
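
A short sketch of scoring a single generation; the prompt, file name, and threshold are placeholders:

```python
from deepeval.metrics import TextToImageMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

metric = TextToImageMetric(threshold=0.7)  # illustrative threshold

# Input is the text prompt; actual output is the generated image
test_case = MLLMTestCase(
    input=["A red bicycle leaning against a brick wall at sunset."],
    actual_output=[MLLMImage(url="generated_bicycle.png", local=True)],  # hypothetical file
)

metric.measure(test_case)
print(f"score={metric.score:.2f} success={metric.success}")
print(metric.reason)
```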
### Image Coherence Metric

Evaluates whether images in a response are coherent with their surrounding context.

```python { .api }
class ImageCoherenceMetric:
    """
    Evaluates coherence of images in context.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT (images)
    - CONTEXT

    Attributes:
    - score (float): Coherence score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```
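
A sketch of an interleaved text-and-image response checked for coherence; the file name and context strings are placeholders:

```python
from deepeval.metrics import ImageCoherenceMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

metric = ImageCoherenceMetric(threshold=0.6)

# Actual output interleaves explanatory text with illustrative images
test_case = MLLMTestCase(
    input=["Show me how to attach the side panels of the bookshelf."],
    actual_output=[
        "Step 1: Align the side panel with the base as shown.",
        MLLMImage(url="step1.png", local=True),  # hypothetical file
    ],
    context=["Assembly instructions for the model-X bookshelf."],
)

metric.measure(test_case)
print(metric.score, metric.reason)
```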
### Image Helpfulness Metric

Evaluates whether images in a response actually help answer the user's input.

```python { .api }
class ImageHelpfulnessMetric:
    """
    Evaluates helpfulness of images.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT (response with images)

    Attributes:
    - score (float): Helpfulness score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```
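
Since this metric takes the same interleaved test case shape as image coherence, the two can be run together; a sketch with placeholder content:

```python
from deepeval.metrics import ImageCoherenceMetric, ImageHelpfulnessMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

test_case = MLLMTestCase(
    input=["How do I replace the printer's toner cartridge?"],
    actual_output=[
        "Open the front cover and pull the cartridge straight out.",
        MLLMImage(url="toner_removal.png", local=True),  # hypothetical file
    ],
    context=["Printer maintenance guide."],
)

# Run both image metrics over the same response
for metric in [ImageCoherenceMetric(threshold=0.6), ImageHelpfulnessMetric(threshold=0.6)]:
    metric.measure(test_case)
    print(f"{metric.__class__.__name__}: {metric.score:.2f}")
```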
### Multimodal RAG Metrics

RAG metrics adapted for multimodal inputs and outputs.

```python { .api }
class MultimodalAnswerRelevancyMetric:
    """
    Answer relevancy for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalFaithfulnessMetric:
    """
    Faithfulness for multimodal outputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualRecallMetric:
    """
    Contextual recall for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualRelevancyMetric:
    """
    Contextual relevancy for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualPrecisionMetric:
    """
    Contextual precision for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """
```

Usage example:

```python
from deepeval.metrics import (
    MultimodalAnswerRelevancyMetric,
    MultimodalFaithfulnessMetric,
)
from deepeval.test_case import MLLMTestCase, MLLMImage

# Visual QA with retrieval
test_case = MLLMTestCase(
    input=[
        "What safety equipment is visible in this image?",
        MLLMImage(url="construction_site.jpg", local=True),
    ],
    actual_output=["Hard hats, safety vests, and steel-toed boots are visible."],
    retrieval_context=[
        "Safety requirements: hard hats, safety vests, steel-toed boots",
        MLLMImage(url="safety_guide.jpg"),
    ],
)

metrics = [
    MultimodalAnswerRelevancyMetric(threshold=0.7),
    MultimodalFaithfulnessMetric(threshold=0.8),
]

for metric in metrics:
    metric.measure(test_case)
    print(f"{metric.__class__.__name__}: {metric.score:.2f}")
```
### Multimodal Tool Correctness

Tool correctness for multimodal agent workflows: verifies that the tools actually called match the expected tools.

```python { .api }
class MultimodalToolCorrectnessMetric:
    """
    Tool correctness for multimodal contexts.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - TOOLS_CALLED
    - EXPECTED_TOOLS
    """
```
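
A sketch comparing called tools against expected tools; the tool name and test case content are hypothetical:

```python
from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, MLLMImage, ToolCall

metric = MultimodalToolCorrectnessMetric()

test_case = MLLMTestCase(
    input=[
        "What landmark is shown in this photo?",
        MLLMImage(url="photo.jpg", local=True),  # hypothetical file
    ],
    actual_output=["This is the Eiffel Tower."],
    tools_called=[ToolCall(name="reverse_image_search")],  # hypothetical tool
    expected_tools=[ToolCall(name="reverse_image_search")],
)

metric.measure(test_case)
print(metric.score)  # 1.0 when called tools match expected tools
```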