
# Multimodal Metrics

Metrics for evaluating multimodal LLM outputs involving text and images. These metrics assess image generation quality, visual question answering, image coherence, and multimodal RAG systems.

## Imports

```python
from deepeval.metrics import (
    MultimodalGEval,
    TextToImageMetric,
    ImageEditingMetric,
    ImageCoherenceMetric,
    ImageHelpfulnessMetric,
    ImageReferenceMetric,
    MultimodalContextualRecallMetric,
    MultimodalContextualRelevancyMetric,
    MultimodalContextualPrecisionMetric,
    MultimodalAnswerRelevancyMetric,
    MultimodalFaithfulnessMetric,
    MultimodalToolCorrectnessMetric
)
```

## Capabilities

### Multimodal G-Eval

G-Eval for multimodal test cases with custom evaluation criteria.

```python { .api }
class MultimodalGEval:
    """
    G-Eval for multimodal test cases.

    Parameters:
    - name (str): Name of the metric
    - criteria (str): Evaluation criteria
    - evaluation_params (List[MLLMTestCaseParams]): Parameters to evaluate
    - evaluation_steps (List[str], optional): Steps for evaluation
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Multimodal evaluation model
    - async_mode (bool): Async mode (default: True)

    Attributes:
    - score (float): Evaluation score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```

### Text-to-Image Metric

Evaluates text-to-image generation quality.

```python { .api }
class TextToImageMetric:
    """
    Evaluates text-to-image generation quality.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    - include_reason (bool): Include reason (default: True)

    Required Test Case Parameters:
    - INPUT (text prompt)
    - ACTUAL_OUTPUT (generated image)

    Attributes:
    - score (float): Image quality score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```

### Image Coherence Metric

Evaluates coherence of images in context.

```python { .api }
class ImageCoherenceMetric:
    """
    Evaluates coherence of images in context.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT (images)
    - CONTEXT

    Attributes:
    - score (float): Coherence score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```

### Image Helpfulness Metric

Evaluates helpfulness of images in responses.

```python { .api }
class ImageHelpfulnessMetric:
    """
    Evaluates helpfulness of images.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - INPUT
    - ACTUAL_OUTPUT (response with images)

    Attributes:
    - score (float): Helpfulness score (0-1)
    - reason (str): Explanation
    - success (bool): Whether score meets threshold
    """
```

### Multimodal RAG Metrics

RAG metrics adapted for multimodal inputs and outputs.

```python { .api }
class MultimodalAnswerRelevancyMetric:
    """
    Answer relevancy for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalFaithfulnessMetric:
    """
    Faithfulness for multimodal outputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualRecallMetric:
    """
    Contextual recall for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualRelevancyMetric:
    """
    Contextual relevancy for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """

class MultimodalContextualPrecisionMetric:
    """
    Contextual precision for multimodal inputs.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model
    """
```

Usage example:

```python
from deepeval.metrics import (
    MultimodalAnswerRelevancyMetric,
    MultimodalFaithfulnessMetric
)
from deepeval.test_case import MLLMTestCase, MLLMImage

# Visual QA with retrieval
test_case = MLLMTestCase(
    input=[
        "What safety equipment is visible in this image?",
        MLLMImage(url="construction_site.jpg", local=True)
    ],
    actual_output=["Hard hats, safety vests, and steel-toed boots are visible."],
    retrieval_context=[
        "Safety requirements: hard hats, safety vests, steel-toed boots",
        MLLMImage(url="safety_guide.jpg")
    ]
)

metrics = [
    MultimodalAnswerRelevancyMetric(threshold=0.7),
    MultimodalFaithfulnessMetric(threshold=0.8)
]

for metric in metrics:
    metric.measure(test_case)
    print(f"{metric.__class__.__name__}: {metric.score:.2f}")
```

### Multimodal Tool Correctness

Tool correctness for multimodal contexts.

```python { .api }
class MultimodalToolCorrectnessMetric:
    """
    Tool correctness for multimodal contexts.

    Parameters:
    - threshold (float): Success threshold (default: 0.5)
    - model (Union[str, DeepEvalBaseMLLM], optional): Evaluation model

    Required Test Case Parameters:
    - TOOLS_CALLED
    - EXPECTED_TOOLS
    """
```