# Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more. Easily benchmark any LLM in under 10 lines of code.

## Imports

```python
from deepeval.benchmarks import (
    # Main benchmarks
    MMLU,
    HellaSwag,
    GSM8K,
    HumanEval,
    BigBenchHard,
    DROP,
    TruthfulQA,
    SQuAD,
    MathQA,
    LogiQA,
    BoolQ,
    ARC,
    BBQ,
    LAMBADA,
    Winogrande,
    EquityMedQA,
    IFEval,
    # Modes and tasks
    ARCMode,
    TruthfulQAMode,
)
```

## Capabilities

### MMLU (Massive Multitask Language Understanding)

Comprehensive benchmark testing knowledge across 57 subjects.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems per task

    Methods:
    - evaluate(model: DeepEvalBaseLLM) -> BenchmarkResult
    """
```

Usage:

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel

# Create the model to benchmark
model = GPTModel(model="gpt-4")

# Run the full benchmark
benchmark = MMLU()
result = benchmark.evaluate(model)

print(f"Overall Score: {result.overall_score}")
print(f"Results by task: {result.task_scores}")
```
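As background on the `n_shots` parameter: few-shot benchmarking prepends `n_shots` worked examples to each prompt before the question being graded. The sketch below is a hypothetical illustration of that prompt construction, not DeepEval's internal code; the exact format DeepEval uses may differ.

```python
# Hypothetical sketch of few-shot prompting (not DeepEval's internal
# format): prepend worked examples, then pose the new question.
def format_question(question: str, choices: list[str]) -> str:
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    return "\n".join(lines)

def build_few_shot_prompt(shots, question, choices) -> str:
    parts = []
    for shot_q, shot_choices, shot_answer in shots:
        parts.append(format_question(shot_q, shot_choices) + f"\nAnswer: {shot_answer}")
    # The question being graded ends with a bare "Answer:" for the model
    # to complete.
    parts.append(format_question(question, choices) + "\nAnswer:")
    return "\n\n".join(parts)

shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_few_shot_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"])
print(prompt)
```

Higher `n_shots` generally improves scores on pattern-sensitive tasks but lengthens every prompt, so it trades accuracy against cost.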

### HellaSwag

Commonsense reasoning benchmark.

```python { .api }
class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.

    Parameters:
    - tasks (List[HellaSwagTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```

### GSM8K

Grade School Math benchmark.

```python { .api }
class GSM8K:
    """
    Grade School Math 8K benchmark.

    Parameters:
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```
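GSM8K reference solutions end with a final line of the form `#### <answer>`, so grading reduces to comparing final numeric answers. The following is a rough sketch of that idea based on the dataset's convention, not DeepEval's exact scoring code:

```python
import re

def extract_final_answer(text: str):
    """Return the number after '####' (the GSM8K convention), falling
    back to the last number in the text; None if there is none."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match:
        return float(match.group(1).replace(",", ""))
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

reference = "Natalia sold 48 + 24 = 72 clips.\n#### 72"
prediction = "48 in April plus 24 in May gives 72 clips in total."
print(extract_final_answer(prediction) == extract_final_answer(reference))  # True
```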

### HumanEval

Code generation benchmark.

```python { .api }
class HumanEval:
    """
    HumanEval benchmark for code generation.

    Parameters:
    - tasks (List[HumanEvalTask], optional): Specific tasks
    - n_problems (int, optional): Number of problems
    """
```
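HumanEval scores are conventionally reported as pass@k: the probability that at least one of k sampled completions per problem passes its unit tests. For reference, the unbiased estimator from the original HumanEval paper, given n samples of which c pass, is `1 - C(n-c, k) / C(n, k)`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which
    passed the tests, k <= n draws."""
    if n - c < k:
        # Fewer failures than draws: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```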

### BigBenchHard

Challenging reasoning tasks from Big Bench.

```python { .api }
class BigBenchHard:
    """
    Big Bench Hard benchmark.

    Parameters:
    - tasks (List[BigBenchHardTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    """
```

### Other Benchmarks

```python { .api }
class DROP:
    """Discrete Reasoning Over Paragraphs benchmark."""

class TruthfulQA:
    """TruthfulQA benchmark for truthfulness."""

class SQuAD:
    """Stanford Question Answering Dataset."""

class MathQA:
    """Math Question Answering benchmark."""

class LogiQA:
    """Logical reasoning benchmark."""

class BoolQ:
    """Boolean Questions benchmark."""

class ARC:
    """AI2 Reasoning Challenge benchmark."""

class BBQ:
    """Bias Benchmark for QA."""

class LAMBADA:
    """LAMBADA benchmark for language understanding."""

class Winogrande:
    """Winogrande benchmark for commonsense reasoning."""

class EquityMedQA:
    """Equity in Medical QA benchmark."""

class IFEval:
    """Instruction Following Evaluation benchmark."""
```

## Usage Examples

### Simple Benchmark Evaluation

```python
from deepeval.benchmarks import GSM8K
from deepeval.models import GPTModel

# Evaluate on a 100-problem subset of GSM8K
model = GPTModel(model="gpt-4")
benchmark = GSM8K(n_problems=100)

result = benchmark.evaluate(model)
print(f"Score: {result.overall_score}")
```

### Compare Multiple Models

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel, AnthropicModel

models = {
    "GPT-4": GPTModel(model="gpt-4"),
    "Claude": AnthropicModel(model="claude-3-5-sonnet-20241022"),
}

benchmark = MMLU(n_problems=50)

for name, model in models.items():
    result = benchmark.evaluate(model)
    print(f"{name}: {result.overall_score:.2f}")
```

### Specific Task Evaluation

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from deepeval.models import GPTModel

model = GPTModel(model="gpt-4")

# Evaluate only on specific subjects
benchmark = MMLU(
    tasks=[MMLUTask.MATHEMATICS, MMLUTask.COMPUTER_SCIENCE],
    n_shots=5,
)

result = benchmark.evaluate(model)
```

### Save Benchmark Results

```python
from deepeval.benchmarks import HumanEval
from deepeval.models import GPTModel

model = GPTModel(model="gpt-4")
benchmark = HumanEval()
result = benchmark.evaluate(model)

# Save to file
result.save("./benchmark_results/humaneval_gpt4.json")
```
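Saved result files can then be post-processed with the standard library alone. The sketch below assumes each saved JSON contains `model` and `overall_score` fields (the actual schema written by `save` may differ), and it writes stand-in files so it runs without an API key:

```python
import json
from pathlib import Path

results_dir = Path("./benchmark_results")
results_dir.mkdir(parents=True, exist_ok=True)

# Stand-in result files; assumed schema with "model" and "overall_score".
(results_dir / "humaneval_gpt4.json").write_text(
    json.dumps({"model": "gpt-4", "overall_score": 0.84})
)
(results_dir / "humaneval_claude.json").write_text(
    json.dumps({"model": "claude-3-5-sonnet", "overall_score": 0.88})
)

# Load every saved result and rank models best-first.
scores = {}
for path in results_dir.glob("*.json"):
    data = json.loads(path.read_text())
    scores[data["model"]] = data["overall_score"]

for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.2f}")
```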
