# Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more. Easily benchmark any LLM in under 10 lines of code.

## Imports

```python
from deepeval.benchmarks import (
    # Main benchmarks
    MMLU,
    HellaSwag,
    GSM8K,
    HumanEval,
    BigBenchHard,
    DROP,
    TruthfulQA,
    SQuAD,
    MathQA,
    LogiQA,
    BoolQ,
    ARC,
    BBQ,
    LAMBADA,
    Winogrande,
    EquityMedQA,
    IFEval,
    # Modes and tasks
    ARCMode,
    TruthfulQAMode
)
```

## Capabilities

### MMLU (Massive Multitask Language Understanding)

Comprehensive benchmark testing knowledge across 57 subjects.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems per task

    Methods:
    - evaluate(model: DeepEvalBaseLLM) -> BenchmarkResult
    """
```

Usage:

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel

# Create model
model = GPTModel(model="gpt-4")

# Run full benchmark
benchmark = MMLU()
result = benchmark.evaluate(model)

print(f"Overall Score: {result.overall_score}")
print(f"Results by task: {result.task_scores}")
```

### HellaSwag

Commonsense reasoning benchmark.

```python { .api }
class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.

    Parameters:
    - tasks (List[HellaSwagTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```

### GSM8K

Grade School Math benchmark.

```python { .api }
class GSM8K:
    """
    Grade School Math 8K benchmark.

    Parameters:
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```

### HumanEval

Code generation benchmark.

```python { .api }
class HumanEval:
    """
    HumanEval benchmark for code generation.

    Parameters:
    - tasks (List[HumanEvalTask], optional): Specific tasks
    - n_problems (int, optional): Number of problems
    """
```

### BigBenchHard

Challenging reasoning tasks from Big Bench.

```python { .api }
class BigBenchHard:
    """
    Big Bench Hard benchmark.

    Parameters:
    - tasks (List[BigBenchHardTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    """
```

### Other Benchmarks

```python { .api }
class DROP:
    """Discrete Reasoning Over Paragraphs benchmark."""

class TruthfulQA:
    """TruthfulQA benchmark for truthfulness."""

class SQuAD:
    """Stanford Question Answering Dataset."""

class MathQA:
    """Math Question Answering benchmark."""

class LogiQA:
    """Logical reasoning benchmark."""

class BoolQ:
    """Boolean Questions benchmark."""

class ARC:
    """AI2 Reasoning Challenge benchmark."""

class BBQ:
    """Bias Benchmark for QA."""

class LAMBADA:
    """LAMBADA benchmark for language understanding."""

class Winogrande:
    """Winogrande benchmark for commonsense reasoning."""

class EquityMedQA:
    """Equity in Medical QA benchmark."""

class IFEval:
    """Instruction Following Evaluation benchmark."""
```
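
When running several of these suites back to back, each produces one overall score, and a small helper can gather them into a readable summary. A minimal sketch in plain Python (the benchmark names are real, but the numbers are illustrative placeholders, not real results):

```python
# Collect one overall score per benchmark suite into a simple report.
# Placeholder numbers; real scores come from benchmark.evaluate(model).
results = {
    "MMLU": 0.71,
    "GSM8K": 0.88,
    "HumanEval": 0.67,
}

# Pad names to the longest benchmark name so scores line up in a column.
width = max(len(name) for name in results)
for name, score in results.items():
    print(f"{name:<{width}}  {score:.1%}")
```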

## Usage Examples

### Simple Benchmark Evaluation

```python
from deepeval.benchmarks import GSM8K
from deepeval.models import GPTModel

# Evaluate on GSM8K
model = GPTModel(model="gpt-4")
benchmark = GSM8K(n_problems=100)

result = benchmark.evaluate(model)
print(f"Score: {result.overall_score}")
```

### Compare Multiple Models

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel, AnthropicModel

models = {
    "GPT-4": GPTModel(model="gpt-4"),
    "Claude": AnthropicModel(model="claude-3-5-sonnet-20241022")
}

benchmark = MMLU(n_problems=50)

for name, model in models.items():
    result = benchmark.evaluate(model)
    print(f"{name}: {result.overall_score:.2f}")
```
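
Once the loop above has collected a score per model, ranking them is plain dictionary work. A minimal sketch (the scores here are illustrative placeholders, not real benchmark results):

```python
# Rank models by overall benchmark score, highest first.
# Placeholder values; in practice, fill this dict from result.overall_score.
scores = {
    "GPT-4": 0.86,
    "Claude": 0.88,
}

# Sort the (name, score) pairs by score, descending.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.2f}")
```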

### Specific Task Evaluation

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Evaluate only on specific subjects
benchmark = MMLU(
    tasks=[MMLUTask.MATHEMATICS, MMLUTask.COMPUTER_SCIENCE],
    n_shots=5
)

result = benchmark.evaluate(model)
```

### Save Benchmark Results

```python
from deepeval.benchmarks import HumanEval

benchmark = HumanEval()
result = benchmark.evaluate(model)

# Save to file
result.save("./benchmark_results/humaneval_gpt4.json")
```
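
A saved result can later be reloaded for comparison or reporting with the standard library alone. The on-disk schema below is an assumption for illustration (only `overall_score` and `task_scores` appear elsewhere in this page), not a documented format:

```python
import json
import tempfile
from pathlib import Path

# Assumed saved-result shape; the real schema may differ.
saved = {"overall_score": 0.82, "task_scores": {"python": 0.82}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "humaneval_gpt4.json"
    # Stand-in for the file written by result.save(...)
    path.write_text(json.dumps(saved, indent=2))

    # Reload later for comparison or reporting
    loaded = json.loads(path.read_text())
    print(loaded["overall_score"])
```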