# Benchmarks

Pre-built benchmarks for evaluating LLMs on standard datasets like MMLU, HellaSwag, GSM8K, HumanEval, and more. Easily benchmark any LLM in under 10 lines of code.

## Imports

```python
from deepeval.benchmarks import (
    # Main benchmarks
    MMLU,
    HellaSwag,
    GSM8K,
    HumanEval,
    BigBenchHard,
    DROP,
    TruthfulQA,
    SQuAD,
    MathQA,
    LogiQA,
    BoolQ,
    ARC,
    BBQ,
    LAMBADA,
    Winogrande,
    EquityMedQA,
    IFEval,
    # Modes and tasks
    ARCMode,
    TruthfulQAMode,
)
```

## Capabilities

### MMLU (Massive Multitask Language Understanding)

Comprehensive benchmark testing knowledge across 57 subjects.

```python { .api }
class MMLU:
    """
    Massive Multitask Language Understanding benchmark.

    Parameters:
    - tasks (List[MMLUTask], optional): Specific tasks to evaluate
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems per task

    Methods:
    - evaluate(model: DeepEvalBaseLLM) -> BenchmarkResult
    """
```

Usage:

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel

# Create the model to benchmark
model = GPTModel(model="gpt-4")

# Run the full benchmark
benchmark = MMLU()
result = benchmark.evaluate(model)

print(f"Overall Score: {result.overall_score}")
print(f"Results by task: {result.task_scores}")
```
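As background on the `n_shots` parameter: few-shot benchmarking prepends `n_shots` worked examples to each prompt before the question being graded. The sketch below is a hypothetical illustration of that prompt construction, not DeepEval's internal code; the exact format DeepEval uses may differ.

```python
# Hypothetical sketch of few-shot prompting (not DeepEval's internal
# format): prepend worked examples, then pose the new question.
def format_question(question: str, choices: list[str]) -> str:
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    return "\n".join(lines)

def build_few_shot_prompt(shots, question, choices) -> str:
    parts = []
    for shot_q, shot_choices, shot_answer in shots:
        parts.append(format_question(shot_q, shot_choices) + f"\nAnswer: {shot_answer}")
    # The question being graded ends with a bare "Answer:" for the model
    # to complete.
    parts.append(format_question(question, choices) + "\nAnswer:")
    return "\n\n".join(parts)

shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_few_shot_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"])
print(prompt)
```

Higher `n_shots` generally improves scores on pattern-sensitive tasks but lengthens every prompt, so it trades accuracy against cost.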

### HellaSwag

Commonsense reasoning benchmark.

```python { .api }
class HellaSwag:
    """
    HellaSwag benchmark for commonsense reasoning.

    Parameters:
    - tasks (List[HellaSwagTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```

### GSM8K

Grade School Math benchmark.

```python { .api }
class GSM8K:
    """
    Grade School Math 8K benchmark.

    Parameters:
    - n_shots (int, optional): Number of few-shot examples
    - n_problems (int, optional): Number of problems
    """
```
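GSM8K reference solutions end with a final line of the form `#### <answer>`, so grading reduces to comparing final numeric answers. The following is a rough sketch of that idea based on the dataset's convention, not DeepEval's exact scoring code:

```python
import re

def extract_final_answer(text: str):
    """Return the number after '####' (the GSM8K convention), falling
    back to the last number in the text; None if there is none."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match:
        return float(match.group(1).replace(",", ""))
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

reference = "Natalia sold 48 + 24 = 72 clips.\n#### 72"
prediction = "48 in April plus 24 in May gives 72 clips in total."
print(extract_final_answer(prediction) == extract_final_answer(reference))  # True
```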

### HumanEval

Code generation benchmark.

```python { .api }
class HumanEval:
    """
    HumanEval benchmark for code generation.

    Parameters:
    - tasks (List[HumanEvalTask], optional): Specific tasks
    - n_problems (int, optional): Number of problems
    """
```
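HumanEval scores are conventionally reported as pass@k: the probability that at least one of k sampled completions per problem passes its unit tests. For reference, the unbiased estimator from the original HumanEval paper, given n samples of which c pass, is `1 - C(n-c, k) / C(n, k)`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which
    passed the tests, k <= n draws."""
    if n - c < k:
        # Fewer failures than draws: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```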

### BigBenchHard

Challenging reasoning tasks from Big Bench.

```python { .api }
class BigBenchHard:
    """
    Big Bench Hard benchmark.

    Parameters:
    - tasks (List[BigBenchHardTask], optional): Specific tasks
    - n_shots (int, optional): Number of few-shot examples
    """
```

### Other Benchmarks

```python { .api }
class DROP:
    """Discrete Reasoning Over Paragraphs benchmark."""

class TruthfulQA:
    """TruthfulQA benchmark for truthfulness."""

class SQuAD:
    """Stanford Question Answering Dataset."""

class MathQA:
    """Math Question Answering benchmark."""

class LogiQA:
    """Logical reasoning benchmark."""

class BoolQ:
    """Boolean Questions benchmark."""

class ARC:
    """AI2 Reasoning Challenge benchmark."""

class BBQ:
    """Bias Benchmark for QA."""

class LAMBADA:
    """LAMBADA benchmark for language understanding."""

class Winogrande:
    """Winogrande benchmark for commonsense reasoning."""

class EquityMedQA:
    """Equity in Medical QA benchmark."""

class IFEval:
    """Instruction Following Evaluation benchmark."""
```

## Usage Examples

### Simple Benchmark Evaluation

```python
from deepeval.benchmarks import GSM8K
from deepeval.models import GPTModel

# Evaluate on a 100-problem subset of GSM8K
model = GPTModel(model="gpt-4")
benchmark = GSM8K(n_problems=100)

result = benchmark.evaluate(model)
print(f"Score: {result.overall_score}")
```

### Compare Multiple Models

```python
from deepeval.benchmarks import MMLU
from deepeval.models import GPTModel, AnthropicModel

models = {
    "GPT-4": GPTModel(model="gpt-4"),
    "Claude": AnthropicModel(model="claude-3-5-sonnet-20241022"),
}

benchmark = MMLU(n_problems=50)

for name, model in models.items():
    result = benchmark.evaluate(model)
    print(f"{name}: {result.overall_score:.2f}")
```

### Specific Task Evaluation

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from deepeval.models import GPTModel

model = GPTModel(model="gpt-4")

# Evaluate only on specific subjects
benchmark = MMLU(
    tasks=[MMLUTask.MATHEMATICS, MMLUTask.COMPUTER_SCIENCE],
    n_shots=5,
)

result = benchmark.evaluate(model)
```

### Save Benchmark Results

```python
from deepeval.benchmarks import HumanEval
from deepeval.models import GPTModel

model = GPTModel(model="gpt-4")
benchmark = HumanEval()
result = benchmark.evaluate(model)

# Save to file
result.save("./benchmark_results/humaneval_gpt4.json")
```
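Saved result files can then be post-processed with the standard library alone. The sketch below assumes each saved JSON contains `model` and `overall_score` fields (the actual schema written by `save` may differ), and it writes stand-in files so it runs without an API key:

```python
import json
from pathlib import Path

results_dir = Path("./benchmark_results")
results_dir.mkdir(parents=True, exist_ok=True)

# Stand-in result files; assumed schema with "model" and "overall_score".
(results_dir / "humaneval_gpt4.json").write_text(
    json.dumps({"model": "gpt-4", "overall_score": 0.84})
)
(results_dir / "humaneval_claude.json").write_text(
    json.dumps({"model": "claude-3-5-sonnet", "overall_score": 0.88})
)

# Load every saved result and rank models best-first.
scores = {}
for path in results_dir.glob("*.json"):
    data = json.loads(path.read_text())
    scores[data["model"]] = data["overall_score"]

for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.2f}")
```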
