# Synthesizer

Synthetic test data generation using various evolution strategies (reasoning, multi-context, concretizing, etc.) to create diverse and challenging test cases. Generate goldens from documents, from contexts, or from scratch.

## Imports

```python
from deepeval.synthesizer import (
    Synthesizer,
    Evolution,
    PromptEvolution,
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
    ContextConstructionConfig
)
```

## Capabilities

### Synthesizer Class

Main class for generating synthetic test data.

```python { .api }
class Synthesizer:
    """
    Generates synthetic test data and goldens.

    Parameters:
    - model (Union[str, DeepEvalBaseLLM], optional): Model for generation
    - async_mode (bool): Async mode (default: True)
    - max_concurrent (int): Max concurrent tasks (default: 100)
    - filtration_config (FiltrationConfig, optional): Filtration configuration
    - evolution_config (EvolutionConfig, optional): Evolution configuration
    - styling_config (StylingConfig, optional): Styling configuration
    - cost_tracking (bool): Track API costs (default: False)

    Methods:
    - generate_goldens_from_docs(document_paths, **kwargs) -> List[Golden]
    - a_generate_goldens_from_docs(document_paths, **kwargs) -> List[Golden]
    - generate_goldens_from_contexts(contexts, **kwargs) -> List[Golden]
    - a_generate_goldens_from_contexts(contexts, **kwargs) -> List[Golden]
    - generate_goldens_from_scratch(num_goldens, **kwargs) -> List[Golden]
    - a_generate_goldens_from_scratch(num_goldens, **kwargs) -> List[Golden]
    - generate_goldens_from_goldens(goldens, **kwargs) -> List[Golden]
    - a_generate_goldens_from_goldens(goldens, **kwargs) -> List[Golden]
    - save_as(file_type, directory, file_name=None): Save synthetic goldens to disk
    - to_pandas() -> pd.DataFrame: Convert goldens to a pandas DataFrame
    """
```

### Evolution Types

Input evolution strategies for creating diverse test cases.

```python { .api }
class Evolution:
    """
    Enum of input evolution strategies.

    Values:
    - REASONING: Add reasoning complexity
    - MULTICONTEXT: Require multiple contexts
    - CONCRETIZING: Make more concrete/specific
    - CONSTRAINED: Add constraints
    - COMPARATIVE: Add comparisons
    - HYPOTHETICAL: Make hypothetical
    - IN_BREADTH: Broaden scope
    """

class PromptEvolution:
    """
    Enum of prompt evolution strategies (used when generating from scratch).

    Values:
    - REASONING
    - CONCRETIZING
    - CONSTRAINED
    - COMPARATIVE
    - HYPOTHETICAL
    - IN_BREADTH
    """
```
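
The weights passed to `EvolutionConfig(evolutions=...)` are easiest to understand as relative sampling probabilities: each evolution round draws one strategy in proportion to its weight. The following pure-Python sketch illustrates that interpretation only; it is not deepeval's internal implementation, and the string names stand in for the enum members.

```python
import random

# Hypothetical weight map mirroring EvolutionConfig(evolutions=...).
evolution_weights = {
    "REASONING": 0.3,
    "MULTICONTEXT": 0.2,
    "CONCRETIZING": 0.2,
    "CONSTRAINED": 0.15,
    "COMPARATIVE": 0.15,
}

def pick_evolutions(weights, num_evolutions, rng=random):
    """Pick one evolution per round, proportional to its weight."""
    names = list(weights)
    probs = list(weights.values())
    return [rng.choices(names, weights=probs, k=1)[0] for _ in range(num_evolutions)]

rng = random.Random(0)
chosen = pick_evolutions(evolution_weights, num_evolutions=2, rng=rng)
print(chosen)  # two strategy names sampled by weight
```

Because the weights are relative, they do not need to sum to 1.0.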

### Configuration Classes

```python { .api }
class FiltrationConfig:
    """
    Configuration for synthetic data filtration.

    Parameters:
    - synthetic_input_quality_threshold (float): Quality threshold (default: 0.5)
    - max_quality_retries (int): Max retries for quality (default: 3)
    - critic_model (Union[str, DeepEvalBaseLLM], optional): Critic model for quality assessment
    """

class EvolutionConfig:
    """
    Configuration for input evolution.

    Parameters:
    - num_evolutions (int): Number of evolution iterations (default: 1)
    - evolutions (Dict[Evolution, float]): Evolution types and weights (default: equal distribution)
    """

class StylingConfig:
    """
    Configuration for output styling.

    Parameters:
    - scenario (str, optional): Scenario description
    - task (str, optional): Task description
    - input_format (str, optional): Input format specification
    - expected_output_format (str, optional): Expected output format
    """

class ContextConstructionConfig:
    """
    Configuration for context construction from documents.

    Parameters:
    - embedder (Union[str, DeepEvalBaseEmbeddingModel], optional): Embedding model
    - critic_model (Union[str, DeepEvalBaseLLM], optional): Critic model
    - encoding (str, optional): Text encoding
    - max_contexts_per_document (int): Max contexts per doc (default: 3)
    - min_contexts_per_document (int): Min contexts per doc (default: 1)
    - max_context_length (int): Max context length in chunks (default: 3)
    - min_context_length (int): Min context length in chunks (default: 1)
    - chunk_size (int): Chunk size in characters (default: 1024)
    - chunk_overlap (int): Chunk overlap in characters (default: 0)
    - context_quality_threshold (float): Quality threshold (default: 0.5)
    - context_similarity_threshold (float): Similarity threshold (default: 0.0)
    - max_retries (int): Max retries (default: 3)
    """
```
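
The `chunk_size`/`chunk_overlap` semantics are easiest to see with a concrete sketch. This pure-Python sliding window is only an illustration of how a document might be split into character chunks before contexts are assembled; it is not deepeval's actual chunker.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With overlap, each chunk repeats the tail of the previous one, so facts that straddle a chunk boundary still appear intact in at least one chunk.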

## Usage Examples

### Generate from Documents

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4")

# Generate goldens from documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=[
        "./docs/product_manual.pdf",
        "./docs/faq.txt",
        "./docs/user_guide.docx"
    ],
    max_goldens_per_context=2,
    include_expected_output=True
)

print(f"Generated {len(goldens)} goldens")
for golden in goldens[:3]:
    print(f"Input: {golden.input}")
    print(f"Expected: {golden.expected_output}\n")

# Save to file
synthesizer.save_as(
    file_type="json",
    directory="./synthetic_data",
    file_name="doc_goldens"
)
```

### Generate from Contexts

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# Generate from predefined contexts
contexts = [
    ["Our return policy allows 30-day full refunds"],
    ["Shipping takes 3-5 business days for US orders"],
    ["Premium members get free expedited shipping"]
]

goldens = synthesizer.generate_goldens_from_contexts(
    contexts=contexts,
    max_goldens_per_context=3,
    include_expected_output=True
)
```

### Generate from Scratch

```python
from deepeval.synthesizer import Synthesizer, StylingConfig

synthesizer = Synthesizer(
    styling_config=StylingConfig(
        scenario="Customer support for an e-commerce platform",
        task="Answer customer questions about products, shipping, and returns",
        input_format="Natural language questions",
        expected_output_format="Helpful, concise answers"
    )
)

# Generate from scratch using the styling config
goldens = synthesizer.generate_goldens_from_scratch(
    num_goldens=50
)

print(f"Generated {len(goldens)} synthetic goldens")
```

### Apply Evolution Strategies

```python
from deepeval.synthesizer import Synthesizer, EvolutionConfig, Evolution

# Configure evolution strategies
evolution_config = EvolutionConfig(
    num_evolutions=2,  # Apply 2 rounds of evolution
    evolutions={
        Evolution.REASONING: 0.3,     # 30% reasoning
        Evolution.MULTICONTEXT: 0.2,  # 20% multi-context
        Evolution.CONCRETIZING: 0.2,  # 20% concretizing
        Evolution.CONSTRAINED: 0.15,  # 15% constrained
        Evolution.COMPARATIVE: 0.15   # 15% comparative
    }
)

synthesizer = Synthesizer(evolution_config=evolution_config)

goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/guide.pdf"],
    max_goldens_per_context=3
)
```

### Quality Filtration

```python
from deepeval.synthesizer import Synthesizer, FiltrationConfig

# Configure quality filtration
filtration_config = FiltrationConfig(
    synthetic_input_quality_threshold=0.7,  # Higher quality threshold
    max_quality_retries=5,                  # More retry attempts
    critic_model="gpt-4"                    # Use GPT-4 as the quality critic
)

synthesizer = Synthesizer(
    filtration_config=filtration_config,
    cost_tracking=True  # Track API costs
)

goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[["High-quality context about AI"]],
    max_goldens_per_context=5
)

# Only goldens whose inputs pass the quality threshold are kept
```
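
Conceptually, filtration is a generate-score-retry loop: a candidate input is generated, the critic model scores it, and generation is retried until the score clears the threshold or the retry budget runs out. The sketch below illustrates that control flow with stub generator and critic functions; it is an assumption about the general shape of the loop, not deepeval's actual code.

```python
def generate_with_filtration(generate, score, quality_threshold=0.7, max_quality_retries=5):
    """Regenerate until the critic score clears the threshold or retries
    run out; keep the best-scoring candidate seen either way."""
    best, best_score = None, float("-inf")
    for _ in range(max_quality_retries):
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
        if s >= quality_threshold:
            break
    return best, best_score

# Stub generator/critic for illustration: quality improves on each retry.
candidates = iter([("q1", 0.4), ("q2", 0.6), ("q3", 0.8)])
def generate():
    return next(candidates)
def score(candidate):
    return candidate[1]

best, best_score = generate_with_filtration(generate, score)
print(best, best_score)  # ('q3', 0.8) 0.8
```

Raising `synthetic_input_quality_threshold` or `max_quality_retries` trades extra critic-model calls (and cost) for higher-quality inputs, which is why `cost_tracking=True` is useful alongside aggressive filtration.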

### Custom Context Construction

```python
from deepeval.synthesizer import Synthesizer, ContextConstructionConfig
from deepeval.models import OpenAIEmbeddingModel

# Configure context construction
context_config = ContextConstructionConfig(
    embedder=OpenAIEmbeddingModel(model="text-embedding-3-large"),
    chunk_size=512,                   # Smaller chunks
    chunk_overlap=50,                 # Some overlap between chunks
    max_contexts_per_document=5,
    min_context_length=2,             # At least 2 chunks per context
    max_context_length=4,             # At most 4 chunks per context
    context_quality_threshold=0.6,
    context_similarity_threshold=0.3  # Avoid very similar contexts
)

synthesizer = Synthesizer()

goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./large_document.pdf"],
    context_construction_config=context_config,
    max_goldens_per_context=3
)
```

### Evolve Existing Goldens

```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden

# Existing goldens
existing_goldens = [
    Golden(input="What is Python?", expected_output="Python is a programming language"),
    Golden(input="What is Java?", expected_output="Java is a programming language")
]

synthesizer = Synthesizer()

# Generate more goldens based on existing ones
new_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=existing_goldens,
    max_goldens_per_golden=3,  # Generate 3 variations per golden
    include_expected_output=True
)

print(f"Generated {len(new_goldens)} new goldens from {len(existing_goldens)} existing ones")
```

### Async Generation

```python
import asyncio

from deepeval.synthesizer import Synthesizer

async def generate_data():
    synthesizer = Synthesizer(
        async_mode=True,
        max_concurrent=50  # Higher concurrency
    )

    # Async generation
    goldens = await synthesizer.a_generate_goldens_from_docs(
        document_paths=["./doc1.pdf", "./doc2.pdf"],
        max_goldens_per_context=5
    )

    return goldens

# Run the coroutine
goldens = asyncio.run(generate_data())
```

### Save and Export

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=100)

# Save as JSON
synthesizer.save_as(
    file_type="json",
    directory="./data",
    file_name="synthetic_goldens"
)

# Save as CSV
synthesizer.save_as(
    file_type="csv",
    directory="./data",
    file_name="synthetic_goldens"
)

# Convert to a pandas DataFrame for analysis
df = synthesizer.to_pandas()
print(df.head())
print(df.describe())
```
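
Once saved as JSON, the goldens can be read back with the standard library. The layout sketched below (a list of objects with `input`/`expected_output` keys) is an assumption about the on-disk format, so inspect one file written by `save_as` before relying on these exact keys:

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: save_as(file_type="json") writes a list of golden objects.
# Verify against an actual saved file before relying on these keys.
sample = [
    {"input": "What is the return policy?", "expected_output": "30-day full refunds."},
    {"input": "How long does shipping take?", "expected_output": "3-5 business days."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "synthetic_goldens.json"
    path.write_text(json.dumps(sample))

    goldens = json.loads(path.read_text())
    inputs = [g["input"] for g in goldens]

print(len(goldens))  # 2
```

Reading the file back is handy for feeding the goldens into a separate evaluation pipeline without re-running generation.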

### Complete Example

```python
from deepeval.synthesizer import (
    Synthesizer,
    EvolutionConfig,
    Evolution,
    FiltrationConfig,
    StylingConfig,
    ContextConstructionConfig
)
from deepeval.models import GPTModel, OpenAIEmbeddingModel

# Configure the synthesizer with all options
synthesizer = Synthesizer(
    model=GPTModel(model="gpt-4"),
    async_mode=True,
    max_concurrent=20,
    evolution_config=EvolutionConfig(
        num_evolutions=2,
        evolutions={
            Evolution.REASONING: 0.4,
            Evolution.MULTICONTEXT: 0.3,
            Evolution.CONCRETIZING: 0.3
        }
    ),
    filtration_config=FiltrationConfig(
        synthetic_input_quality_threshold=0.7,
        max_quality_retries=3,
        critic_model="gpt-4"
    ),
    styling_config=StylingConfig(
        scenario="Technical support for software products",
        task="Help users troubleshoot issues",
        input_format="User problem descriptions",
        expected_output_format="Step-by-step troubleshooting guides"
    ),
    cost_tracking=True
)

# Generate high-quality synthetic data
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./technical_docs.pdf"],
    context_construction_config=ContextConstructionConfig(
        embedder=OpenAIEmbeddingModel(),
        chunk_size=1024,
        max_contexts_per_document=10
    ),
    max_goldens_per_context=2,
    include_expected_output=True
)

# Save results
synthesizer.save_as(
    file_type="json",
    directory="./synthetic_data",
    file_name="technical_support_goldens"
)

print(f"Generated {len(goldens)} high-quality synthetic goldens")
```