# Configuration

The configuration system controls analysis depth, statistical computations, visualizations, and report output formats, with fine-grained settings for each stage of the profiling process.

## Capabilities

### Settings Class

Main configuration class controlling profiling behavior and report generation.

```python { .api }
class Settings:
    def __init__(self, **kwargs):
        """
        Initialize Settings with configuration parameters.

        Parameters:
        - **kwargs: configuration parameters for various analysis components
        """

    # Core configuration sections
    dataset: DatasetConfig
    variables: VariablesConfig
    correlations: CorrelationsConfig
    interactions: InteractionsConfig
    plot: PlotConfig
    html: HtmlConfig
    style: StyleConfig

    # Global settings
    title: str = "Profiling Report"
    pool_size: int = 0
    progress_bar: bool = True
    lazy: bool = True
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

# `df` is assumed to be an existing DataFrame

# Create custom configuration
config = Settings()
config.title = "Custom Dataset Analysis"
config.pool_size = 4
config.progress_bar = True

# Apply configuration to report
report = ProfileReport(df, config=config)
report.to_file("custom_report.html")
```

### Configuration Loading

Load configuration from a YAML file, or select a built-in preset when creating the report.

```python { .api }
class Config:
    @staticmethod
    def get_config(config_file: Optional[Union[str, Path]] = None) -> Settings:
        """
        Load configuration from file or return default configuration.

        Parameters:
        - config_file: path to YAML configuration file

        Returns:
        Settings object with loaded configuration
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Config

# Load from configuration file
config = Config.get_config("my_config.yaml")
report = ProfileReport(df, config=config)

# Use preset configurations
minimal_report = ProfileReport(df, minimal=True)
explorative_report = ProfileReport(df, explorative=True)
sensitive_report = ProfileReport(df, sensitive=True)
```

### Dataset Configuration

Configuration for dataset-level metadata and processing options.

```python { .api }
class DatasetConfig:
    """Configuration for dataset-level settings."""

    # Dataset metadata
    description: str = ""
    creator: str = ""
    author: str = ""
    copyright_holder: str = ""
    copyright_year: str = ""
    url: str = ""

    # Processing options
    sample: Optional[dict] = None
    duplicates: Optional[dict] = None
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

config = Settings()
config.dataset.description = "Customer transaction data for Q4 2023"
config.dataset.creator = "Data Science Team"
config.dataset.author = "John Doe"

report = ProfileReport(df, config=config)
```

### Variables Configuration

Configuration for variable-specific analysis settings, grouped by data type.

```python { .api }
class VariablesConfig:
    """Configuration for variable-specific analysis."""

    # Variable type configurations
    descriptions: dict = {}

    # Type-specific settings
    num: NumVarsConfig
    cat: CatVarsConfig
    bool: BoolVarsConfig
    text: TextVarsConfig
    file: FileVarsConfig
    path: PathVarsConfig
    image: ImageVarsConfig
    url: UrlVarsConfig
    timeseries: TimeseriesVarsConfig
```

```python { .api }
class NumVarsConfig:
    """Numeric variables configuration."""

    low_categorical_threshold: int = 5
    chi_squared_threshold: float = 0.999
    skewness_threshold: int = 20
    kurtosis_threshold: int = 20

class CatVarsConfig:
    """Categorical variables configuration."""

    length: bool = True
    characters: bool = True
    words: bool = True
    cardinality_threshold: int = 50

class TextVarsConfig:
    """Text variables configuration."""

    length: bool = True
    characters: bool = True
    words: bool = True
    redact: bool = False
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

config = Settings()

# Configure numeric variables
config.variables.num.low_categorical_threshold = 10
config.variables.num.skewness_threshold = 15

# Configure categorical variables
config.variables.cat.cardinality_threshold = 100
config.variables.cat.length = True

# Configure text variables
config.variables.text.redact = True  # Hide sensitive text

report = ProfileReport(df, config=config)
```

### Correlation Configuration

Configuration for correlation analysis and visualization.

```python { .api }
class CorrelationsConfig:
    """Configuration for correlation analysis."""

    pearson: CorrelationConfig
    spearman: CorrelationConfig
    kendall: CorrelationConfig
    cramers: CorrelationConfig
    phik: CorrelationConfig
    auto: CorrelationConfig

class CorrelationConfig:
    """Individual correlation method configuration."""

    calculate: bool = True
    warn_high_cardinality: bool = True
    threshold: float = 0.9
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

config = Settings()

# Enable/disable specific correlation methods
config.correlations.pearson.calculate = True
config.correlations.spearman.calculate = True
config.correlations.kendall.calculate = False

# Set correlation thresholds
config.correlations.pearson.threshold = 0.8
config.correlations.auto.warn_high_cardinality = True

report = ProfileReport(df, config=config)
```

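To illustrate what a correlation `threshold` does, here is a minimal sketch that computes Pearson correlation for every column pair and reports the pairs whose absolute coefficient exceeds the threshold. The functions `pearson` and `highly_correlated` and the sample data are hypothetical. The library's own implementation and its exact comparison rule may differ.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; assumes equal lengths and non-constant columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated(
    columns: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Return column pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            flagged.append((a, b))
    return flagged


data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "shoe_size": [40.0, 38.0, 43.0, 39.0],
}

print(highly_correlated(data, threshold=0.9))  # [('height_cm', 'height_in')]
```

Lowering the threshold flags more pairs, so a value such as `0.8` trades fewer missed relationships for more entries to review.
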
### Plot Configuration

Configuration for visualizations and plotting options.

```python { .api }
class PlotConfig:
    """Configuration for plot generation."""

    # Plot settings
    histogram: dict = {}
    correlation: dict = {}
    missing: dict = {}

    # Image settings
    dpi: int = 800
    image_format: str = "svg"
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

config = Settings()

# Configure plot settings
config.plot.dpi = 300
config.plot.image_format = "png"

# Configure histogram settings
config.plot.histogram = {
    "bins": 50,
    "max_bins": 250
}

# Configure correlation plots
config.plot.correlation = {
    "cmap": "RdYlBu_r",
    "bad": "#000000"
}

report = ProfileReport(df, config=config)
```

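For raster formats such as PNG, `dpi` determines how many pixels each plot contains: the pixel dimensions are the figure size in inches multiplied by the DPI. The short sketch below works through that arithmetic. The helper `pixel_size` and the 6 x 4 inch figure size are assumptions chosen for illustration, not values taken from the library.

```python
from typing import Tuple


def pixel_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a raster image rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed 6 x 4 inch figure at two DPI settings
print(pixel_size(6, 4, 800))  # (4800, 3200)
print(pixel_size(6, 4, 300))  # (1800, 1200)
```

Dropping from 800 to 300 DPI reduces the pixel count by roughly a factor of seven, which tends to matter when many plots are embedded in one report.
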
### HTML Configuration

Configuration for HTML report generation and styling.

```python { .api }
class HtmlConfig:
    """Configuration for HTML report generation."""

    # Report structure
    minify_html: bool = True
    use_local_assets: bool = True
    inline: bool = True

    # Navigation and layout
    navbar_show: bool = True
    full_width: bool = False

    # Content sections
    style: dict = {}
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

config = Settings()

# Configure HTML output
config.html.minify_html = False  # Keep HTML readable
config.html.full_width = True    # Use full browser width
config.html.navbar_show = True   # Show navigation bar

# Custom styling
config.html.style = {
    "primary_color": "#337ab7",
    "logo": "https://company.com/logo.png"
}

report = ProfileReport(df, config=config)
```

### Spark Configuration

Configuration for Spark DataFrame processing.

```python { .api }
class SparkSettings:
    def __init__(self, **kwargs):
        """
        Initialize Spark-specific configuration.

        Parameters:
        - **kwargs: Spark configuration parameters
        """

    # Spark-specific settings
    executor_memory: str = "2g"
    executor_cores: int = 2
    max_result_size: str = "1g"
```

**Usage Example:**

```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport
from ydata_profiling.config import SparkSettings

# Configure Spark settings
spark_config = SparkSettings()
spark_config.executor_memory = "4g"
spark_config.executor_cores = 4

# Use with Spark DataFrame
spark = SparkSession.builder.appName("Profiling").getOrCreate()
spark_df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

report = ProfileReport(spark_df, config=spark_config)
```

### Configuration Files

YAML configuration file format for persistent settings.

**Example Configuration File (`config.yaml`):**

```yaml
title: "Production Data Report"
pool_size: 8
progress_bar: true

dataset:
  description: "Customer transaction dataset"
  creator: "Data Engineering Team"

variables:
  num:
    low_categorical_threshold: 10
    skewness_threshold: 20
  cat:
    cardinality_threshold: 50
  text:
    redact: false

correlations:
  pearson:
    calculate: true
    threshold: 0.9
  spearman:
    calculate: true
  kendall:
    calculate: false

plot:
  dpi: 300
  image_format: "png"

html:
  minify_html: true
  full_width: false
```

**Usage with Configuration File:**

```python
from ydata_profiling import ProfileReport

# Load configuration from file
report = ProfileReport(df, config_file="config.yaml")
report.to_file("production_report.html")
```

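Each nested mapping in the YAML file corresponds to one level of attribute access in the Python examples, so `variables` → `num` → `low_categorical_threshold` in the file lines up with `config.variables.num.low_categorical_threshold` in code. The sketch below shows that correspondence with a plain nested dictionary of the kind a YAML parser would produce. The helper `get_setting` is hypothetical and is not part of the library.

```python
from typing import Any, Dict

# Nested dictionary mirroring part of the example config.yaml
config_dict: Dict[str, Any] = {
    "title": "Production Data Report",
    "variables": {"num": {"low_categorical_threshold": 10}},
    "correlations": {"pearson": {"calculate": True, "threshold": 0.9}},
    "plot": {"dpi": 300, "image_format": "png"},
}


def get_setting(settings: Dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dictionary using a dot-separated path."""
    node: Any = settings
    for key in dotted_path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


print(get_setting(config_dict, "variables.num.low_categorical_threshold"))  # 10
print(get_setting(config_dict, "plot.image_format"))                        # png
```
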
### SparkSettings Class

Specialized configuration class for Spark DataFrames with performance-focused defaults.

```python { .api }
class SparkSettings(Settings):
    """
    Specialized Settings class for Spark DataFrames with optimized configurations.

    Inherits from Settings but with performance-focused defaults that disable
    computationally expensive operations for large-scale Spark datasets.
    """

    # Performance optimizations
    infer_dtypes: bool = False
    correlations: Dict[str, bool] = {
        "spearman": True,
        "pearson": True,
        "auto": False,  # Disabled for performance
        "phi_k": False,
        "cramers": False,
        "kendall": False
    }

    # Disabled heavy computations
    interactions_continuous: bool = False
    missing_diagrams: Dict[str, bool] = {
        "bar": False,
        "matrix": False,
        "dendrogram": False,
        "heatmap": False
    }

    # Reduced sampling
    samples_tail: int = 0
    samples_random: int = 0
```

**Usage Example:**

```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport
from ydata_profiling.config import SparkSettings

# Create Spark DataFrame
spark = SparkSession.builder.appName("Profiling").getOrCreate()
spark_df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Use SparkSettings for optimal performance
config = SparkSettings()
config.title = "Large Dataset Analysis"

report = ProfileReport(spark_df, config=config)
report.to_file("spark_report.html")
```

### Configuration Methods

Methods on `Settings` for loading and updating configuration.

```python { .api }
def update(self, updates: dict) -> 'Settings':
    """
    Merge updates with existing configuration.

    Parameters:
    - updates: dictionary with configuration updates

    Returns:
    Updated Settings instance
    """

@staticmethod
def from_file(config_file: Union[Path, str]) -> 'Settings':
    """
    Create Settings from YAML configuration file.

    Parameters:
    - config_file: path to YAML configuration file

    Returns:
    Settings instance with loaded configuration
    """

@property
def primary_color(self) -> str:
    """
    Get primary color for backward compatibility.

    Returns:
    Primary color from style configuration
    """
```

**Usage Example:**

```python
from pathlib import Path

from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings

# Load from file (a str path works as well)
config = Settings.from_file(Path("custom_config.yaml"))

# Update specific settings
updates = {
    "title": "Updated Report Title",
    "plot": {
        "dpi": 600,
        "image_format": "png"
    },
    "vars": {
        "cat": {
            "redact": True
        }
    }
}

updated_config = config.update(updates)

# Use updated configuration
report = ProfileReport(df, config=updated_config)
```

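The merge semantics of `update` can be pictured as a recursive dictionary merge: nested sections are combined key by key, so an update that sets only `plot.dpi` leaves `plot.image_format` untouched. The following is a minimal sketch of that idea using plain dictionaries. `merge_settings` is a hypothetical helper, and the assumption that the library merges recursively in this way is not confirmed by the text above.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_settings(base: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dictionary with `updates` merged recursively into `base`."""
    merged = deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key
            merged[key] = merge_settings(merged[key], value)
        else:
            # Scalars (or mismatched types) are replaced outright
            merged[key] = deepcopy(value)
    return merged


defaults = {"title": "Profiling Report", "plot": {"dpi": 800, "image_format": "svg"}}
changes = {"title": "Updated Report Title", "plot": {"dpi": 600}}

merged = merge_settings(defaults, changes)
print(merged)
# {'title': 'Updated Report Title', 'plot': {'dpi': 600, 'image_format': 'svg'}}
```

Returning a new dictionary rather than mutating the input mirrors the documented signature, where `update` returns a `Settings` instance that the caller assigns to a new name.
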
### Preset Configurations

Built-in configuration presets for common use cases.

**Built-in Presets:**

```python
# Minimal mode - fast profiling with reduced computation
ProfileReport(df, minimal=True)

# Explorative mode - comprehensive analysis with all features
ProfileReport(df, explorative=True)

# Sensitive mode - privacy-aware profiling
ProfileReport(df, sensitive=True)

# Time-series mode - specialized for time-series data
ProfileReport(df, tsmode=True, sortby='timestamp')
```

**Preset Details:**

- **Minimal**: Disables correlations, missing diagrams, and type inference for speed
- **Explorative**: Enables advanced text analysis, file analysis, and memory profiling
- **Sensitive**: Redacts categorical/text values and disables sample display
- **Time-series**: Enables autocorrelation analysis and time-based sorting
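
One way to think about presets is as named sets of overrides layered on top of the defaults, with explicit user settings applied last. The sketch below models the Minimal and Sensitive presets from the list above in that style. The flat key names (`correlations`, `redact_values`, and so on) and the `resolve` helper are hypothetical simplifications, not the library's actual configuration keys.

```python
from typing import Any, Dict

# Hypothetical flat settings; key names are illustrative only
DEFAULTS: Dict[str, Any] = {
    "correlations": True,
    "missing_diagrams": True,
    "infer_dtypes": True,
    "redact_values": False,
    "show_samples": True,
}

# Each preset lists only the settings it overrides
PRESETS: Dict[str, Dict[str, Any]] = {
    "minimal": {"correlations": False, "missing_diagrams": False, "infer_dtypes": False},
    "sensitive": {"redact_values": True, "show_samples": False},
}


def resolve(*preset_names: str, **overrides: Any) -> Dict[str, Any]:
    """Apply presets in order on top of the defaults, then explicit overrides."""
    settings = dict(DEFAULTS)
    for name in preset_names:
        settings.update(PRESETS[name])
    settings.update(overrides)
    return settings


# Combine two presets, then re-enable correlations explicitly
print(resolve("minimal", "sensitive", correlations=True))
```

In this model the order of precedence is defaults, then presets, then explicit overrides, so a single keyword argument can undo part of a preset without abandoning the rest of it.
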
551
```