# pandas-profiling

A Python library that provides comprehensive one-line Exploratory Data Analysis (EDA) for pandas DataFrames. It generates detailed profile reports including statistical summaries, data quality warnings, visualizations, and insights that go far beyond basic `df.describe()` functionality.

## Package Information

- **Package Name**: pandas-profiling
- **Language**: Python
- **Installation**: `pip install pandas-profiling`
- **Optional extras**: `pip install pandas-profiling[notebook,unicode]`

## Core Imports

```python
from pandas_profiling import ProfileReport
```

For dataset comparison:

```python
from pandas_profiling import compare
```

To enable the `DataFrame.profile_report()` method:

```python
import pandas_profiling  # Adds profile_report() method to DataFrames
```

For configuration:

```python
from pandas_profiling.config import Settings
```

## Basic Usage

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Load your data
df = pd.read_csv('your_data.csv')

# Generate profile report
profile = ProfileReport(df, title="Data Profile Report")

# View in Jupyter notebook
profile.to_widgets()

# Or export to HTML file
profile.to_file("profile_report.html")

# Or get as JSON
json_data = profile.to_json()
```

## Architecture

pandas-profiling is built around a modular architecture:

- **ProfileReport**: Central class that orchestrates data analysis and report generation
- **Configuration System**: Flexible settings management through the Settings class and configuration models
- **Analysis Pipeline**: Automated type inference, statistical analysis, and visualization generation
- **Export System**: Multiple output formats (HTML, JSON, Jupyter widgets)
- **pandas Integration**: Automatic DataFrame method extension for seamless workflow integration

## Types

```python { .api }
from typing import Any, Dict, List, Optional, Union, Tuple
from pathlib import Path
import pandas as pd
from visions import VisionsTypeset

# Key classes from pandas_profiling
class Settings: ...  # Configuration management class
class BaseSummarizer: ...  # Summary generation interface
```

## Capabilities

### Profile Report Generation

The core functionality for creating comprehensive data analysis reports from pandas DataFrames.

```python { .api }
class ProfileReport:
    def __init__(
        self,
        df: Optional[pd.DataFrame] = None,
        minimal: bool = False,
        explorative: bool = False,
        sensitive: bool = False,
        dark_mode: bool = False,
        orange_mode: bool = False,
        tsmode: bool = False,
        sortby: Optional[str] = None,
        sample: Optional[dict] = None,
        config_file: Union[Path, str] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        **kwargs
    ):
        """
        Generate a ProfileReport based on a pandas DataFrame.

        Parameters:
        - df: pandas DataFrame to analyze
        - minimal: use minimal computation mode for faster processing
        - explorative: enable advanced analysis features
        - sensitive: enable privacy-aware mode for sensitive data
        - dark_mode: apply dark theme styling
        - orange_mode: apply orange theme styling
        - tsmode: enable time series analysis mode
        - sortby: column name for time series sorting
        - sample: optional sample data dict with name, caption, data
        - config_file: path to YAML configuration file
        - lazy: compute analysis when needed (default True)
        - typeset: custom type inference system
        - summarizer: custom summary generation system
        - config: Settings object for configuration
        - **kwargs: additional configuration options
        """
```
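
As a rough illustration of how these flags are usually combined, the sketch below picks constructor keyword arguments from the size of the data. The helper name `choose_profile_kwargs` and the 100,000-cell cutoff are assumptions made for this example and are not part of pandas-profiling; only `minimal`, `explorative`, and `title` come from the usage shown in this document.

```python
from typing import Any, Dict


def choose_profile_kwargs(
    n_rows: int,
    n_cols: int,
    title: str = "Data Profile Report",
    max_cells: int = 100_000,  # hypothetical cutoff; tune it for your hardware
) -> Dict[str, Any]:
    """Return ProfileReport keyword arguments suited to the data size."""
    is_large = n_rows * n_cols > max_cells
    return {
        "title": title,
        # Large frames: fall back to the cheaper minimal mode.
        "minimal": is_large,
        # Small frames: the extra explorative analyses are affordable.
        "explorative": not is_large,
    }


kwargs = choose_profile_kwargs(n_rows=500, n_cols=12)
print(kwargs)
```

The resulting dictionary can be unpacked into the constructor, for example `ProfileReport(df, **choose_profile_kwargs(*df.shape))`.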

### Report Export and Display

Methods for outputting and displaying the generated profile report.

```python { .api }
class ProfileReport:
    def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
        """
        Export report to HTML or JSON file.

        Parameters:
        - output_file: path for output file (.html or .json extension)
        - silent: if False, open the written file in the default browser (default True)
        """

    def to_html(self) -> str:
        """
        Get HTML representation of the report.

        Returns:
        str: Complete HTML report as string
        """

    def to_json(self) -> str:
        """
        Get JSON representation of the report.

        Returns:
        str: Complete report data as JSON string
        """

    def to_widgets(self) -> Any:
        """
        Display report as interactive Jupyter widgets.

        Returns:
        Widget object for Jupyter notebook display
        """

    def to_notebook_iframe(self) -> None:
        """
        Display report as embedded HTML iframe in Jupyter notebook.
        """
```
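
Because `to_file` selects the output format from the file extension, writing both formats is a matter of calling it twice. The following is a minimal sketch; `export_report` is a hypothetical helper, not a library function, and it assumes the output directory already exists.

```python
from pathlib import Path
from typing import Any, List


def export_report(profile: Any, out_dir: Path, stem: str = "profile_report") -> List[Path]:
    """Write the report as HTML and as JSON, returning the target paths.

    `profile` is expected to be a ProfileReport; any object with a
    compatible `to_file` method works. `out_dir` must already exist.
    """
    written = []
    for suffix in (".html", ".json"):
        target = out_dir / f"{stem}{suffix}"
        # The extension of `target` decides which format is written.
        profile.to_file(target)
        written.append(target)
    return written
```

A call such as `export_report(profile, Path("reports"))` would then produce `reports/profile_report.html` and `reports/profile_report.json`.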

### Data Access and Analysis

Methods for accessing specific analysis results and data insights.

```python { .api }
class ProfileReport:
    def get_description(self) -> dict:
        """
        Get the complete analysis description dictionary.

        Returns:
        dict: Complete analysis results and metadata
        """

    def get_duplicates(self) -> Optional[pd.DataFrame]:
        """
        Get DataFrame containing duplicate rows.

        Returns:
        DataFrame or None: Duplicate rows if any exist
        """

    def get_sample(self) -> dict:
        """
        Get sample data information.

        Returns:
        dict: Sample data with metadata
        """

    def get_rejected_variables(self) -> set:
        """
        Get set of variable names that were rejected from analysis.

        Returns:
        set: Variable names excluded from the report
        """
```
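
These accessors make it possible to act on the analysis programmatically instead of reading the rendered report. The sketch below gathers two headline facts; `quality_summary` and its result keys are hypothetical names chosen for illustration, and the duplicates table may be truncated depending on the report's duplicate settings.

```python
from typing import Any, Dict


def quality_summary(profile: Any) -> Dict[str, Any]:
    """Collect a few data-quality facts from a ProfileReport-like object."""
    duplicates = profile.get_duplicates()  # DataFrame, or None when unavailable
    rejected = profile.get_rejected_variables()
    return {
        # Number of rows in the duplicates table (0 when there is no table).
        "duplicate_table_rows": 0 if duplicates is None else len(duplicates),
        # Sorted so the output is stable between runs.
        "rejected_variables": sorted(rejected),
    }
```

A pipeline could, for example, fail a data validation step when `quality_summary(profile)["rejected_variables"]` is not empty.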

### Report Comparison

Functionality for comparing multiple datasets and generating comparison reports.

```python { .api }
def compare(
    reports: List[ProfileReport],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport:
    """
    Compare multiple ProfileReport objects.

    Parameters:
    - reports: list of ProfileReport objects to compare
    - config: optional Settings object for the merged report
    - compute: recompute profiles using config (recommended for different settings)

    Returns:
    ProfileReport: Comparison report highlighting differences and similarities
    """

class ProfileReport:
    def compare(
        self,
        other: "ProfileReport",
        config: Optional[Settings] = None
    ) -> "ProfileReport":
        """
        Compare this report with another ProfileReport.

        Parameters:
        - other: ProfileReport object to compare against
        - config: optional Settings object for the merged report

        Returns:
        ProfileReport: Comparison report
        """
```

### Configuration Management

Comprehensive configuration system for customizing analysis and report generation.

```python { .api }
class Settings:
    def __init__(self):
        """
        Create new Settings configuration object with default values.
        """

    def update(self, updates: dict) -> "Settings":
        """
        Update configuration with new values.

        Parameters:
        - updates: dictionary of configuration updates

        Returns:
        Settings: New Settings object with updated values
        """

    @classmethod
    def from_file(cls, config_file: Union[Path, str]) -> "Settings":
        """
        Load configuration from YAML file.

        Parameters:
        - config_file: path to YAML configuration file

        Returns:
        Settings: Configuration loaded from file
        """

class Config:
    @staticmethod
    def get_arg_groups(key: str) -> dict:
        """
        Get predefined configuration group.

        Parameters:
        - key: configuration group name ('sensitive', 'explorative', 'dark_mode', 'orange_mode')

        Returns:
        dict: Configuration dictionary for the specified group
        """

    @staticmethod
    def shorthands(kwargs: dict, split: bool = True) -> Tuple[dict, dict]:
        """
        Process configuration shortcuts and expand them.

        Parameters:
        - kwargs: configuration dictionary with potential shortcuts
        - split: whether to split into shorthand and regular configs

        Returns:
        tuple: (shorthand_config, regular_config) dictionaries
        """
```
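
Since `update` accepts a nested dictionary and returns a new object, it can help to picture the operation as a recursive merge in which only the keys named in `updates` change and their siblings are kept. The sketch below shows that idea in plain Python. It is an illustration under that assumption, not the library's implementation; `deep_merge` is a hypothetical helper and the sample values are not the library defaults.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `updates` merged into `base` recursively."""
    merged = deepcopy(base)  # leave the caller's dict untouched
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: descend instead of overwriting.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative starting values only.
defaults = {
    "vars": {
        "num": {"quantiles": [0.25, 0.5, 0.75]},
        "cat": {"characters": True, "words": True},
    }
}
custom = deep_merge(defaults, {"vars": {"num": {"quantiles": [0.1, 0.5, 0.9]}}})
print(custom["vars"])
```

Here the `cat` settings survive the update even though only `num` was mentioned, and `defaults` itself is left unchanged.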

### DataFrame Integration

Automatic extension of pandas DataFrame with profiling functionality.

```python { .api }
# Automatically available after importing pandas_profiling
class DataFrame:
    def profile_report(self, **kwargs) -> ProfileReport:
        """
        Generate a ProfileReport for this DataFrame.

        Parameters:
        - **kwargs: arguments passed to ProfileReport constructor

        Returns:
        ProfileReport: Analysis report for this DataFrame
        """
```
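
The method is only attached once `pandas_profiling` has been imported, so code that receives a DataFrame from elsewhere may want to fail with a clear message when it is missing. A small sketch of that check follows; `profile_dataframe` is a hypothetical wrapper introduced for this example.

```python
from typing import Any


def profile_dataframe(df: Any, **kwargs: Any) -> Any:
    """Call df.profile_report(), with a clear error if it is not registered."""
    if not hasattr(df, "profile_report"):
        raise RuntimeError(
            "DataFrame.profile_report() is unavailable; "
            "run `import pandas_profiling` first."
        )
    # Keyword arguments are forwarded to the ProfileReport constructor.
    return df.profile_report(**kwargs)
```

With the import in place, `profile_dataframe(df, title="Data Profile Report")` behaves like `ProfileReport(df, title="Data Profile Report")`.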

### Cache Management

Methods for managing analysis computation caching.

```python { .api }
class ProfileReport:
    def invalidate_cache(self, subset: Optional[str] = None) -> None:
        """
        Clear cached computations to force recomputation.

        Parameters:
        - subset: optional cache subset to clear (None clears all)
        """
```
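
A typical reason to clear the cache is that settings changed after the report was first computed, so a later export would otherwise reuse stale results. The sketch below pairs the two calls; `refresh_report` is a hypothetical helper, and the valid `subset` names are not listed in this document, so the example passes the value through unchanged.

```python
from typing import Any, Optional


def refresh_report(profile: Any, output_file: str, subset: Optional[str] = None) -> None:
    """Drop cached results, then export again so the file reflects current settings."""
    # subset=None clears every cache; a named subset clears only that part.
    profile.invalidate_cache(subset)
    profile.to_file(output_file)
```

Calling `refresh_report(profile, "profile_report.html")` after changing the configuration forces the analysis to run again before the file is written.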

## Configuration Options

The Settings class provides extensive configuration through nested models:

### Variable Analysis Configuration
- **NumVars**: Numerical variable analysis settings (quantiles, thresholds)
- **CatVars**: Categorical variable analysis settings (length, character analysis)
- **BoolVars**: Boolean variable analysis settings
- **TimeseriesVars**: Time series analysis configuration
- **FileVars**: File path analysis settings
- **PathVars**: Path analysis settings
- **ImageVars**: Image analysis settings
- **UrlVars**: URL analysis settings

### Visualization Configuration
- **Plot**: General plotting configuration
- **Histogram**: Histogram visualization settings
- **CorrelationPlot**: Correlation plot settings
- **MissingPlot**: Missing data visualization
- **Html**: HTML output formatting
- **Style**: Visual styling and themes

### Analysis Configuration
- **Correlations**: Correlation analysis settings
- **Duplicates**: Duplicate detection configuration
- **Interactions**: Variable interaction analysis
- **Samples**: Data sampling configuration
- **Variables**: General variable analysis settings

### Output Configuration
- **Notebook**: Jupyter notebook integration settings
- **Iframe**: HTML iframe configuration

## Enums and Constants

```python { .api }
from enum import Enum

class Theme(Enum):
    """Available visual themes for reports."""
    flatly = "flatly"
    united = "united"
    # Additional theme values available

class ImageType(Enum):
    """Supported image output formats."""
    png = "png"
    svg = "svg"

class IframeAttribute(Enum):
    """HTML iframe attribute options."""
    srcdoc = "srcdoc"
    src = "src"
```

## Usage Examples

### Time Series Analysis

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Load time series data
df = pd.read_csv('timeseries_data.csv')
df['date'] = pd.to_datetime(df['date'])

# Generate time series report
profile = ProfileReport(
    df,
    title="Time Series Analysis",
    tsmode=True,
    sortby='date'
)
profile.to_file("timeseries_report.html")
```

### Sensitive Data Handling

```python
from pandas_profiling import ProfileReport

# df is a pandas DataFrame, loaded as in Basic Usage

# Generate privacy-aware report
profile = ProfileReport(
    df,
    title="Sensitive Data Report",
    sensitive=True  # Redacts potentially sensitive information
)
profile.to_widgets()
```

### Custom Configuration

```python
from pandas_profiling import ProfileReport
from pandas_profiling.config import Settings

# Create custom configuration
config = Settings()
config = config.update({
    'vars': {
        'num': {'quantiles': [0.1, 0.5, 0.9]},
        'cat': {'characters': True, 'words': True}
    },
    'correlations': {
        'pearson': {'threshold': 0.8}
    }
})

# df is a pandas DataFrame, loaded as in Basic Usage
profile = ProfileReport(df, config=config)
profile.to_file("custom_report.html")
```

### Comparing Datasets

```python
from pandas_profiling import ProfileReport, compare

# df_before and df_after are pandas DataFrames holding the two versions of the data

# Create reports for different datasets
report1 = ProfileReport(df_before, title="Before Processing")
report2 = ProfileReport(df_after, title="After Processing")

# Generate comparison report
comparison = compare([report1, report2])
comparison.to_file("comparison_report.html")
```

### Command Line Usage

```bash
# Generate report from CSV file
pandas_profiling --title "My Report" data.csv report.html

# Use custom configuration
pandas_profiling --config_file config.yaml data.csv report.html
```
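
When the command line tool is driven from a script, building the argument list in Python avoids shell quoting problems with titles that contain spaces. The sketch below uses only the `--title` and `--config_file` flags shown above; `build_profiling_command` is a hypothetical helper, not part of the package.

```python
import shlex
from typing import List, Optional


def build_profiling_command(
    input_file: str,
    output_file: str,
    title: Optional[str] = None,
    config_file: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the pandas_profiling command line tool."""
    command = ["pandas_profiling"]
    if title is not None:
        command += ["--title", title]
    if config_file is not None:
        command += ["--config_file", config_file]
    # Positional arguments come last: input data, then output report.
    command += [input_file, output_file]
    return command


command = build_profiling_command("data.csv", "report.html", title="My Report")
print(shlex.join(command))  # shell-safe rendering, handy for logging
```

The list can be passed directly to `subprocess.run(command, check=True)`, which skips the shell entirely.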