# YData Profiling

A comprehensive Python library that provides one-line Exploratory Data Analysis (EDA) for pandas DataFrames. YData Profiling generates detailed profile reports with statistical analysis, data quality alerts, correlations, missing-data patterns, and interactive visualizations, turning hours of manual data exploration into automated, publication-ready reports.

## Package Information

- **Package Name**: ydata-profiling
- **Language**: Python
- **Installation**: `pip install ydata-profiling`
- **Backward Compatibility**: Previously published as `pandas-profiling` (now deprecated)

## Core Imports

```python
from ydata_profiling import ProfileReport
```

Common imports for advanced usage:

```python
from ydata_profiling import ProfileReport, compare, __version__
from ydata_profiling.config import Settings, SparkSettings
```

## Basic Usage

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load your data
df = pd.read_csv('your_data.csv')

# Generate a comprehensive report with one line
report = ProfileReport(df, title='Dataset Analysis Report')

# Export the report to a standalone HTML file
report.to_file('data_report.html')

# Display inline in a Jupyter notebook
report.to_notebook_iframe()

# Render as interactive Jupyter widgets
report.to_widgets()
```

## Architecture

YData Profiling uses a modular architecture for extensible data analysis:

- **ProfileReport**: Main orchestrator class managing the analysis pipeline and report generation
- **Summarizers**: Statistical computation engines (pandas-based, Spark-compatible)
- **Type System**: Intelligent data type inference via the visions library
- **Configuration System**: Comprehensive settings for customizing analysis depth and output
- **Report Generation**: Multi-format output system (HTML, JSON, widgets) with templating
- **Backend Support**: Pandas and Spark DataFrame compatibility for scalable analysis

This design enables automated EDA workflows, integration with data pipelines, and customization for domain-specific analysis requirements.

## Capabilities

### Core Profiling

Primary functionality for generating comprehensive data profile reports from DataFrames, including statistical analysis, data quality assessment, and automated report generation.

```python { .api }
class ProfileReport:
    def __init__(
        self,
        df: Optional[Union[pd.DataFrame, sDataFrame]] = None,
        minimal: bool = False,
        tsmode: bool = False,
        sortby: Optional[str] = None,
        sensitive: bool = False,
        explorative: bool = False,
        sample: Optional[dict] = None,
        config_file: Optional[Union[Path, str]] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        type_schema: Optional[dict] = None,
        **kwargs
    ): ...

    def to_file(self, output_file: Union[str, Path], silent: bool = True): ...
    def to_html(self) -> str: ...
    def to_json(self) -> str: ...
    def to_notebook_iframe(self): ...
    def to_widgets(self): ...
```

[Core Profiling](./core-profiling.md)

### Report Comparison

Compare multiple profiling reports to identify differences between datasets or changes in a dataset over time.

```python { .api }
def compare(
    reports: Union[List[ProfileReport], List[BaseDescription]],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport: ...
```

[Report Comparison](./report-comparison.md)

### Configuration System

Comprehensive configuration system for customizing analysis depth, statistical computations, visualizations, and report output formats.

```python { .api }
class Settings:
    def __init__(self, **kwargs): ...

class SparkSettings:
    def __init__(self, **kwargs): ...

class Config:
    @staticmethod
    def get_config() -> Settings: ...
```

[Configuration](./configuration.md)

### Data Analysis Components

Detailed statistical analysis components, including correlation analysis, missing-data patterns, duplicate detection, and specialized analysis for different data types.

```python { .api }
class BaseDescription: ...
class BaseSummarizer: ...
class ProfilingSummarizer: ...

def format_summary(description: BaseDescription) -> dict: ...
```

[Analysis Components](./analysis-components.md)

### Pandas Integration

Integration with pandas via monkey patching: importing the package adds a `profile_report()` method directly to pandas DataFrames.

```python { .api }
def profile_report(
    self,
    minimal: bool = False,
    tsmode: bool = False,
    sortby: Optional[str] = None,
    sensitive: bool = False,
    explorative: bool = False,
    **kwargs
) -> ProfileReport: ...
```

[Pandas Integration](./pandas-integration.md)

### Serialization and Persistence

Save and load ProfileReport objects for reuse, storage, and sharing across sessions.

```python { .api }
def dumps(self) -> bytes: ...
def loads(self, data: bytes) -> Union['ProfileReport', 'SerializeReport']: ...
def dump(self, output_file: Union[Path, str]) -> None: ...
def load(self, load_file: Union[Path, str]) -> Union['ProfileReport', 'SerializeReport']: ...
```

**Capabilities:** Report serialization, persistent storage, cross-session report sharing, and efficient report caching for large datasets.

### Great Expectations Integration

Generate data validation expectations directly from profiling results for ongoing data quality monitoring.

```python { .api }
def to_expectation_suite(
    self,
    suite_name: Optional[str] = None,
    data_context: Optional[Any] = None,
    save_suite: bool = True,
    run_validation: bool = True,
    build_data_docs: bool = True,
    handler: Optional[Handler] = None
) -> Any: ...
```

**Capabilities:** Automated expectation generation, data validation pipeline integration, and continuous data quality monitoring.

### Version and Package Information

Access package version and metadata for compatibility and debugging purposes.

```python { .api }
__version__: str  # Package version string
```

**Usage:** Version checking, compatibility validation, and debugging support.

### Console Interface

Command-line interface for generating profiling reports directly from CSV files without writing Python code.

```bash { .api }
ydata_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE
```

**Capabilities:** Direct CSV profiling, automated report generation, CI/CD pipeline integration, and shell script automation.

[Console Interface](./console-interface.md)

## Types

```python { .api }
from typing import Optional, Union, List, Dict, Any
from pathlib import Path
import pandas as pd

# Core DataFrame types
try:
    from pyspark.sql import DataFrame as sDataFrame
except ImportError:
    from typing import TypeVar
    sDataFrame = TypeVar("sDataFrame")

# Configuration types
class Settings:
    dataset: DatasetConfig
    variables: VariablesConfig
    correlations: CorrelationsConfig
    plot: PlotConfig
    html: HtmlConfig
    style: StyleConfig

class SparkSettings(Settings):
    """Specialized Settings for Spark DataFrames with performance optimizations"""
    pass

# Analysis result types
class BaseDescription:
    """Complete dataset description with analysis results"""
    pass

class BaseAnalysis:
    """Base analysis metadata"""
    pass

# Summarizer types
class BaseSummarizer:
    """Base statistical summarizer interface"""
    pass

class ProfilingSummarizer(BaseSummarizer):
    """Default profiling summarizer implementation"""
    pass

# Alert system types
from enum import Enum

class AlertType(Enum):
    """Types of data quality alerts"""
    pass

class Alert:
    """Individual data quality alert"""
    pass
```