# Report Comparison

Compare multiple data profiling reports to identify differences, changes over time, or variations between datasets. This functionality enables tracking data drift, validating data transformations, and understanding how datasets evolve.

## Capabilities

### Compare Function

Primary function for comparing multiple `ProfileReport` objects or `BaseDescription` objects to generate a comparative analysis.

```python { .api }
def compare(
    reports: Union[List[ProfileReport], List[BaseDescription]],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport:
    """
    Compare multiple profiling reports and generate comparison analysis.

    Parameters:
    - reports: list of ProfileReport or BaseDescription objects to compare
    - config: optional Settings object for comparison configuration
    - compute: whether to immediately compute comparison results

    Returns:
    ProfileReport containing comparison results and visualizations

    Raises:
    ValueError: if reports list is empty or contains incompatible types
    """
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport, compare
from ydata_profiling.config import Settings

# Create datasets
df1 = pd.read_csv('dataset_v1.csv')
df2 = pd.read_csv('dataset_v2.csv')

# Generate individual reports
report1 = ProfileReport(df1, title="Dataset Version 1")
report2 = ProfileReport(df2, title="Dataset Version 2")

# Compare reports
comparison_report = compare([report1, report2])

# Export comparison
comparison_report.to_file("comparison_analysis.html")

# Compare with custom configuration
config = Settings()
config.title = "Dataset Evolution Analysis"
detailed_comparison = compare([report1, report2], config=config)
```
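
The `Raises` clause above suggests that inputs are worth checking before `compare` is called. As a minimal sketch, the hypothetical helper below (`check_comparable` is an illustration, not part of ydata-profiling) rejects an empty list and lists that mix object types, so a pipeline can fail early with a clear message. The exceptions the library itself raises may differ from the ones chosen here.

```python
from typing import Sequence


def check_comparable(reports: Sequence[object]) -> None:
    """Hypothetical pre-flight check; not part of the ydata-profiling API."""
    if len(reports) == 0:
        raise ValueError("No reports to compare.")

    # All items should share one type, e.g. all ProfileReport
    # or all BaseDescription, never a mix of the two.
    kinds = {type(r) for r in reports}
    if len(kinds) > 1:
        names = sorted(k.__name__ for k in kinds)
        raise ValueError(f"Mixed input types: {names}")
```

Calling `check_comparable(reports)` right before `compare(reports)` keeps the validation in one place without changing how the comparison itself is produced.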

### Report Instance Comparison

Compare one `ProfileReport` directly with another using the instance method.

```python { .api }
def compare(self, other: 'ProfileReport', config: Optional[Settings] = None) -> 'ProfileReport':
    """
    Compare this report with another ProfileReport instance.

    Parameters:
    - other: another ProfileReport to compare against
    - config: optional configuration for comparison analysis

    Returns:
    New ProfileReport containing comparison results
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# baseline_df and current_df are assumed to be existing pandas DataFrames

# Create two reports
baseline_report = ProfileReport(baseline_df, title="Baseline Dataset")
current_report = ProfileReport(current_df, title="Current Dataset")

# Compare using instance method
drift_analysis = baseline_report.compare(current_report)

# Generate drift report
drift_analysis.to_file("data_drift_report.html")

# Access comparison results
comparison_description = drift_analysis.get_description()
```

### Comparison Configuration

Configuration options for customizing comparison analysis behavior and output.

```python { .api }
class Settings:
    """
    Configuration class with comparison-specific settings.
    """

    # Comparison-specific configuration attributes
    title: str = "Report Comparison"
    pool_size: int = 0
    correlations: dict = {}
    missing_diagrams: dict = {}
```

**Usage Example:**

```python
from ydata_profiling import compare
from ydata_profiling.config import Settings

# Create custom comparison configuration
comparison_config = Settings()
comparison_config.title = "Monthly Data Quality Comparison"
comparison_config.pool_size = 4

# Apply configuration to comparison
# (jan_report, feb_report, mar_report are assumed to be existing ProfileReport objects)
reports = [jan_report, feb_report, mar_report]
quarterly_comparison = compare(
    reports,
    config=comparison_config,
    compute=True
)

quarterly_comparison.to_file("quarterly_data_quality.html")
```

### Multi-Report Comparison

Compare several reports in a single call to identify trends and patterns across datasets or time periods.

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport, compare

# Time series of datasets
monthly_reports = []
for month in ['jan', 'feb', 'mar', 'apr', 'may']:
    df = pd.read_csv(f'data_{month}.csv')
    report = ProfileReport(df, title=f"Data - {month.title()}")
    monthly_reports.append(report)

# Compare all monthly reports
trend_analysis = compare(monthly_reports)
trend_analysis.to_file("monthly_trends.html")

# Access trend data
trends = trend_analysis.get_description()
```

### Comparison Types

Different comparison scenarios and their specific use cases.

**Data Drift Detection:**

```python
# Monitor data drift over time
production_baseline = ProfileReport(baseline_df, title="Production Baseline")
current_production = ProfileReport(current_df, title="Current Production")

drift_report = production_baseline.compare(current_production)
drift_report.to_file("drift_monitoring.html")
```
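
The HTML drift report is meant for visual inspection. For an automated check that runs alongside it, a rough numeric signal can be computed directly from the data. The sketch below is independent of ydata-profiling: `mean_shift` is a hypothetical helper, the sample values are made up for illustration, and the threshold of 1.0 is an arbitrary assumption.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Shift of the current mean, measured in baseline standard deviations.

    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    spread = pstdev(baseline)
    if spread == 0:
        raise ValueError("Baseline has zero variance; shift is undefined.")
    return (mean(current) - mean(baseline)) / spread


# Illustrative values for one numeric column in each dataset
baseline_values = [10.0, 12.0, 11.0, 13.0, 14.0]
current_values = [15.0, 17.0, 16.0, 18.0, 19.0]

shift = mean_shift(baseline_values, current_values)

# Flag the column when its mean moved by more than one baseline
# standard deviation (arbitrary threshold for this sketch).
drifted = abs(shift) > 1.0
```

A single-number check like this only captures a change in location. Changes in shape, missingness, or categories still need the full comparison report.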

**A/B Testing Analysis:**

```python
# Compare control vs treatment datasets
control_report = ProfileReport(control_df, title="Control Group")
treatment_report = ProfileReport(treatment_df, title="Treatment Group")

ab_comparison = compare([control_report, treatment_report])
ab_comparison.to_file("ab_test_analysis.html")
```

**Data Pipeline Validation:**

```python
# Compare before and after data transformations
raw_data_report = ProfileReport(raw_df, title="Raw Data")
processed_data_report = ProfileReport(processed_df, title="Processed Data")

pipeline_validation = raw_data_report.compare(processed_data_report)
pipeline_validation.to_file("pipeline_validation.html")
```

### Comparison Output Features

Key features available in comparison reports:

- **Statistical Differences**: Changes in descriptive statistics, distributions, and data quality metrics
- **Schema Evolution**: Added, removed, or modified columns and data types
- **Data Quality Changes**: New or resolved data quality issues and alerts
- **Correlation Changes**: Differences in variable relationships and dependencies
- **Missing Data Patterns**: Changes in missing data distribution and patterns
- **Sample Comparisons**: Side-by-side sample data for manual inspection
- **Duplicate Analysis**: Changes in duplicate row patterns and counts
- **Interactive Visualizations**: Comparative charts and graphs for visual analysis
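
To make the "Schema Evolution" item concrete, the sketch below classifies columns as added, removed, or retyped between two dataset versions. It is a standalone illustration using only the standard library: `schema_diff` is a hypothetical helper, not a ydata-profiling function, and the two schemas are made-up examples that map column names to dtype strings.

```python
from typing import Dict, List


def schema_diff(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify columns as added, removed, or retyped between two schemas.

    Each schema maps a column name to a dtype string.
    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c] != new[c]),
    }


# Illustrative schemas for two versions of the same dataset
schema_v1 = {"id": "int64", "price": "float64", "region": "object"}
schema_v2 = {"id": "int64", "price": "object", "country": "object"}

changes = schema_diff(schema_v1, schema_v2)
```

With pandas, a mapping of this shape can be built from a DataFrame with `df.dtypes.astype(str).to_dict()`, so the same check can run on the inputs before the reports are generated.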