# Report Comparison

Compare multiple data profiling reports to identify differences, changes over time, or variations between datasets. This functionality enables tracking data drift, validating data transformations, and understanding how datasets evolve.

## Capabilities

### Compare Function

Primary function for comparing multiple ProfileReport objects or BaseDescription objects to generate comparative analysis.

```python { .api }
def compare(
    reports: Union[List[ProfileReport], List[BaseDescription]],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport:
    """
    Compare multiple profiling reports and generate comparison analysis.

    Parameters:
    - reports: list of ProfileReport or BaseDescription objects to compare
    - config: optional Settings object for comparison configuration
    - compute: whether to immediately compute comparison results

    Returns:
    ProfileReport containing comparison results and visualizations

    Raises:
    ValueError: if reports list is empty or contains incompatible types
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport, compare
import pandas as pd

# Create datasets
df1 = pd.read_csv('dataset_v1.csv')
df2 = pd.read_csv('dataset_v2.csv')

# Generate individual reports
report1 = ProfileReport(df1, title="Dataset Version 1")
report2 = ProfileReport(df2, title="Dataset Version 2")

# Compare reports
comparison_report = compare([report1, report2])

# Export comparison
comparison_report.to_file("comparison_analysis.html")

# Compare with custom configuration
from ydata_profiling.config import Settings
config = Settings()
config.title = "Dataset Evolution Analysis"
detailed_comparison = compare([report1, report2], config=config)
```
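
The `Raises` note in the signature above suggests checking the input list before comparing. The following is a minimal sketch of such a guard, not part of ydata-profiling: `checked_compare` and its `compare_fn` parameter are hypothetical names introduced here, and the checks mirror only the two conditions the docstring mentions (an empty list and mixed types).

```python
from typing import Any, Callable, Optional, Sequence


def checked_compare(
    reports: Sequence[Any],
    config: Optional[Any] = None,
    compare_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Validate `reports`, then delegate to a compare function.

    Hypothetical helper: `compare_fn` defaults to ydata_profiling.compare,
    imported lazily so the validation can be exercised without the library.
    """
    if len(reports) == 0:
        raise ValueError("reports must not be empty")

    # All entries should share one type (e.g. all ProfileReport objects).
    kinds = {type(report) for report in reports}
    if len(kinds) > 1:
        names = sorted(kind.__name__ for kind in kinds)
        raise ValueError(f"reports mix incompatible types: {names}")

    if compare_fn is None:
        from ydata_profiling import compare as compare_fn  # lazy import

    return compare_fn(list(reports), config=config)
```

Calling `checked_compare([report1, report2])` would then behave like `compare([report1, report2])`, while failing early with a readable message when the list is empty or mixed.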

### Report Instance Comparison

Compare one ProfileReport directly with another using the instance method.

```python { .api }
def compare(self, other: 'ProfileReport', config: Optional[Settings] = None) -> 'ProfileReport':
    """
    Compare this report with another ProfileReport instance.

    Parameters:
    - other: another ProfileReport to compare against
    - config: optional configuration for comparison analysis

    Returns:
    New ProfileReport containing comparison results
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# Create two reports
# (baseline_df and current_df are assumed to be existing pandas DataFrames)
baseline_report = ProfileReport(baseline_df, title="Baseline Dataset")
current_report = ProfileReport(current_df, title="Current Dataset")

# Compare using instance method
drift_analysis = baseline_report.compare(current_report)

# Generate drift report
drift_analysis.to_file("data_drift_report.html")

# Access comparison results
comparison_description = drift_analysis.get_description()
```

### Comparison Configuration

Configuration options for customizing comparison analysis behavior and output.

```python { .api }
class Settings:
    """
    Configuration class with comparison-specific settings.
    """

    # Comparison-specific configuration attributes
    title: str = "Report Comparison"
    pool_size: int = 0
    correlations: dict = {}
    missing_diagrams: dict = {}
```

**Usage Example:**

```python
from ydata_profiling import compare
from ydata_profiling.config import Settings

# Create custom comparison configuration
comparison_config = Settings()
comparison_config.title = "Monthly Data Quality Comparison"
comparison_config.pool_size = 4

# Apply configuration to comparison
# (jan_report, feb_report and mar_report are assumed to be existing ProfileReport objects)
reports = [jan_report, feb_report, mar_report]
quarterly_comparison = compare(
    reports,
    config=comparison_config,
    compute=True
)

quarterly_comparison.to_file("quarterly_data_quality.html")
```

### Multi-Report Comparison

Compare multiple reports simultaneously to identify trends and patterns across multiple datasets or time periods.

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport, compare

# Time series of datasets
monthly_reports = []
for month in ['jan', 'feb', 'mar', 'apr', 'may']:
    df = pd.read_csv(f'data_{month}.csv')
    report = ProfileReport(df, title=f"Data - {month.title()}")
    monthly_reports.append(report)

# Compare all monthly reports
trend_analysis = compare(monthly_reports)
trend_analysis.to_file("monthly_trends.html")

# Access trend data
trends = trend_analysis.get_description()
```
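
When one combined report over many months becomes hard to read, an alternative is to compare each month with the one before it. The sketch below only builds those pairs; `consecutive_pairs` is a hypothetical helper introduced here, not ydata-profiling API, and it works on any ordered sequence.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def consecutive_pairs(items: Sequence[T]) -> List[Tuple[T, T]]:
    """Return (previous, next) pairs in order, e.g. jan-feb, feb-mar."""
    return list(zip(items, items[1:]))


months = ["jan", "feb", "mar", "apr", "may"]
pairs = consecutive_pairs(months)
```

Applied to `monthly_reports` instead of the month names, each resulting pair could be passed to `compare([previous, current])` to produce one report per month-to-month step.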

### Comparison Types

Different comparison scenarios and their specific use cases.

**Data Drift Detection:**

```python
from ydata_profiling import ProfileReport

# Monitor data drift over time
production_baseline = ProfileReport(baseline_df, title="Production Baseline")
current_production = ProfileReport(current_df, title="Current Production")

drift_report = production_baseline.compare(current_production)
drift_report.to_file("drift_monitoring.html")
```
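
The HTML report is meant for visual inspection. For an automated check, one option is to compare a few summary statistics directly. The following is a minimal sketch under stated assumptions: `relative_changes` is a hypothetical helper, the statistics are plain dictionaries prepared by the caller rather than anything read from a comparison report, and the 10% threshold is arbitrary.

```python
from typing import Dict


def relative_changes(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.1,
) -> Dict[str, float]:
    """Return metrics whose relative change exceeds `threshold`.

    Only metrics present in both dicts are compared; metrics with a zero
    baseline are skipped because the relative change is undefined.
    """
    flagged = {}
    for name in baseline.keys() & current.keys():
        base = baseline[name]
        if base == 0:
            continue
        change = (current[name] - base) / abs(base)
        if abs(change) > threshold:
            flagged[name] = change
    return flagged


# Illustrative numbers only, not taken from a real report.
baseline_stats = {"mean": 10.0, "std": 2.0, "n_missing": 0.0}
current_stats = {"mean": 12.0, "std": 2.1, "n_missing": 5.0}
flagged = relative_changes(baseline_stats, current_stats)
```

With these inputs only `mean` is flagged: `std` moves by about 5%, and `n_missing` is skipped because its baseline is zero, so a real check would likely treat zero baselines separately.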

**A/B Testing Analysis:**

```python
from ydata_profiling import ProfileReport, compare

# Compare control vs treatment datasets
control_report = ProfileReport(control_df, title="Control Group")
treatment_report = ProfileReport(treatment_df, title="Treatment Group")

ab_comparison = compare([control_report, treatment_report])
ab_comparison.to_file("ab_test_analysis.html")
```

**Data Pipeline Validation:**

```python
from ydata_profiling import ProfileReport

# Compare before and after data transformations
raw_data_report = ProfileReport(raw_df, title="Raw Data")
processed_data_report = ProfileReport(processed_df, title="Processed Data")

pipeline_validation = raw_data_report.compare(processed_data_report)
pipeline_validation.to_file("pipeline_validation.html")
```

### Comparison Output Features

Key features available in comparison reports:

- **Statistical Differences**: Changes in descriptive statistics, distributions, and data quality metrics
- **Schema Evolution**: Added, removed, or modified columns and data types
- **Data Quality Changes**: New or resolved data quality issues and alerts
- **Correlation Changes**: Differences in variable relationships and dependencies
- **Missing Data Patterns**: Changes in missing data distribution and patterns
- **Sample Comparisons**: Side-by-side sample data for manual inspection
- **Duplicate Analysis**: Changes in duplicate row patterns and counts
- **Interactive Visualizations**: Comparative charts and graphs for visual analysis
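
The schema evolution item in the list above can also be checked programmatically before generating a report. The following is a minimal sketch, not part of ydata-profiling: `schema_diff` is a hypothetical helper that takes two column-to-dtype mappings (for a pandas DataFrame these could be built with `df.dtypes.astype(str).to_dict()`) and reports added, removed, and retyped columns.

```python
from typing import Dict, List, Tuple


def schema_diff(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, retyped) column names, each sorted."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    retyped = sorted(
        name for name in old.keys() & new.keys() if old[name] != new[name]
    )
    return added, removed, retyped


# Hypothetical column-to-dtype mappings for two dataset versions.
v1 = {"id": "int64", "price": "float64", "region": "object"}
v2 = {"id": "int64", "price": "object", "country": "object"}
added, removed, retyped = schema_diff(v1, v2)
```

Here `country` is reported as added, `region` as removed, and `price` as retyped. A rename shows up as one removal plus one addition, since the helper has no way to link the two names.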