# Pandas DataFrame Comparison

Core DataFrame comparison functionality for Pandas DataFrames, providing detailed statistical reporting, tolerance-based numeric comparisons, and comprehensive mismatch analysis. This is the primary comparison class in DataComPy.

## Capabilities

### Compare Class

The main comparison class for Pandas DataFrames that performs comprehensive comparison analysis with configurable tolerance settings and detailed reporting.

```python { .api }
class Compare(BaseCompare):
    """Comparison class for Pandas DataFrames.

    Both df1 and df2 should be dataframes containing all of the join_columns,
    with unique column names. Differences between values are compared to
    abs_tol + rel_tol * abs(df2['value']).
    """

    def __init__(
        self,
        df1: pd.DataFrame,
        df2: pd.DataFrame,
        join_columns: List[str] | str | None = None,
        on_index: bool = False,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True
    ):
        """
        Parameters:
        - df1: First DataFrame to compare
        - df2: Second DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - on_index: If True, join on DataFrame index instead of columns
        - abs_tol: Absolute tolerance for numeric comparisons (float or dict)
        - rel_tol: Relative tolerance for numeric comparisons (float or dict)
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```
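
The tolerance rule quoted in the class docstring can be illustrated without DataComPy at all. The sketch below restates `abs_tol + rel_tol * abs(df2['value'])` in plain Python for a single pair of values. `within_tolerance` is a hypothetical helper written for this example, not part of the library, and it does not model NaN handling or non-numeric columns.

```python
def within_tolerance(v1: float, v2: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True if v1 and v2 differ by at most abs_tol + rel_tol * |v2|.

    Hypothetical helper mirroring the rule in the Compare docstring;
    v2 plays the role of the df2 value.
    """
    return abs(v1 - v2) <= abs_tol + rel_tol * abs(v2)


# With zero tolerances (the defaults), only exact equality passes.
print(within_tolerance(10.0, 10.1))  # False

# The two terms are added together: 0.05 + 0.01 * 10.1 = 0.151, which covers the 0.1 gap.
print(within_tolerance(10.0, 10.1, abs_tol=0.05, rel_tol=0.01))  # True
```

Because the relative term scales with the second value only, the check is not symmetric: swapping the two inputs can change the result when `rel_tol` is non-zero. `numpy.isclose` uses a rule of the same shape, with `atol` and `rtol` in place of `abs_tol` and `rel_tol`.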

### Properties and Attributes

Access to comparison results and DataFrame metadata.

```python { .api }
# Properties
@property
def df1(self) -> pd.DataFrame:
    """Get the first dataframe."""

@property
def df2(self) -> pd.DataFrame:
    """Get the second dataframe."""

# Attributes (available after comparison)
df1_unq_rows: pd.DataFrame          # Rows only in df1
df2_unq_rows: pd.DataFrame          # Rows only in df2
intersect_rows: pd.DataFrame        # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column-by-column comparison statistics
```

### Column Information Methods

Methods to analyze column structure and relationships between DataFrames.

```python { .api }
def df1_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df1."""

def df2_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df2."""

def intersect_columns(self) -> OrderedSet[str]:
    """Get columns that are shared between the two dataframes."""

def all_columns_match(self) -> bool:
    """Check if all columns match between DataFrames."""
```
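
To make the relationship between these four methods concrete, here is a small plain-Python sketch of the same partitioning applied to two lists of column names. `split_columns` is a hypothetical helper; order-preserving lists stand in for `OrderedSet`, and the final check assumes that "all columns match" means neither side has a column the other lacks.

```python
from typing import List, Tuple


def split_columns(cols1: List[str], cols2: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Partition column names into (only in first, only in second, shared), keeping order."""
    set1, set2 = set(cols1), set(cols2)
    only1 = [c for c in cols1 if c not in set2]
    only2 = [c for c in cols2 if c not in set1]
    shared = [c for c in cols1 if c in set2]
    return only1, only2, shared


only1, only2, shared = split_columns(["id", "value", "status"], ["id", "value", "region"])
print(only1)   # ['status']
print(only2)   # ['region']
print(shared)  # ['id', 'value']

# Assumption: columns "match" when neither side has unique columns.
columns_match = not only1 and not only2
print(columns_match)  # False
```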

### Row Comparison Methods

Methods to analyze row-level differences and overlaps.

```python { .api }
def all_rows_overlap(self) -> bool:
    """Check if all rows are present in both DataFrames."""

def count_matching_rows(self) -> int:
    """Count the number of matching rows."""

def intersect_rows_match(self) -> bool:
    """Check if rows that exist in both DataFrames have matching values."""
```
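
The row-level checks can be pictured as a partition of rows by join key, followed by a value comparison on the shared keys. The sketch below does this in plain Python on lists of dicts rather than DataFrames. It assumes a single join column named `id` with unique values and exact equality (no tolerance); all variable names are illustrative and none of this is DataComPy code.

```python
rows1 = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 20.0},
    {"id": 3, "value": 30.0},
    {"id": 4, "value": 40.0},
]
rows2 = [
    {"id": 1, "value": 10.1},
    {"id": 2, "value": 20.0},
    {"id": 3, "value": 30.0},
    {"id": 5, "value": 50.0},
]

# Index each side by its join key (assumes keys are unique).
by_key1 = {r["id"]: r for r in rows1}
by_key2 = {r["id"]: r for r in rows2}

only_in_1 = sorted(by_key1.keys() - by_key2.keys())  # keys present only on the first side
only_in_2 = sorted(by_key2.keys() - by_key1.keys())  # keys present only on the second side
shared = sorted(by_key1.keys() & by_key2.keys())     # keys present on both sides

# Rows overlap fully when neither side has a key the other lacks.
all_rows_overlap = not only_in_1 and not only_in_2

# A shared row "matches" here when every field is exactly equal.
matching = [k for k in shared if by_key1[k] == by_key2[k]]
intersect_rows_match = len(matching) == len(shared)

print(only_in_1, only_in_2, shared)  # [4] [5] [1, 2, 3]
print(all_rows_overlap)              # False
print(len(matching))                 # 2
print(intersect_rows_match)          # False
```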

### Matching and Validation Methods

High-level methods to determine if DataFrames match according to various criteria.

```python { .api }
def matches(self, ignore_extra_columns: bool = False) -> bool:
    """
    Check if DataFrames match completely.

    Parameters:
    - ignore_extra_columns: If True, ignore columns that exist in only one DataFrame

    Returns:
    True if DataFrames match, False otherwise
    """

def subset(self) -> bool:
    """
    Check if df2 is a subset of df1.

    Returns:
    True if df2 is a subset of df1, False otherwise
    """
```
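
One way to think about these two methods is as combinations of the column-level and row-level checks. The sketch below encodes that reading in plain Python. The exact conditions are an assumption made for illustration and should be confirmed against the installed DataComPy version. `ComparisonSummary`, `matches`, and `subset` here are hypothetical stand-ins, not library code.

```python
from dataclasses import dataclass


@dataclass
class ComparisonSummary:
    """Hypothetical container for low-level results; not a DataComPy type."""
    df1_only_columns: int
    df2_only_columns: int
    df1_only_rows: int
    df2_only_rows: int
    intersect_rows_match: bool


def matches(s: ComparisonSummary, ignore_extra_columns: bool = False) -> bool:
    # Assumed rule: columns agree (unless ignored), rows fully overlap, shared rows agree.
    columns_ok = ignore_extra_columns or (s.df1_only_columns == 0 and s.df2_only_columns == 0)
    rows_overlap = s.df1_only_rows == 0 and s.df2_only_rows == 0
    return columns_ok and rows_overlap and s.intersect_rows_match


def subset(s: ComparisonSummary) -> bool:
    # Assumed rule: df2 contributes no columns or rows of its own, and shared rows agree.
    return s.df2_only_columns == 0 and s.df2_only_rows == 0 and s.intersect_rows_match


# df1 has one extra column; everything else lines up.
extra_column = ComparisonSummary(1, 0, 0, 0, True)
print(matches(extra_column))                             # False
print(matches(extra_column, ignore_extra_columns=True))  # True
print(subset(extra_column))                              # True
```

Under this reading, `ignore_extra_columns=True` only relaxes the column check; unmatched rows or differing values in shared rows still cause a mismatch.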

### Mismatch Analysis Methods

Methods to identify and analyze specific differences between DataFrames.

```python { .api }
def sample_mismatch(
    self,
    column: str,
    sample_count: int = 10,
    for_display: bool = False
) -> pd.DataFrame | None:
    """
    Get a sample of mismatched values for a specific column.

    Parameters:
    - column: Name of column to sample
    - sample_count: Number of mismatched rows to return
    - for_display: Format output for display purposes

    Returns:
    DataFrame with sample of mismatched rows, or None if no mismatches
    """

def all_mismatch(self, ignore_matching_cols: bool = False) -> pd.DataFrame:
    """
    Get all mismatched rows.

    Parameters:
    - ignore_matching_cols: If True, exclude columns that match completely

    Returns:
    DataFrame containing all rows with mismatches
    """
```

### Report Generation

Comprehensive reporting functionality with customizable output formats.

```python { .api }
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None
) -> str:
    """
    Generate comprehensive comparison report.

    Parameters:
    - sample_count: Number of sample mismatches to include
    - column_count: Number of columns to include in detailed stats
    - html_file: Path to save HTML report (optional)
    - template_path: Custom template path (optional)

    Returns:
    String containing the formatted comparison report
    """
```

## Usage Examples

### Basic Comparison

```python
import pandas as pd
import datacompy

# Create test data
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Perform comparison
compare = datacompy.Compare(df1, df2, join_columns=['id'])

# Check results
print(f"DataFrames match: {compare.matches()}")
print(f"Rows only in df1: {len(compare.df1_unq_rows)}")
print(f"Rows only in df2: {len(compare.df2_unq_rows)}")
```

### Tolerance-Based Comparison

```python
# Compare with tolerance for numeric columns
# (values match when |df1 - df2| <= abs_tol + rel_tol * |df2|)
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1,  # Absolute term of the allowed difference
    rel_tol=0.05  # Relative term (5% of |df2 value|), added to abs_tol
)

# Per-column tolerance
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol={'value': 0.2, 'default': 0.1},
    rel_tol={'value': 0.1, 'default': 0.05}
)
```
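
The dict form of `abs_tol` and `rel_tol` maps column names to tolerances. The sketch below shows one plausible way such a setting could be resolved for a given column. It assumes that the `'default'` key is the fallback for columns not listed explicitly, and that a dict with no `'default'` falls back to zero. Both points are assumptions for illustration, and `resolve_tolerance` is a hypothetical helper rather than a DataComPy function.

```python
from typing import Dict, Union

Tolerance = Union[float, Dict[str, float]]


def resolve_tolerance(tol: Tolerance, column: str) -> float:
    """Return the tolerance that applies to `column`.

    Assumed behavior: a plain number applies to every column; a dict is looked up
    by column name, then by the 'default' key, then falls back to 0.
    """
    if isinstance(tol, dict):
        return tol.get(column, tol.get("default", 0))
    return tol


abs_tol = {"value": 0.2, "default": 0.1}
print(resolve_tolerance(abs_tol, "value"))   # 0.2
print(resolve_tolerance(abs_tol, "amount"))  # 0.1
print(resolve_tolerance(0.05, "value"))      # 0.05
```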

### String Comparison Options

```python
# Ignore case and whitespace in string comparisons
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    ignore_case=True,
    ignore_spaces=True
)
```

### Detailed Reporting

```python
# Generate detailed report
report = compare.report(sample_count=20, column_count=15)
print(report)

# Save HTML report
compare.report(html_file='comparison_report.html')

# Get specific mismatch samples
value_mismatches = compare.sample_mismatch('value', sample_count=5)
print(value_mismatches)
```