Tessl Tile for pypi/datacompy@0.18.0

# Pandas DataFrame Comparison

Core DataFrame comparison functionality for Pandas DataFrames, providing detailed statistical reporting, tolerance-based numeric comparisons, and comprehensive mismatch analysis. This is the primary comparison class in DataComPy.

## Capabilities

### Compare Class

The main comparison class for Pandas DataFrames that performs comprehensive comparison analysis with configurable tolerance settings and detailed reporting.

```python { .api }
class Compare(BaseCompare):
    """Comparison class for Pandas DataFrames.

    Both df1 and df2 should be dataframes containing all of the join_columns,
    with unique column names. Differences between values are compared to
    abs_tol + rel_tol * abs(df2['value']).
    """

    def __init__(
        self,
        df1: pd.DataFrame,
        df2: pd.DataFrame,
        join_columns: List[str] | str | None = None,
        on_index: bool = False,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True
    ):
        """
        Parameters:
        - df1: First DataFrame to compare
        - df2: Second DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - on_index: If True, join on DataFrame index instead of columns
        - abs_tol: Absolute tolerance for numeric comparisons (float or dict)
        - rel_tol: Relative tolerance for numeric comparisons (float or dict)
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```
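
The tolerance rule quoted in the class docstring can be checked by hand. The following is a minimal sketch in plain Python, not DataComPy's implementation: `within_tolerance` is a hypothetical helper, and treating the boundary as inclusive (`<=`) is an assumption. Because the relative term scales with the `df2` value, the check is not symmetric in its two arguments.

```python
def within_tolerance(v1: float, v2: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True when |v1 - v2| is within abs_tol + rel_tol * |v2|.

    Hypothetical helper mirroring the rule quoted in the Compare docstring;
    v2 plays the role of the df2 value.
    """
    return abs(v1 - v2) <= abs_tol + rel_tol * abs(v2)


# 10.0 vs 10.1 differ by about 0.1: rejected with zero tolerance, accepted with abs_tol=0.2.
exact = within_tolerance(10.0, 10.1)
loose = within_tolerance(10.0, 10.1, abs_tol=0.2)

# The relative term uses the second argument, so swapping the inputs can change the answer.
scaled_by_90 = within_tolerance(100.0, 90.0, rel_tol=0.105)   # allowed difference is 9.45
scaled_by_100 = within_tolerance(90.0, 100.0, rel_tol=0.105)  # allowed difference is 10.5
```

In practice this means the choice of which DataFrame is passed as `df2` can matter when `rel_tol` is non-zero.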

### Properties and Attributes

Access to comparison results and DataFrame metadata.

```python { .api }
# Properties
@property
def df1(self) -> pd.DataFrame:
    """Get the first dataframe."""

@property
def df2(self) -> pd.DataFrame:
    """Get the second dataframe."""

# Attributes (available after comparison)
df1_unq_rows: pd.DataFrame  # Rows only in df1
df2_unq_rows: pd.DataFrame  # Rows only in df2
intersect_rows: pd.DataFrame  # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column-by-column comparison statistics
```
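
To make the three row attributes concrete, here is a rough sketch of the partition they describe, using plain dictionaries keyed by a join column instead of DataFrames. `partition_rows` and the sample records are hypothetical, and the sketch assumes join keys are unique within each input.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def partition_rows(
    rows1: List[Row], rows2: List[Row], key: str
) -> Tuple[List[Row], List[Row], List[Tuple[Row, Row]]]:
    """Split rows into (only in rows1, only in rows2, pairs present in both)."""
    by_key1 = {row[key]: row for row in rows1}
    by_key2 = {row[key]: row for row in rows2}
    only1 = [row for k, row in by_key1.items() if k not in by_key2]
    only2 = [row for k, row in by_key2.items() if k not in by_key1]
    shared = [(row, by_key2[k]) for k, row in by_key1.items() if k in by_key2]
    return only1, only2, shared


rows1 = [{"id": 1, "value": 10.0}, {"id": 2, "value": 20.0}, {"id": 4, "value": 40.0}]
rows2 = [{"id": 1, "value": 10.1}, {"id": 2, "value": 20.0}, {"id": 5, "value": 50.0}]

only1, only2, shared = partition_rows(rows1, rows2, key="id")
print([row["id"] for row in only1])         # ids present only on the first side
print([row["id"] for row in only2])         # ids present only on the second side
print([left["id"] for left, _ in shared])   # ids present on both sides
```

The three lists play the roles of `df1_unq_rows`, `df2_unq_rows`, and `intersect_rows` respectively; the real attributes are DataFrames and carry additional match-indicator columns.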

### Column Information Methods

Methods to analyze column structure and relationships between DataFrames.

```python { .api }
def df1_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df1."""

def df2_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df2."""

def intersect_columns(self) -> OrderedSet[str]:
    """Get columns that are shared between the two dataframes."""

def all_columns_match(self) -> bool:
    """Check if all columns match between DataFrames."""
```
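
The column methods amount to order-preserving set operations over the two column lists. A small sketch with plain lists follows; the helper `column_sets` is hypothetical, and lowercasing names first is an assumption modeled on `cast_column_names_lower=True`.

```python
from typing import Dict, List


def column_sets(cols1: List[str], cols2: List[str]) -> Dict[str, List[str]]:
    """Return the unique and shared column names, keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    lower1 = list(dict.fromkeys(c.lower() for c in cols1))
    lower2 = list(dict.fromkeys(c.lower() for c in cols2))
    return {
        "df1_unq": [c for c in lower1 if c not in lower2],
        "df2_unq": [c for c in lower2 if c not in lower1],
        "intersect": [c for c in lower1 if c in lower2],
    }


sets = column_sets(["ID", "Value", "Status"], ["id", "value", "region"])

# Columns "match" in this sketch only when neither side has extras.
all_columns_match = not sets["df1_unq"] and not sets["df2_unq"]
```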

### Row Comparison Methods

Methods to analyze row-level differences and overlaps.

```python { .api }
def all_rows_overlap(self) -> bool:
    """Check if all rows are present in both DataFrames."""

def count_matching_rows(self) -> int:
    """Count the number of matching rows."""

def intersect_rows_match(self) -> bool:
    """Check if rows that exist in both DataFrames have matching values."""
```

### Matching and Validation Methods

High-level methods to determine if DataFrames match according to various criteria.

```python { .api }
def matches(self, ignore_extra_columns: bool = False) -> bool:
    """
    Check if DataFrames match completely.

    Parameters:
    - ignore_extra_columns: If True, ignore columns that exist in only one DataFrame

    Returns:
    True if DataFrames match, False otherwise
    """

def subset(self) -> bool:
    """
    Check if df2 is a subset of df1.

    Returns:
    True if df2 is a subset of df1, False otherwise
    """
```
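
One way to read `matches()` and `subset()` is as combinations of the column and row checks listed earlier. The sketch below encodes that reading with a hypothetical `Summary` record holding counts instead of DataFrames; the exact composition is an assumption and should be confirmed against the DataComPy source.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical container for the outcome of the individual checks."""
    df1_unq_columns: int        # number of columns only in df1
    df2_unq_columns: int        # number of columns only in df2
    df1_unq_rows: int           # number of rows only in df1
    df2_unq_rows: int           # number of rows only in df2
    intersect_rows_match: bool  # shared rows agree within tolerance


def matches(s: Summary, ignore_extra_columns: bool = False) -> bool:
    """Assumed rule: columns agree (unless ignored), rows overlap, shared rows agree."""
    columns_ok = ignore_extra_columns or (s.df1_unq_columns == 0 and s.df2_unq_columns == 0)
    rows_overlap = s.df1_unq_rows == 0 and s.df2_unq_rows == 0
    return columns_ok and rows_overlap and s.intersect_rows_match


def subset(s: Summary) -> bool:
    """Assumed rule: df2 adds no columns or rows that df1 lacks, and shared rows agree."""
    return s.df2_unq_columns == 0 and s.df2_unq_rows == 0 and s.intersect_rows_match


# df1 has one extra column and three extra rows; everything in df2 is also in df1.
summary = Summary(df1_unq_columns=1, df2_unq_columns=0,
                  df1_unq_rows=3, df2_unq_rows=0, intersect_rows_match=True)
is_match = matches(summary, ignore_extra_columns=True)
is_subset = subset(summary)
```

Under this reading, ignoring extra columns does not rescue a comparison in which one side has extra rows, while `subset()` tolerates extras on the `df1` side only.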

### Mismatch Analysis Methods

Methods to identify and analyze specific differences between DataFrames.

```python { .api }
def sample_mismatch(
    self,
    column: str,
    sample_count: int = 10,
    for_display: bool = False
) -> pd.DataFrame | None:
    """
    Get a sample of mismatched values for a specific column.

    Parameters:
    - column: Name of column to sample
    - sample_count: Number of mismatched rows to return
    - for_display: Format output for display purposes

    Returns:
    DataFrame with sample of mismatched rows, or None if no mismatches
    """

def all_mismatch(self, ignore_matching_cols: bool = False) -> pd.DataFrame:
    """
    Get all mismatched rows.

    Parameters:
    - ignore_matching_cols: If True, exclude columns that match completely

    Returns:
    DataFrame containing all rows with mismatches
    """
```
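
The idea behind sampling mismatches for one column can be sketched without DataFrames. In the following illustration, the standalone function, its `key` and `seed` parameters, the `_df1`/`_df2` output names, and the use of exact inequality (no tolerance) are all assumptions made for the example rather than DataComPy behavior.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def sample_mismatch(
    pairs: List[Tuple[Row, Row]],
    key: str,
    column: str,
    sample_count: int = 10,
    seed: Optional[int] = None,
) -> Optional[List[Row]]:
    """Return up to sample_count mismatched rows for one column, or None."""
    mismatched = [
        {key: left[key], f"{column}_df1": left[column], f"{column}_df2": right[column]}
        for left, right in pairs
        if left[column] != right[column]
    ]
    if not mismatched:
        return None
    # Draw without replacement, capped at the number of mismatches available.
    return random.Random(seed).sample(mismatched, min(sample_count, len(mismatched)))


# Each pair is (row from df1, row from df2) sharing the same join key.
pairs = [
    ({"id": 1, "value": 10.0}, {"id": 1, "value": 10.1}),
    ({"id": 2, "value": 20.0}, {"id": 2, "value": 20.0}),
    ({"id": 3, "value": 30.0}, {"id": 3, "value": 31.0}),
]
sample = sample_mismatch(pairs, key="id", column="value", sample_count=5, seed=0)
```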

### Report Generation

Comprehensive reporting functionality with customizable output formats.

```python { .api }
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None
) -> str:
    """
    Generate comprehensive comparison report.

    Parameters:
    - sample_count: Number of sample mismatches to include
    - column_count: Number of columns to include in detailed stats
    - html_file: Path to save HTML report (optional)
    - template_path: Custom template path (optional)

    Returns:
    String containing the formatted comparison report
    """
```

## Usage Examples

### Basic Comparison

```python
import pandas as pd
import datacompy

# Create test data
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Perform comparison
compare = datacompy.Compare(df1, df2, join_columns=['id'])

# Check results
print(f"DataFrames match: {compare.matches()}")
print(f"Rows only in df1: {len(compare.df1_unq_rows)}")
print(f"Rows only in df2: {len(compare.df2_unq_rows)}")
```

### Tolerance-Based Comparison

```python
# Compare with tolerance for numeric columns
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1,  # Allow 0.1 absolute difference
    rel_tol=0.05  # Allow 5% relative difference
)

# Per-column tolerance
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol={'value': 0.2, 'default': 0.1},
    rel_tol={'value': 0.1, 'default': 0.05}
)
```
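
When tolerances are given as dictionaries, each column needs a single resolved value. The sketch below shows one plausible lookup order: the column's own entry, then the `'default'` entry, then zero. `resolve_tolerance` is a hypothetical helper, and the final fallback to zero is an assumption rather than documented behavior.

```python
from typing import Dict, Union

Tolerance = Union[float, Dict[str, float]]


def resolve_tolerance(tol: Tolerance, column: str) -> float:
    """Pick the tolerance that applies to one column."""
    if isinstance(tol, dict):
        # Column-specific entry first, then 'default', then 0 (assumed fallback).
        return float(tol.get(column, tol.get("default", 0.0)))
    # A scalar applies to every column.
    return float(tol)


abs_tol = {"value": 0.2, "default": 0.1}
value_tol = resolve_tolerance(abs_tol, "value")
other_tol = resolve_tolerance(abs_tol, "price")   # "price" is a hypothetical column name
scalar_tol = resolve_tolerance(0.05, "value")
```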

### String Comparison Options

```python
# Ignore case and whitespace in string comparisons
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    ignore_case=True,
    ignore_spaces=True
)
```
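
The effect of the two string options can be pictured as a normalization step applied to both sides before comparing. This is a minimal sketch, assuming `ignore_spaces` trims leading and trailing whitespace only and `ignore_case` folds case; `normalize` and `strings_equal` are hypothetical helpers, not part of DataComPy.

```python
def normalize(text: str, ignore_spaces: bool = False, ignore_case: bool = False) -> str:
    """Apply the selected normalizations to a single string value."""
    if ignore_spaces:
        text = text.strip()      # assumed: leading/trailing whitespace only
    if ignore_case:
        text = text.casefold()   # case-insensitive form
    return text


def strings_equal(a: str, b: str, ignore_spaces: bool = False, ignore_case: bool = False) -> bool:
    """Compare two strings after normalizing both sides the same way."""
    return normalize(a, ignore_spaces, ignore_case) == normalize(b, ignore_spaces, ignore_case)


strict = strings_equal("Active ", "active")
relaxed = strings_equal("Active ", "active", ignore_spaces=True, ignore_case=True)
spaces_only = strings_equal("Active ", "active", ignore_spaces=True)
```

The two options are independent: trimming alone still leaves `"Active"` and `"active"` unequal.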

### Detailed Reporting

```python
# Generate detailed report
report = compare.report(sample_count=20, column_count=15)
print(report)

# Save HTML report
compare.report(html_file='comparison_report.html')

# Get specific mismatch samples
value_mismatches = compare.sample_mismatch('value', sample_count=5)
print(value_mismatches)
```
