Comprehensive DataFrame comparison library providing functionality equivalent to SAS's PROC COMPARE for Python with support for Pandas, Spark, Polars, Snowflake, and distributed computing
npx @tessl/cli install tessl/pypi-datacompy@0.18.00
# DataComPy

DataComPy is a comprehensive DataFrame comparison library that provides functionality equivalent to SAS's PROC COMPARE for Python data analysis workflows. It supports comparison across multiple DataFrame backends including Pandas, Spark, Polars, Snowflake (via Snowpark), Dask (via Fugue), and DuckDB (via Fugue), making it a versatile tool for data validation and quality assurance.

## Package Information

- **Package Name**: datacompy
- **Language**: Python
- **Installation**: `pip install datacompy`
- **Optional dependencies**:
  - Spark: `pip install datacompy[spark]`
  - Fugue (Dask/DuckDB): `pip install datacompy[fugue]`
  - Snowflake: `pip install datacompy[snowflake]`

## Core Imports

```python
import datacompy
```

Specific comparison classes:

```python
from datacompy import Compare, PolarsCompare, SparkSQLCompare, SnowflakeCompare
```

Utility functions:

```python
from datacompy import columns_equal, is_match, report, all_columns_match
```

## Basic Usage

```python
import pandas as pd
import datacompy

# Create sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [85.5, 92.0, 78.5, 91.0]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'score': [85.5, 92.1, 78.5, 89.0]
})

# Compare DataFrames
compare = datacompy.Compare(df1, df2, join_columns=['id'])

# Check if DataFrames match
if compare.matches():
    print("DataFrames are identical")
else:
    print("DataFrames differ")
    print(compare.report())

# Access comparison results
print(f"Rows in df1 only: {len(compare.df1_unq_rows)}")
print(f"Rows in df2 only: {len(compare.df2_unq_rows)}")
print(f"Shared rows: {len(compare.intersect_rows)}")
```

## Architecture

DataComPy uses a consistent architecture across all DataFrame backends:

- **BaseCompare**: Abstract base class defining the common comparison interface
- **Backend-Specific Classes**: Concrete implementations for each DataFrame library (Compare for Pandas, PolarsCompare for Polars, etc.)
- **Unified API**: All comparison classes share the same method signatures and behavior patterns
- **Tolerance Support**: Configurable absolute and relative tolerance for numeric comparisons
- **Distributed Computing**: Fugue integration enables scaling across multiple compute backends

This design allows seamless switching between DataFrame libraries while maintaining identical functionality and API consistency.

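To make the tolerance bullet concrete, here is a small pure-Python sketch of how an absolute and a relative tolerance can combine. It assumes the rule used by `numpy.isclose`, `|a - b| <= abs_tol + rel_tol * |b|`. Whether every DataComPy backend applies exactly this rule is an assumption here, and `within_tolerance` is a hypothetical helper, not part of the library.

```python
def within_tolerance(a: float, b: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True when a and b differ by no more than the combined tolerance.

    Assumed rule (numpy.isclose style): |a - b| <= abs_tol + rel_tol * |b|
    """
    return abs(a - b) <= abs_tol + rel_tol * abs(b)


# With no tolerance, any difference counts as a mismatch
exact = within_tolerance(92.0, 92.1)

# An absolute tolerance of 0.2 absorbs the 0.1 gap
absolute = within_tolerance(92.0, 92.1, abs_tol=0.2)

# A 1% relative tolerance allows roughly 0.92 of drift around 92.1
relative = within_tolerance(92.0, 92.1, rel_tol=0.01)

# A 0.01% relative tolerance allows only about 0.009, so the gap is too large
too_tight = within_tolerance(92.0, 92.1, rel_tol=0.0001)

print(exact, absolute, relative, too_tight)
```

Under this rule the two tolerances add together, so setting both widens the accepted band rather than requiring each check to pass separately.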
## Capabilities

### Pandas DataFrame Comparison

Core DataFrame comparison functionality for Pandas, including detailed statistical reporting, tolerance-based numeric comparisons, and comprehensive mismatch analysis.

```python { .api }
class Compare(BaseCompare):
    def __init__(
        self,
        df1: pd.DataFrame,
        df2: pd.DataFrame,
        join_columns: List[str] | str | None = None,
        on_index: bool = False,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True
    ): ...

    def matches(self, ignore_extra_columns: bool = False) -> bool: ...
    def report(
        self,
        sample_count: int = 10,
        column_count: int = 10,
        html_file: str | None = None,
        template_path: str | None = None
    ) -> str: ...
```

[Pandas DataFrame Comparison](./pandas-comparison.md)

### Multi-Backend DataFrame Comparison

Comparison classes for Polars, Spark, and Snowflake DataFrames, providing the same functionality as Pandas comparison but optimized for each backend's specific characteristics and capabilities.

```python { .api }
class PolarsCompare(BaseCompare): ...
class SparkSQLCompare(BaseCompare): ...
class SnowflakeCompare(BaseCompare): ...
```

[Multi-Backend Comparison](./multi-backend-comparison.md)

### Distributed DataFrame Comparison

Fugue-powered distributed comparison functions that work across multiple backends including Dask, DuckDB, Ray, and Arrow, enabling scalable comparison of large datasets.

```python { .api }
def is_match(
    df1: AnyDataFrame,
    df2: AnyDataFrame,
    join_columns: str | List[str],
    abs_tol: float = 0,
    rel_tol: float = 0,
    df1_name: str = "df1",
    df2_name: str = "df2",
    ignore_spaces: bool = False,
    ignore_case: bool = False,
    cast_column_names_lower: bool = True,
    parallelism: int | None = None,
    strict_schema: bool = False
) -> bool: ...

def report(
    df1: AnyDataFrame,
    df2: AnyDataFrame,
    join_columns: str | List[str],
    abs_tol: float = 0,
    rel_tol: float = 0,
    df1_name: str = "df1",
    df2_name: str = "df2",
    ignore_spaces: bool = False,
    ignore_case: bool = False,
    cast_column_names_lower: bool = True,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    parallelism: int | None = None
) -> str: ...
```

[Distributed Comparison](./distributed-comparison.md)

### Column-Level Comparison Utilities

Low-level functions for comparing individual columns and performing specialized comparisons, useful for custom comparison logic and integration with other data processing workflows.

```python { .api }
def columns_equal(
    col_1: pd.Series[Any],
    col_2: pd.Series[Any],
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False
) -> pd.Series[bool]: ...

def calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) -> float: ...
```

[Column Utilities](./column-utilities.md)

### Reporting and Output

Template-based reporting system with customizable HTML and text output, providing detailed comparison statistics, mismatch samples, and publication-ready reports.

```python { .api }
def render(template_name: str, **context: Any) -> str: ...
def save_html_report(report: str, html_file: str | Path) -> None: ...
def df_to_str(df: Any, sample_count: int | None, on_index: bool) -> str: ...
```

[Reporting System](./reporting.md)