# Multi-Backend DataFrame Comparison

Comparison classes for Polars, Spark, and Snowflake DataFrames, providing the same functionality as the Pandas comparison but optimized for each backend's specific characteristics and capabilities.

## Capabilities

### Polars DataFrame Comparison

High-performance DataFrame comparison for Polars, leveraging Polars' optimized computation engine while maintaining the same API as the Pandas comparison.

```python { .api }
class PolarsCompare(BaseCompare):
    """Comparison class for Polars DataFrames."""

    def __init__(
        self,
        df1: pl.DataFrame,
        df2: pl.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - df1: First Polars DataFrame to compare
        - df2: Second Polars DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```
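The `abs_tol` and `rel_tol` parameters combine into a single closeness test: two numeric values count as equal when their absolute difference is within `abs_tol` plus `rel_tol` scaled by the magnitude of the second value. A minimal pure-Python sketch of that rule (the actual backends vectorize this, and `values_close` is an illustrative name, not part of the datacompy API):

```python
def values_close(v1: float, v2: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Illustrative tolerance check: |v1 - v2| <= abs_tol + rel_tol * |v2|."""
    return abs(v1 - v2) <= abs_tol + rel_tol * abs(v2)

# abs_tol=0.1 treats 10.0 and 10.05 as a match
print(values_close(10.0, 10.05, abs_tol=0.1))    # True
# rel_tol=0.01 allows roughly a 1% difference relative to the second value
print(values_close(100.0, 100.9, rel_tol=0.01))  # True
print(values_close(100.0, 102.0, rel_tol=0.01))  # False
```

When `abs_tol` or `rel_tol` is a `Dict[str, float]`, the tolerance is looked up per column name rather than applied globally.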

#### Polars-Specific Properties

```python { .api }
@property
def df1(self) -> pl.DataFrame:
    """Get the first Polars dataframe."""

@property
def df2(self) -> pl.DataFrame:
    """Get the second Polars dataframe."""

# Attributes
df1_unq_rows: pl.DataFrame          # Rows only in df1
df2_unq_rows: pl.DataFrame          # Rows only in df2
intersect_rows: pl.DataFrame        # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```
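The `df1_unq_rows` / `df2_unq_rows` / `intersect_rows` split is conceptually an outer join on `join_columns`. A backend-agnostic sketch using plain dicts (illustrative only — each backend implements this with its own join machinery, and this sketch assumes unique join keys, whereas the real comparison also handles duplicate keys):

```python
def split_rows(rows1, rows2, key):
    """Partition two row sets by join key: unique to each side, plus shared pairs."""
    idx1 = {r[key]: r for r in rows1}
    idx2 = {r[key]: r for r in rows2}
    df1_unq = [r for k, r in idx1.items() if k not in idx2]      # anti-join: only in rows1
    df2_unq = [r for k, r in idx2.items() if k not in idx1]      # anti-join: only in rows2
    intersect = [(idx1[k], idx2[k]) for k in idx1 if k in idx2]  # inner join on the key
    return df1_unq, df2_unq, intersect

rows1 = [{'id': 1, 'value': 10.0}, {'id': 4, 'value': 40.0}]
rows2 = [{'id': 1, 'value': 10.1}, {'id': 5, 'value': 50.0}]
unq1, unq2, both = split_rows(rows1, rows2, 'id')
print([r['id'] for r in unq1])  # [4]
print([r['id'] for r in unq2])  # [5]
print(len(both))                # 1
```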

### Spark SQL DataFrame Comparison

Distributed DataFrame comparison for Spark SQL DataFrames, enabling comparison of large-scale datasets with Spark's distributed computing capabilities.

```python { .api }
class SparkSQLCompare(BaseCompare):
    """Comparison class for Spark SQL DataFrames."""

    def __init__(
        self,
        spark_session: pyspark.sql.SparkSession,
        df1: pyspark.sql.DataFrame,
        df2: pyspark.sql.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - spark_session: Active Spark session
        - df1: First Spark DataFrame to compare
        - df2: Second Spark DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```

#### Spark-Specific Properties

```python { .api }
@property
def df1(self) -> pyspark.sql.DataFrame:
    """Get the first Spark dataframe."""

@property
def df2(self) -> pyspark.sql.DataFrame:
    """Get the second Spark dataframe."""

# Attributes
df1_unq_rows: pyspark.sql.DataFrame    # Rows only in df1
df2_unq_rows: pyspark.sql.DataFrame    # Rows only in df2
intersect_rows: pyspark.sql.DataFrame  # Shared rows with match indicators
column_stats: List                     # Column comparison statistics
```
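Across all three backends, `ignore_spaces` and `ignore_case` normalize string values before they are compared. A hedged pure-Python sketch of the equivalent normalization (`normalize` is an illustrative name, not part of the datacompy API):

```python
def normalize(value: str, ignore_spaces: bool = False, ignore_case: bool = False) -> str:
    """Apply the string normalizations the comparison options imply."""
    if ignore_spaces:
        value = value.strip()   # drop leading/trailing whitespace
    if ignore_case:
        value = value.lower()   # case-fold for case-insensitive comparison
    return value

# With both options on, '  Active ' and 'active' compare equal
a = normalize('  Active ', ignore_spaces=True, ignore_case=True)
b = normalize('active', ignore_spaces=True, ignore_case=True)
print(a == b)  # True
```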

### Snowflake DataFrame Comparison

Cloud-native DataFrame comparison for Snowflake DataFrames via Snowpark, enabling comparison of data directly in Snowflake's cloud data platform.

```python { .api }
class SnowflakeCompare(BaseCompare):
    """Comparison class for Snowflake DataFrames."""

    def __init__(
        self,
        session: sp.Session,
        df1: Union[str, sp.DataFrame],
        df2: Union[str, sp.DataFrame],
        join_columns: List[str] | str | None = None,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str | None = None,
        df2_name: str | None = None,
        ignore_spaces: bool = False,
    ):
        """
        Parameters:
        - session: Snowflake session object
        - df1: First DataFrame or table name
        - df2: Second DataFrame or table name
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        """
```

#### Snowflake-Specific Properties

```python { .api }
@property
def df1(self) -> sp.DataFrame:
    """Get the first Snowpark dataframe."""

@property
def df2(self) -> sp.DataFrame:
    """Get the second Snowpark dataframe."""

# Attributes
df1_unq_rows: sp.DataFrame          # Rows only in df1
df2_unq_rows: sp.DataFrame          # Rows only in df2
intersect_rows: sp.DataFrame        # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```

### Common Methods

All multi-backend comparison classes share the same method signatures as the Pandas Compare class:

```python { .api }
# Column analysis
def df1_unq_columns(self) -> OrderedSet[str]: ...
def df2_unq_columns(self) -> OrderedSet[str]: ...
def intersect_columns(self) -> OrderedSet[str]: ...
def all_columns_match(self) -> bool: ...

# Row analysis
def all_rows_overlap(self) -> bool: ...
def count_matching_rows(self) -> int: ...
def intersect_rows_match(self) -> bool: ...

# Matching validation
def matches(self, ignore_extra_columns: bool = False) -> bool: ...
def subset(self) -> bool: ...

# Mismatch analysis
def sample_mismatch(self, column: str, sample_count: int = 10, for_display: bool = False) -> Any: ...
def all_mismatch(self, ignore_matching_cols: bool = False) -> Any: ...

# Reporting
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None,
) -> str: ...
```
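`matches()` is, roughly, the conjunction of the checks above: the column sets agree (unless extra columns are ignored), no rows are unique to either side, and all intersecting values compare equal within tolerance. A simplified boolean sketch of how the verdict combines (not the actual implementation, which works from the joined data):

```python
def overall_match(columns_match: bool, rows_overlap: bool, values_match: bool,
                  ignore_extra_columns: bool = False) -> bool:
    """Combine the three comparison dimensions into a single pass/fail verdict."""
    column_ok = columns_match or ignore_extra_columns
    return column_ok and rows_overlap and values_match

print(overall_match(True, True, True))                              # True
print(overall_match(False, True, True))                             # False
print(overall_match(False, True, True, ignore_extra_columns=True))  # True
```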

## Usage Examples

### Polars DataFrame Comparison

```python
import polars as pl
import datacompy

# Create Polars DataFrames
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pl.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Compare with Polars
compare = datacompy.PolarsCompare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

### Spark DataFrame Comparison

```python
from pyspark.sql import SparkSession
import datacompy

# Initialize Spark session
spark = SparkSession.builder.appName("DataComPy").getOrCreate()

# Create Spark DataFrames
df1 = spark.createDataFrame([
    (1, 10.0, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (4, 40.0, 'active')
], ['id', 'value', 'status'])

df2 = spark.createDataFrame([
    (1, 10.1, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (5, 50.0, 'pending')
], ['id', 'value', 'status'])

# Compare with Spark
compare = datacompy.SparkSQLCompare(
    spark, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

### Snowflake DataFrame Comparison

```python
from snowflake.snowpark import Session
import datacompy

# Create Snowflake session
session = Session.builder.configs({
    'account': 'your_account',
    'user': 'your_user',
    'password': 'your_password',
    'database': 'your_database',
    'schema': 'your_schema'
}).create()

# Compare tables directly by name
compare = datacompy.SnowflakeCompare(
    session,
    df1='table1',  # Table name
    df2='table2',  # Table name
    join_columns=['id'],
    abs_tol=0.1
)

# Or compare DataFrame objects
df1 = session.table('table1')
df2 = session.table('table2')

compare = datacompy.SnowflakeCompare(
    session, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

## Backend-Specific Considerations

### Polars Optimizations
- Leverages Polars' lazy evaluation for memory efficiency
- Optimized string handling with native Polars string operations
- Type system mapping for accurate comparisons

### Spark Distributed Processing
- Comparison operations distributed across the Spark cluster
- Optimized join strategies for large datasets
- Checkpoint support for iterative operations

### Snowflake Cloud Integration
- Pushdown of operations to Snowflake compute
- Direct table name support without loading data locally
- Integration with Snowflake's native data types and functions