# Multi-Backend DataFrame Comparison

Comparison classes for Polars, Spark, and Snowflake DataFrames, providing the same functionality as the Pandas comparison but optimized for each backend's specific characteristics and capabilities.

## Capabilities

### Polars DataFrame Comparison

High-performance DataFrame comparison for Polars, leveraging Polars' optimized computation engine while maintaining the same API as the Pandas comparison.

```python { .api }
class PolarsCompare(BaseCompare):
    """Comparison class for Polars DataFrames."""

    def __init__(
        self,
        df1: pl.DataFrame,
        df2: pl.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - df1: First Polars DataFrame to compare
        - df2: Second Polars DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```
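The `abs_tol` and `rel_tol` parameters combine into a single closeness test: two numeric values count as equal when their absolute difference is within `abs_tol` plus `rel_tol` scaled by the magnitude of the second value. A minimal pure-Python sketch of that rule (the actual backends vectorize this, and `values_close` is an illustrative name, not part of the datacompy API):

```python
def values_close(v1: float, v2: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Illustrative tolerance check: |v1 - v2| <= abs_tol + rel_tol * |v2|."""
    return abs(v1 - v2) <= abs_tol + rel_tol * abs(v2)

# abs_tol=0.1 treats 10.0 and 10.05 as a match
print(values_close(10.0, 10.05, abs_tol=0.1))    # True
# rel_tol=0.01 allows roughly a 1% difference relative to the second value
print(values_close(100.0, 100.9, rel_tol=0.01))  # True
print(values_close(100.0, 102.0, rel_tol=0.01))  # False
```

When `abs_tol` or `rel_tol` is a `Dict[str, float]`, the tolerance is looked up per column name rather than applied globally.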

#### Polars-Specific Properties

```python { .api }
@property
def df1(self) -> pl.DataFrame:
    """Get the first Polars dataframe."""

@property
def df2(self) -> pl.DataFrame:
    """Get the second Polars dataframe."""

# Attributes
df1_unq_rows: pl.DataFrame          # Rows only in df1
df2_unq_rows: pl.DataFrame          # Rows only in df2
intersect_rows: pl.DataFrame        # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```
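The `df1_unq_rows` / `df2_unq_rows` / `intersect_rows` split is conceptually an outer join on `join_columns`. A backend-agnostic sketch using plain dicts (illustrative only — each backend implements this with its own join machinery, and this sketch assumes unique join keys, whereas the real comparison also handles duplicate keys):

```python
def split_rows(rows1, rows2, key):
    """Partition two row sets by join key: unique to each side, plus shared pairs."""
    idx1 = {r[key]: r for r in rows1}
    idx2 = {r[key]: r for r in rows2}
    df1_unq = [r for k, r in idx1.items() if k not in idx2]      # anti-join: only in rows1
    df2_unq = [r for k, r in idx2.items() if k not in idx1]      # anti-join: only in rows2
    intersect = [(idx1[k], idx2[k]) for k in idx1 if k in idx2]  # inner join on the key
    return df1_unq, df2_unq, intersect

rows1 = [{'id': 1, 'value': 10.0}, {'id': 4, 'value': 40.0}]
rows2 = [{'id': 1, 'value': 10.1}, {'id': 5, 'value': 50.0}]
unq1, unq2, both = split_rows(rows1, rows2, 'id')
print([r['id'] for r in unq1])  # [4]
print([r['id'] for r in unq2])  # [5]
print(len(both))                # 1
```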

### Spark SQL DataFrame Comparison

Distributed DataFrame comparison for Spark SQL DataFrames, enabling comparison of large-scale datasets with Spark's distributed computing capabilities.

```python { .api }
class SparkSQLCompare(BaseCompare):
    """Comparison class for Spark SQL DataFrames."""

    def __init__(
        self,
        spark_session: pyspark.sql.SparkSession,
        df1: pyspark.sql.DataFrame,
        df2: pyspark.sql.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - spark_session: Active Spark session
        - df1: First Spark DataFrame to compare
        - df2: Second Spark DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```

#### Spark-Specific Properties

```python { .api }
@property
def df1(self) -> pyspark.sql.DataFrame:
    """Get the first Spark dataframe."""

@property
def df2(self) -> pyspark.sql.DataFrame:
    """Get the second Spark dataframe."""

# Attributes
df1_unq_rows: pyspark.sql.DataFrame    # Rows only in df1
df2_unq_rows: pyspark.sql.DataFrame    # Rows only in df2
intersect_rows: pyspark.sql.DataFrame  # Shared rows with match indicators
column_stats: List                     # Column comparison statistics
```
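Across all three backends, `ignore_spaces` and `ignore_case` normalize string values before they are compared. A hedged pure-Python sketch of the equivalent normalization (`normalize` is an illustrative name, not part of the datacompy API):

```python
def normalize(value: str, ignore_spaces: bool = False, ignore_case: bool = False) -> str:
    """Apply the string normalizations the comparison options imply."""
    if ignore_spaces:
        value = value.strip()   # drop leading/trailing whitespace
    if ignore_case:
        value = value.lower()   # case-fold for case-insensitive comparison
    return value

# With both options on, '  Active ' and 'active' compare equal
a = normalize('  Active ', ignore_spaces=True, ignore_case=True)
b = normalize('active', ignore_spaces=True, ignore_case=True)
print(a == b)  # True
```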

### Snowflake DataFrame Comparison

Cloud-native DataFrame comparison for Snowflake DataFrames via Snowpark, enabling comparison of data directly in Snowflake's cloud data platform.

```python { .api }
class SnowflakeCompare(BaseCompare):
    """Comparison class for Snowflake DataFrames."""

    def __init__(
        self,
        session: sp.Session,
        df1: Union[str, sp.DataFrame],
        df2: Union[str, sp.DataFrame],
        join_columns: List[str] | str | None = None,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str | None = None,
        df2_name: str | None = None,
        ignore_spaces: bool = False,
    ):
        """
        Parameters:
        - session: Snowflake session object
        - df1: First DataFrame or table name
        - df2: Second DataFrame or table name
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        """
```

#### Snowflake-Specific Properties

```python { .api }
@property
def df1(self) -> sp.DataFrame:
    """Get the first Snowpark dataframe."""

@property
def df2(self) -> sp.DataFrame:
    """Get the second Snowpark dataframe."""

# Attributes
df1_unq_rows: sp.DataFrame          # Rows only in df1
df2_unq_rows: sp.DataFrame          # Rows only in df2
intersect_rows: sp.DataFrame        # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```

### Common Methods

All multi-backend comparison classes share the same method signatures as the Pandas Compare class:

```python { .api }
# Column analysis
def df1_unq_columns(self) -> OrderedSet[str]: ...
def df2_unq_columns(self) -> OrderedSet[str]: ...
def intersect_columns(self) -> OrderedSet[str]: ...
def all_columns_match(self) -> bool: ...

# Row analysis
def all_rows_overlap(self) -> bool: ...
def count_matching_rows(self) -> int: ...
def intersect_rows_match(self) -> bool: ...

# Matching validation
def matches(self, ignore_extra_columns: bool = False) -> bool: ...
def subset(self) -> bool: ...

# Mismatch analysis
def sample_mismatch(self, column: str, sample_count: int = 10, for_display: bool = False) -> Any: ...
def all_mismatch(self, ignore_matching_cols: bool = False) -> Any: ...

# Reporting
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None,
) -> str: ...
```
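`matches()` is, roughly, the conjunction of the checks above: the column sets agree (unless extra columns are ignored), no rows are unique to either side, and all intersecting values compare equal within tolerance. A simplified boolean sketch of how the verdict combines (not the actual implementation, which works from the joined data):

```python
def overall_match(columns_match: bool, rows_overlap: bool, values_match: bool,
                  ignore_extra_columns: bool = False) -> bool:
    """Combine the three comparison dimensions into a single pass/fail verdict."""
    column_ok = columns_match or ignore_extra_columns
    return column_ok and rows_overlap and values_match

print(overall_match(True, True, True))                              # True
print(overall_match(False, True, True))                             # False
print(overall_match(False, True, True, ignore_extra_columns=True))  # True
```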

## Usage Examples

### Polars DataFrame Comparison

```python
import polars as pl
import datacompy

# Create Polars DataFrames
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pl.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Compare with Polars
compare = datacompy.PolarsCompare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

### Spark DataFrame Comparison

```python
from pyspark.sql import SparkSession
import datacompy

# Initialize Spark session
spark = SparkSession.builder.appName("DataComPy").getOrCreate()

# Create Spark DataFrames
df1 = spark.createDataFrame([
    (1, 10.0, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (4, 40.0, 'active')
], ['id', 'value', 'status'])

df2 = spark.createDataFrame([
    (1, 10.1, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (5, 50.0, 'pending')
], ['id', 'value', 'status'])

# Compare with Spark
compare = datacompy.SparkSQLCompare(
    spark, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

### Snowflake DataFrame Comparison

```python
from snowflake.snowpark import Session
import datacompy

# Create Snowflake session
session = Session.builder.configs({
    'account': 'your_account',
    'user': 'your_user',
    'password': 'your_password',
    'database': 'your_database',
    'schema': 'your_schema'
}).create()

# Compare tables directly by name
compare = datacompy.SnowflakeCompare(
    session,
    df1='table1',  # Table name
    df2='table2',  # Table name
    join_columns=['id'],
    abs_tol=0.1
)

# Or compare DataFrame objects
df1 = session.table('table1')
df2 = session.table('table2')

compare = datacompy.SnowflakeCompare(
    session, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

## Backend-Specific Considerations

### Polars Optimizations
- Leverages Polars' lazy evaluation for memory efficiency
- Optimized string handling with native Polars string operations
- Type system mapping for accurate comparisons

### Spark Distributed Processing
- Comparison operations distributed across the Spark cluster
- Optimized join strategies for large datasets
- Checkpoint support for iterative operations

### Snowflake Cloud Integration
- Pushdown of operations to Snowflake compute
- Direct table name support without loading data locally
- Integration with Snowflake's native data types and functions