
# Multi-Backend DataFrame Comparison

Comparison classes for Polars, Spark, and Snowflake DataFrames, providing the same functionality as the Pandas comparison but optimized for each backend's specific characteristics and capabilities.

## Capabilities

### Polars DataFrame Comparison

High-performance DataFrame comparison for Polars, leveraging Polars' optimized computation engine while maintaining the same API as the Pandas comparison.

```python { .api }
class PolarsCompare(BaseCompare):
    """Comparison class for Polars DataFrames."""

    def __init__(
        self,
        df1: pl.DataFrame,
        df2: pl.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - df1: First Polars DataFrame to compare
        - df2: Second Polars DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```

#### Polars-Specific Properties

```python { .api }
@property
def df1(self) -> pl.DataFrame:
    """Get the first Polars dataframe."""

@property
def df2(self) -> pl.DataFrame:
    """Get the second Polars dataframe."""

# Attributes
df1_unq_rows: pl.DataFrame  # Rows only in df1
df2_unq_rows: pl.DataFrame  # Rows only in df2
intersect_rows: pl.DataFrame  # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```
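Once a comparison has run, these attributes can be inspected directly as Polars DataFrames. A minimal sketch, assuming `compare` is a constructed `PolarsCompare` (as in the usage example later in this document); the `column_stats` dict keys shown are an assumption and may vary by version:

```python
# `compare` is assumed to be a PolarsCompare joined on "id"
# (see the Polars usage example below).
print(compare.df1_unq_rows.shape)    # rows present only in df1
print(compare.df2_unq_rows.shape)    # rows present only in df2
print(compare.intersect_rows.shape)  # joined rows plus per-column match flags

# Per-column summaries; the exact dict keys are an assumption here.
for stat in compare.column_stats:
    print(stat.get("column"), stat.get("unequal_cnt"))
```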

### Spark SQL DataFrame Comparison

Distributed DataFrame comparison for Spark SQL DataFrames, enabling comparison of large-scale datasets with Spark's distributed computing capabilities.

```python { .api }
class SparkSQLCompare(BaseCompare):
    """Comparison class for Spark SQL DataFrames."""

    def __init__(
        self,
        spark_session: pyspark.sql.SparkSession,
        df1: pyspark.sql.DataFrame,
        df2: pyspark.sql.DataFrame,
        join_columns: List[str] | str,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True,
    ):
        """
        Parameters:
        - spark_session: Active Spark session
        - df1: First Spark DataFrame to compare
        - df2: Second Spark DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```

#### Spark-Specific Properties

```python { .api }
@property
def df1(self) -> pyspark.sql.DataFrame:
    """Get the first Spark dataframe."""

@property
def df2(self) -> pyspark.sql.DataFrame:
    """Get the second Spark dataframe."""

# Attributes
df1_unq_rows: pyspark.sql.DataFrame  # Rows only in df1
df2_unq_rows: pyspark.sql.DataFrame  # Rows only in df2
intersect_rows: pyspark.sql.DataFrame  # Shared rows with match indicators
column_stats: List  # Column comparison statistics
```
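Since these results are ordinary Spark DataFrames, standard Spark actions and transformations apply. A minimal sketch, assuming `compare` is a constructed `SparkSQLCompare` (see the Spark usage example below); the `value_match` indicator column name is an assumption for illustration:

```python
# `compare` is assumed to be a SparkSQLCompare joined on "id"
# (see the Spark usage example below).
compare.df1_unq_rows.show(5)         # preview rows present only in df1
print(compare.df2_unq_rows.count())  # count rows present only in df2

# intersect_rows is distributed; cache it before repeated inspection.
compare.intersect_rows.cache()
# "value_match" is an assumed match-indicator column name for a "value" column.
compare.intersect_rows.filter("value_match = false").show(5)
```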

### Snowflake DataFrame Comparison

Cloud-native DataFrame comparison for Snowflake DataFrames via Snowpark, enabling comparison of data directly in Snowflake's cloud data platform.

```python { .api }
class SnowflakeCompare(BaseCompare):
    """Comparison class for Snowflake DataFrames."""

    def __init__(
        self,
        session: sp.Session,
        df1: Union[str, sp.DataFrame],
        df2: Union[str, sp.DataFrame],
        join_columns: List[str] | str | None = None,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str | None = None,
        df2_name: str | None = None,
        ignore_spaces: bool = False,
    ):
        """
        Parameters:
        - session: Snowflake session object
        - df1: First DataFrame or table name
        - df2: Second DataFrame or table name
        - join_columns: Column(s) to join dataframes on
        - abs_tol: Absolute tolerance for numeric comparisons
        - rel_tol: Relative tolerance for numeric comparisons
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        """
```

#### Snowflake-Specific Properties

```python { .api }
@property
def df1(self) -> sp.DataFrame:
    """Get the first Snowpark dataframe."""

@property
def df2(self) -> sp.DataFrame:
    """Get the second Snowpark dataframe."""

# Attributes
df1_unq_rows: sp.DataFrame  # Rows only in df1
df2_unq_rows: sp.DataFrame  # Rows only in df2
intersect_rows: sp.DataFrame  # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column comparison statistics
```
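Snowpark DataFrames evaluate lazily inside Snowflake, so results can be previewed server-side or materialized locally. A minimal sketch, assuming `compare` is a constructed `SnowflakeCompare` (see the Snowflake usage example below):

```python
# `compare` is assumed to be a SnowflakeCompare joined on "id"
# (see the Snowflake usage example below).
compare.df1_unq_rows.show(5)              # preview in Snowflake, no full fetch
local = compare.df2_unq_rows.to_pandas()  # materialize locally as pandas
print(len(local))
```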

### Common Methods

All multi-backend comparison classes share the same method signatures as the Pandas Compare class:

```python { .api }
# Column analysis
def df1_unq_columns(self) -> OrderedSet[str]: ...
def df2_unq_columns(self) -> OrderedSet[str]: ...
def intersect_columns(self) -> OrderedSet[str]: ...
def all_columns_match(self) -> bool: ...

# Row analysis
def all_rows_overlap(self) -> bool: ...
def count_matching_rows(self) -> int: ...
def intersect_rows_match(self) -> bool: ...

# Matching validation
def matches(self, ignore_extra_columns: bool = False) -> bool: ...
def subset(self) -> bool: ...

# Mismatch analysis
def sample_mismatch(self, column: str, sample_count: int = 10, for_display: bool = False) -> Any: ...
def all_mismatch(self, ignore_matching_cols: bool = False) -> Any: ...

# Reporting
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None,
) -> str: ...
```
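Because these methods behave identically across backends, validation code can stay backend-agnostic. A short sketch, assuming `compare` is any of the comparison objects constructed in the usage examples below and that the frames contain a `value` column:

```python
# Works the same for PolarsCompare, SparkSQLCompare, and SnowflakeCompare.
print(compare.all_columns_match())    # do both frames have identical columns?
print(compare.intersect_columns())    # columns present in both frames
print(compare.count_matching_rows())  # joined rows whose compared values all match

if not compare.matches():
    # Sample up to 10 mismatching rows for the assumed "value" column.
    print(compare.sample_mismatch("value", sample_count=10))
```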

## Usage Examples

### Polars DataFrame Comparison

```python
import polars as pl
import datacompy

# Create Polars DataFrames
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pl.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Compare with Polars
compare = datacompy.PolarsCompare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```
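Because `abs_tol` and `rel_tol` also accept per-column dictionaries, tolerance can be tuned field by field. A sketch reusing the frames above; the fallback for columns not listed in the dict is assumed to be exact comparison:

```python
# Per-column tolerance: allow 0.5 absolute difference on "value" only.
compare_percol = datacompy.PolarsCompare(
    df1, df2,
    join_columns=['id'],
    abs_tol={'value': 0.5},
)
print(compare_percol.matches())
```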

### Spark DataFrame Comparison

```python
from pyspark.sql import SparkSession
import datacompy

# Initialize Spark session
spark = SparkSession.builder.appName("DataComPy").getOrCreate()

# Create Spark DataFrames
df1 = spark.createDataFrame([
    (1, 10.0, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (4, 40.0, 'active')
], ['id', 'value', 'status'])

df2 = spark.createDataFrame([
    (1, 10.1, 'active'),
    (2, 20.0, 'active'),
    (3, 30.0, 'inactive'),
    (5, 50.0, 'pending')
], ['id', 'value', 'status'])

# Compare with Spark
compare = datacompy.SparkSQLCompare(
    spark, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```
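The shared `report` method can also write an HTML copy via its `html_file` parameter. A sketch using the comparison above; the output path is illustrative:

```python
# Render the text report and also save an HTML version (illustrative path).
text_report = compare.report(
    sample_count=10,
    html_file='comparison_report.html',
)
print(text_report)
```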

### Snowflake DataFrame Comparison

```python
from snowflake.snowpark import Session
import datacompy

# Create Snowflake session
session = Session.builder.configs({
    'account': 'your_account',
    'user': 'your_user',
    'password': 'your_password',
    'database': 'your_database',
    'schema': 'your_schema'
}).create()

# Compare tables directly by name
compare = datacompy.SnowflakeCompare(
    session,
    df1='table1',  # Table name
    df2='table2',  # Table name
    join_columns=['id'],
    abs_tol=0.1
)

# Or compare DataFrame objects
df1 = session.table('table1')
df2 = session.table('table2')

compare = datacompy.SnowflakeCompare(
    session, df1, df2,
    join_columns=['id'],
    abs_tol=0.1
)

print(f"DataFrames match: {compare.matches()}")
print(compare.report())
```

## Backend-Specific Considerations

### Polars Optimizations

- Leverages Polars' lazy evaluation for memory efficiency
- Optimized string handling with native Polars string operations
- Type system mapping for accurate comparisons

### Spark Distributed Processing

- Comparison operations distributed across the Spark cluster
- Optimized join strategies for large datasets
- Checkpoint support for iterative operations (see the sketch below)
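Spark requires a checkpoint directory to be configured before any checkpoint-based operation can run. A minimal sketch using the standard Spark API; the directory path is illustrative:

```python
# Configure a checkpoint directory once per session before checkpointing
# (illustrative local path; prefer durable storage such as HDFS or S3 in production).
spark.sparkContext.setCheckpointDir('/tmp/datacompy_checkpoints')
```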

### Snowflake Cloud Integration

- Pushdown of operations to Snowflake compute
- Direct table name support without loading data
- Integration with Snowflake's native data types and functions