
# Column-Level Comparison Utilities

Low-level functions for comparing individual columns and performing specialized comparisons, useful for custom comparison logic and integration with other data processing workflows.

## Capabilities

### Pandas Column Comparison

Functions for comparing individual Pandas Series with tolerance support and specialized handling for different data types.

```python { .api }
def columns_equal(
    col_1: pd.Series[Any],
    col_2: pd.Series[Any],
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False
) -> pd.Series[bool]:
    """
    Compare two Pandas Series element-wise.

    Parameters:
    - col_1: First Series to compare
    - col_2: Second Series to compare
    - rel_tol: Relative tolerance for numeric comparisons
    - abs_tol: Absolute tolerance for numeric comparisons
    - ignore_spaces: Strip whitespace from string values
    - ignore_case: Ignore case in string comparisons

    Returns:
    Boolean Series indicating element-wise equality
    """

def compare_string_and_date_columns(
    col_1: pd.Series[Any],
    col_2: pd.Series[Any]
) -> pd.Series[bool]:
    """
    Compare string and date columns with specialized logic.

    Parameters:
    - col_1: First Series (string or date type)
    - col_2: Second Series (string or date type)

    Returns:
    Boolean Series indicating element-wise equality
    """

def calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) -> float:
    """
    Calculate maximum absolute difference between numeric columns.

    Parameters:
    - col_1: First numeric Series
    - col_2: Second numeric Series

    Returns:
    Maximum absolute difference as float
    """
```
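
For intuition, the numeric tolerance check can be sketched in plain pandas/NumPy. This is a minimal illustration of `numpy.isclose`-style semantics (`|a - b| <= abs_tol + rel_tol * |b|`, with a null/null pair counted as equal), not datacompy's actual implementation; the helper name `approx_equal` is ours:

```python
import numpy as np
import pandas as pd

def approx_equal(col_1: pd.Series, col_2: pd.Series,
                 rel_tol: float = 0, abs_tol: float = 0) -> pd.Series:
    """Element-wise numeric equality with tolerances (illustrative sketch)."""
    # numpy.isclose treats a NaN pair as unequal, so handle both-null explicitly
    both_null = col_1.isna() & col_2.isna()
    close = pd.Series(
        np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=False),
        index=col_1.index,
    )
    return close | both_null

col_a = pd.Series([1.0, 2.0, 3.0, None])
col_b = pd.Series([1.05, 2.0, 3.5, None])
print(approx_equal(col_a, col_b, abs_tol=0.1).tolist())  # [True, True, False, True]
```

Because the relative term scales with the magnitude of the second column, `rel_tol` is the natural choice for large values while `abs_tol` handles values near zero.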

### DataFrame Utility Functions

Helper functions for DataFrame manipulation and analysis during comparison operations.

```python { .api }
def get_merged_columns(
    original_df: pd.DataFrame,
    merged_df: pd.DataFrame,
    suffix: str
) -> List[str]:
    """
    Get column names from merged DataFrame with specific suffix.

    Parameters:
    - original_df: Original DataFrame before merge
    - merged_df: DataFrame after merge operation
    - suffix: Suffix used in merge operation

    Returns:
    List of column names with the specified suffix
    """

def generate_id_within_group(
    dataframe: pd.DataFrame,
    join_columns: List[str]
) -> pd.Series[int]:
    """
    Generate unique identifiers within groups for deduplication.

    Parameters:
    - dataframe: DataFrame to generate IDs for
    - join_columns: Columns defining the groups

    Returns:
    Series of integer IDs unique within each group
    """
```
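
The group-ID idea can be reproduced with a plain `groupby(...).cumcount()`. The sketch below shows the core trick only (datacompy's real implementation may also special-case nulls in the join columns, which is skipped here), and the helper name `id_within_group` is ours:

```python
import pandas as pd

def id_within_group(dataframe: pd.DataFrame, join_columns: list) -> pd.Series:
    # Number rows 0, 1, 2, ... within each group defined by join_columns,
    # so duplicate join keys become distinguishable before merging.
    return dataframe.groupby(join_columns).cumcount()

df = pd.DataFrame({"group": ["A", "A", "B", "B", "A"], "value": [1, 2, 3, 4, 5]})
print(id_within_group(df, ["group"]).tolist())  # [0, 1, 0, 1, 2]
```

Appending this ID to the join columns turns duplicate keys into unique composite keys, which is what makes a clean 1:1 merge possible.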

### String Processing Utilities

Functions for normalizing string data during comparisons.

```python { .api }
def normalize_string_column(
    column: pd.Series,
    ignore_spaces: bool,
    ignore_case: bool
) -> pd.Series:
    """
    Normalize string column for comparison.

    Parameters:
    - column: String Series to normalize
    - ignore_spaces: Strip leading/trailing whitespace
    - ignore_case: Convert to lowercase

    Returns:
    Normalized string Series
    """
```
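
The normalization itself amounts to conditional `str.strip()` / `str.lower()` calls. A minimal stand-alone sketch (our own helper, not datacompy's exact code):

```python
import pandas as pd

def normalize(column: pd.Series, ignore_spaces: bool, ignore_case: bool) -> pd.Series:
    # Apply only the requested normalizations; null entries pass through unchanged.
    if ignore_spaces:
        column = column.str.strip()
    if ignore_case:
        column = column.str.lower()
    return column

raw = pd.Series(["  Alice ", "BOB", "charlie"])
print(normalize(raw, ignore_spaces=True, ignore_case=True).tolist())
# ['alice', 'bob', 'charlie']
```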

### Polars Column Comparison

Specialized functions for comparing Polars Series with optimized performance.

```python { .api }
def columns_equal(
    col_1: pl.Series,
    col_2: pl.Series,
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False
) -> pl.Series:
    """
    Compare two Polars Series element-wise.

    Parameters:
    - col_1: First Polars Series to compare
    - col_2: Second Polars Series to compare
    - rel_tol: Relative tolerance for numeric comparisons
    - abs_tol: Absolute tolerance for numeric comparisons
    - ignore_spaces: Strip whitespace from string values
    - ignore_case: Ignore case in string comparisons

    Returns:
    Boolean Polars Series indicating element-wise equality
    """

def calculate_max_diff(col_1: pl.Series, col_2: pl.Series) -> float:
    """
    Calculate maximum absolute difference between numeric Polars Series.

    Parameters:
    - col_1: First numeric Polars Series
    - col_2: Second numeric Polars Series

    Returns:
    Maximum absolute difference as float
    """

def normalize_string_column(
    column: pl.Series,
    ignore_spaces: bool,
    ignore_case: bool
) -> pl.Series:
    """
    Normalize Polars string Series for comparison.

    Parameters:
    - column: String Polars Series to normalize
    - ignore_spaces: Strip leading/trailing whitespace
    - ignore_case: Convert to lowercase

    Returns:
    Normalized string Polars Series
    """
```

### Spark Column Comparison

Functions for comparing Spark DataFrame columns using SQL expressions.

```python { .api }
def columns_equal(
    dataframe: pyspark.sql.DataFrame,
    col_1: str,
    col_2: str,
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False
) -> pyspark.sql.Column:
    """
    Create Spark SQL Column expression for comparing two columns.

    Parameters:
    - dataframe: Spark DataFrame containing the columns
    - col_1: Name of first column to compare
    - col_2: Name of second column to compare
    - rel_tol: Relative tolerance for numeric comparisons
    - abs_tol: Absolute tolerance for numeric comparisons
    - ignore_spaces: Strip whitespace from string values
    - ignore_case: Ignore case in string comparisons

    Returns:
    Spark Column expression evaluating to boolean
    """

def calculate_max_diff(
    dataframe: pyspark.sql.DataFrame,
    col_1: str,
    col_2: str
) -> float:
    """
    Calculate maximum absolute difference between numeric columns.

    Parameters:
    - dataframe: Spark DataFrame containing the columns
    - col_1: Name of first column
    - col_2: Name of second column

    Returns:
    Maximum absolute difference as float
    """

def calculate_null_diff(
    dataframe: pyspark.sql.DataFrame,
    col_1: str,
    col_2: str
) -> int:
    """
    Calculate differences in null values between columns.

    Parameters:
    - dataframe: Spark DataFrame containing the columns
    - col_1: Name of first column
    - col_2: Name of second column

    Returns:
    Count of null value differences as int
    """
```

### Snowflake Column Comparison

Functions for comparing Snowflake DataFrame columns with cloud-optimized processing.

```python { .api }
def columns_equal(
    dataframe: sp.DataFrame,
    col_1: str,
    col_2: str,
    col_match: str,
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False
) -> sp.DataFrame:
    """
    Compare columns in Snowflake DataFrame.

    Parameters:
    - dataframe: Snowpark DataFrame containing the columns
    - col_1: Name of first column to compare
    - col_2: Name of second column to compare
    - col_match: Name of column to store match results
    - rel_tol: Relative tolerance for numeric comparisons
    - abs_tol: Absolute tolerance for numeric comparisons
    - ignore_spaces: Strip whitespace from string values

    Returns:
    Snowpark DataFrame with comparison results
    """

def calculate_max_diff(
    dataframe: sp.DataFrame,
    col_1: str,
    col_2: str
) -> float:
    """
    Calculate maximum absolute difference between numeric columns.

    Parameters:
    - dataframe: Snowpark DataFrame containing the columns
    - col_1: Name of first column
    - col_2: Name of second column

    Returns:
    Maximum absolute difference as float
    """

def calculate_null_diff(
    dataframe: sp.DataFrame,
    col_1: str,
    col_2: str
) -> int:
    """
    Calculate differences in null values between columns.

    Parameters:
    - dataframe: Snowpark DataFrame containing the columns
    - col_1: Name of first column
    - col_2: Name of second column

    Returns:
    Count of null value differences as int
    """
```

## Usage Examples

### Basic Column Comparison

```python
import pandas as pd
import datacompy

# Create test Series
col1 = pd.Series([1.0, 2.0, 3.0, 4.0, None])
col2 = pd.Series([1.1, 2.0, 3.2, 4.0, None])

# Compare with tolerance
matches = datacompy.columns_equal(col1, col2, abs_tol=0.1)
print(matches)  # [True, True, False, True, True]

# Calculate maximum difference
max_diff = datacompy.calculate_max_diff(col1, col2)
print(f"Maximum difference: {max_diff}")  # 0.2
```

### String Column Comparison

```python
import pandas as pd
import datacompy

# String data with case and space variations
col1 = pd.Series(['Alice', 'Bob ', 'CHARLIE', 'david'])
col2 = pd.Series(['alice', 'Bob', 'charlie', 'David'])

# Case-sensitive comparison
strict_matches = datacompy.columns_equal(col1, col2)
print(strict_matches)  # [False, False, False, False]

# Case-insensitive with space normalization
flexible_matches = datacompy.columns_equal(
    col1, col2,
    ignore_case=True,
    ignore_spaces=True
)
print(flexible_matches)  # [True, True, True, True]
```

### Polars Column Operations

```python
import polars as pl
import datacompy

# Create Polars Series
col1 = pl.Series([1.0, 2.0, 3.0, 4.0])
col2 = pl.Series([1.1, 2.0, 3.2, 4.0])

# Compare with tolerance
matches = datacompy.columns_equal(col1, col2, abs_tol=0.1)
print(matches)

# String normalization
str_col = pl.Series([' Alice ', 'BOB', 'charlie'])
normalized = datacompy.normalize_string_column(
    str_col,
    ignore_spaces=True,
    ignore_case=True
)
print(normalized)  # ['alice', 'bob', 'charlie']
```

### Spark Column Expressions

```python
from pyspark.sql import SparkSession
import datacompy

spark = SparkSession.builder.appName("DataComPy").getOrCreate()

# Create DataFrame
df = spark.createDataFrame([
    (1, 1.0, 1.1),
    (2, 2.0, 2.0),
    (3, 3.0, 3.2)
], ['id', 'col1', 'col2'])

# Create comparison expression
comparison_expr = datacompy.columns_equal(
    df, 'col1', 'col2',
    abs_tol=0.1
)

# Apply comparison
result = df.withColumn('matches', comparison_expr)
result.show()

# Calculate maximum difference
max_diff = datacompy.calculate_max_diff(df, 'col1', 'col2')
print(f"Max difference: {max_diff}")
```

### DataFrame Utility Usage

```python
import pandas as pd
import datacompy

# Example DataFrame with duplicates
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'A'],
    'value': [1, 2, 3, 4, 5]
})

# Generate unique IDs within each group
ids = datacompy.generate_id_within_group(df, ['group'])
print(ids)  # [0, 1, 0, 1, 2]

# Example of merged column extraction: 'b' appears in both frames, so the
# merge suffixes it; the join key 'a' keeps its name in the merged result
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [5, 6]})
merged = df1.merge(df2, on='a', suffixes=('_left', '_right'))

left_cols = datacompy.get_merged_columns(df1, merged, '_left')
print(left_cols)  # ['a', 'b_left']
```

## Type Constants

```python { .api }
# Polars type identifiers
STRING_TYPE: List[str] = ["String", "Utf8"]
LIST_TYPE: List[str] = ["List", "Array"]

# Spark numeric types (internal)
NUMERIC_SPARK_TYPES: List  # Internal Spark numeric type list

# Snowflake numeric types (internal)
NUMERIC_SNOWPARK_TYPES: List  # Internal Snowpark numeric type list
```

These utility functions provide the building blocks for custom comparison logic and can be combined to create specialized comparison workflows tailored to specific data analysis needs.