tessl/pypi-datacompy

Comprehensive DataFrame comparison library providing functionality equivalent to SAS's PROC COMPARE for Python with support for Pandas, Spark, Polars, Snowflake, and distributed computing

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/datacompy@0.18.x

To install, run

npx @tessl/cli install tessl/pypi-datacompy@0.18.0

# DataComPy

DataComPy is a comprehensive DataFrame comparison library that provides functionality equivalent to SAS's PROC COMPARE for Python data analysis workflows. It supports comparison across multiple DataFrame backends including Pandas, Spark, Polars, Snowflake (via Snowpark), Dask (via Fugue), and DuckDB (via Fugue), making it a versatile tool for data validation and quality assurance.

## Package Information

- **Package Name**: datacompy
- **Language**: Python
- **Installation**: `pip install datacompy`
- **Optional dependencies**:
  - Spark: `pip install datacompy[spark]`
  - Fugue (Dask/DuckDB): `pip install datacompy[fugue]`
  - Snowflake: `pip install datacompy[snowflake]`

## Core Imports

```python
import datacompy
```

Specific comparison classes:

```python
from datacompy import Compare, PolarsCompare, SparkSQLCompare, SnowflakeCompare
```

Utility functions:

```python
from datacompy import columns_equal, is_match, report, all_columns_match
```

## Basic Usage

```python
import pandas as pd
import datacompy

# Create sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'score': [85.5, 92.0, 78.5, 91.0]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'score': [85.5, 92.1, 78.5, 89.0]
})

# Compare DataFrames
compare = datacompy.Compare(df1, df2, join_columns=['id'])

# Check if DataFrames match
if compare.matches():
    print("DataFrames are identical")
else:
    print("DataFrames differ")
    print(compare.report())

# Access comparison results
print(f"Rows in df1 only: {len(compare.df1_unq_rows)}")
print(f"Rows in df2 only: {len(compare.df2_unq_rows)}")
print(f"Shared rows: {len(compare.intersect_rows)}")
```
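
The three result attributes printed above split the rows by join key. As a rough illustration only (plain Python sets, not how DataComPy computes these attributes internally), the `id` values from the sample frames partition as follows:

```python
# Join-key values copied from the df1/df2 sample frames above.
df1_ids = {1, 2, 3, 4}
df2_ids = {1, 2, 3, 5}

only_in_df1 = df1_ids - df2_ids  # keys of the rows reported in df1_unq_rows
only_in_df2 = df2_ids - df1_ids  # keys of the rows reported in df2_unq_rows
shared = df1_ids & df2_ids       # keys of the rows reported in intersect_rows

print(sorted(only_in_df1), sorted(only_in_df2), sorted(shared))
```

Under this reading, each of the two "only" groups holds one row and the shared group holds three, which is what the `len(...)` calls in the example would be expected to report.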

## Architecture

DataComPy uses a consistent architecture across all DataFrame backends:

- **BaseCompare**: Abstract base class defining the common comparison interface
- **Backend-Specific Classes**: Concrete implementations for each DataFrame library (Compare for Pandas, PolarsCompare for Polars, etc.)
- **Unified API**: All comparison classes share the same method signatures and behavior patterns
- **Tolerance Support**: Configurable absolute and relative tolerance for numeric comparisons
- **Distributed Computing**: Fugue integration enables scaling across multiple compute backends

This design allows seamless switching between DataFrame libraries while maintaining identical functionality and API consistency.
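
To make the tolerance support concrete, here is a minimal sketch of one common way to combine absolute and relative tolerance, namely the rule documented for NumPy's `isclose`. Whether DataComPy applies exactly this rule is an assumption here, and the helper name `within_tolerance` is hypothetical, not part of the library.

```python
def within_tolerance(
    a: float, b: float, abs_tol: float = 0.0, rel_tol: float = 0.0
) -> bool:
    """Return True if a and b differ by at most abs_tol + rel_tol * |b|.

    This mirrors the rule documented for numpy.isclose. It is an
    illustrative assumption, not DataComPy's confirmed implementation.
    """
    return abs(a - b) <= abs_tol + rel_tol * abs(b)


# Bob's score differs between the two sample frames: 92.0 vs 92.1.
strict = within_tolerance(92.0, 92.1)                   # no tolerance at all
loose_abs = within_tolerance(92.0, 92.1, abs_tol=0.2)   # absolute slack of 0.2
loose_rel = within_tolerance(92.0, 92.1, rel_tol=0.01)  # 1% relative slack

print(strict, loose_abs, loose_rel)
```

With zero tolerance the two scores count as a mismatch; either form of slack shown here is wide enough to treat them as equal.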

## Capabilities

### Pandas DataFrame Comparison

Core DataFrame comparison functionality for Pandas, including detailed statistical reporting, tolerance-based numeric comparisons, and comprehensive mismatch analysis.

```python { .api }
class Compare(BaseCompare):
    def __init__(
        self,
        df1: pd.DataFrame,
        df2: pd.DataFrame,
        join_columns: List[str] | str | None = None,
        on_index: bool = False,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True
    ): ...

    def matches(self, ignore_extra_columns: bool = False) -> bool: ...
    def report(
        self,
        sample_count: int = 10,
        column_count: int = 10,
        html_file: str | None = None,
        template_path: str | None = None
    ) -> str: ...
```

[Pandas DataFrame Comparison](./pandas-comparison.md)

### Multi-Backend DataFrame Comparison

Comparison classes for Polars, Spark, and Snowflake DataFrames, providing the same functionality as Pandas comparison but optimized for each backend's specific characteristics and capabilities.

```python { .api }
class PolarsCompare(BaseCompare): ...
class SparkSQLCompare(BaseCompare): ...
class SnowflakeCompare(BaseCompare): ...
```

[Multi-Backend Comparison](./multi-backend-comparison.md)

### Distributed DataFrame Comparison

Fugue-powered distributed comparison functions that work across multiple backends including Dask, DuckDB, Ray, and Arrow, enabling scalable comparison of large datasets.

```python { .api }
def is_match(
    df1: AnyDataFrame,
    df2: AnyDataFrame,
    join_columns: str | List[str],
    abs_tol: float = 0,
    rel_tol: float = 0,
    df1_name: str = "df1",
    df2_name: str = "df2",
    ignore_spaces: bool = False,
    ignore_case: bool = False,
    cast_column_names_lower: bool = True,
    parallelism: int | None = None,
    strict_schema: bool = False
) -> bool: ...

def report(
    df1: AnyDataFrame,
    df2: AnyDataFrame,
    join_columns: str | List[str],
    abs_tol: float = 0,
    rel_tol: float = 0,
    df1_name: str = "df1",
    df2_name: str = "df2",
    ignore_spaces: bool = False,
    ignore_case: bool = False,
    cast_column_names_lower: bool = True,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    parallelism: int | None = None
) -> str: ...
```

[Distributed Comparison](./distributed-comparison.md)

### Column-Level Comparison Utilities

Low-level functions for comparing individual columns and performing specialized comparisons, useful for custom comparison logic and integration with other data processing workflows.

```python { .api }
def columns_equal(
    col_1: pd.Series[Any],
    col_2: pd.Series[Any],
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False
) -> pd.Series[bool]: ...

def calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) -> float: ...
```

[Column Utilities](./column-utilities.md)
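
The `ignore_spaces` and `ignore_case` flags suggest that string values are normalized before they are compared. The sketch below shows that idea over plain Python lists. The helper `strings_equal` is hypothetical, and the exact normalization DataComPy performs (for example, whether it trims only leading and trailing whitespace) is an assumption.

```python
from typing import List


def strings_equal(
    col_1: List[str],
    col_2: List[str],
    ignore_spaces: bool = False,
    ignore_case: bool = False,
) -> List[bool]:
    """Compare two equal-length lists of strings element by element."""

    def normalize(value: str) -> str:
        if ignore_spaces:
            value = value.strip()  # assumption: trim leading/trailing whitespace
        if ignore_case:
            value = value.upper()  # assumption: compare upper-cased values
        return value

    return [normalize(a) == normalize(b) for a, b in zip(col_1, col_2)]


left = ["Alice", " Bob ", "charlie"]
right = ["Alice", "Bob", "Charlie"]

exact = strings_equal(left, right)
relaxed = strings_equal(left, right, ignore_spaces=True, ignore_case=True)

print(exact, relaxed)
```

Returning one boolean per element, rather than a single verdict, matches the shape of the `columns_equal` signature above, which returns a boolean series.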

### Reporting and Output

Template-based reporting system with customizable HTML and text output, providing detailed comparison statistics, mismatch samples, and publication-ready reports.

```python { .api }
def render(template_name: str, **context: Any) -> str: ...
def save_html_report(report: str, html_file: str | Path) -> None: ...
def df_to_str(df: Any, sample_count: int | None, on_index: bool) -> str: ...
```

[Reporting System](./reporting.md)