# Pandas DataFrame Comparison


Core DataFrame comparison functionality for Pandas DataFrames, providing detailed statistical reporting, tolerance-based numeric comparisons, and comprehensive mismatch analysis. This is the primary comparison class in DataComPy.


## Capabilities


### Compare Class


The main comparison class for Pandas DataFrames that performs comprehensive comparison analysis with configurable tolerance settings and detailed reporting.


```python { .api }
class Compare(BaseCompare):
    """Comparison class for Pandas DataFrames.

    Both df1 and df2 should be dataframes containing all of the join_columns,
    with unique column names. Differences between values are compared to
    abs_tol + rel_tol * abs(df2['value']).
    """

    def __init__(
        self,
        df1: pd.DataFrame,
        df2: pd.DataFrame,
        join_columns: List[str] | str | None = None,
        on_index: bool = False,
        abs_tol: float | Dict[str, float] = 0,
        rel_tol: float | Dict[str, float] = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
        ignore_spaces: bool = False,
        ignore_case: bool = False,
        cast_column_names_lower: bool = True
    ):
        """
        Parameters:
        - df1: First DataFrame to compare
        - df2: Second DataFrame to compare
        - join_columns: Column(s) to join dataframes on
        - on_index: If True, join on DataFrame index instead of columns
        - abs_tol: Absolute tolerance for numeric comparisons (float or dict)
        - rel_tol: Relative tolerance for numeric comparisons (float or dict)
        - df1_name: Display name for first DataFrame
        - df2_name: Display name for second DataFrame
        - ignore_spaces: Strip whitespace from string columns
        - ignore_case: Ignore case in string comparisons
        - cast_column_names_lower: Convert column names to lowercase
        """
```
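
The tolerance rule in the class docstring can be tried out in isolation. The snippet below is a plain-Python sketch of that formula only; `within_tolerance` is a hypothetical helper written for illustration, not a DataComPy function, and it makes no claim about how the library implements the check internally.

```python
def within_tolerance(value1: float, value2: float,
                     abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True when |value1 - value2| <= abs_tol + rel_tol * |value2|."""
    return abs(value1 - value2) <= abs_tol + rel_tol * abs(value2)


# With zero tolerance, 10.0 and 10.1 count as different.
strict = within_tolerance(10.0, 10.1)

# Threshold is 0.1 + 0.05 * 10.1 = 0.605, which covers the gap of about 0.1.
loose = within_tolerance(10.0, 10.1, abs_tol=0.1, rel_tol=0.05)

# A relative tolerance scales with the df2 value: 0.001 * 10.1 = 0.0101,
# which is smaller than the gap, so the values still differ.
tight_relative = within_tolerance(10.0, 10.1, rel_tol=0.001)
```

Note that the relative term is scaled by the second value, so swapping the two arguments can change the result for values close to the threshold.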


### Properties and Attributes


Access to comparison results and DataFrame metadata.


```python { .api }
# Properties
@property
def df1(self) -> pd.DataFrame:
    """Get the first dataframe."""

@property
def df2(self) -> pd.DataFrame:
    """Get the second dataframe."""

# Attributes (available after comparison)
df1_unq_rows: pd.DataFrame  # Rows only in df1
df2_unq_rows: pd.DataFrame  # Rows only in df2
intersect_rows: pd.DataFrame  # Shared rows with match indicators
column_stats: List[Dict[str, Any]]  # Column-by-column comparison statistics
```
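
To make the three row attributes concrete, the sketch below partitions two small keyed datasets the same way: rows only on one side, rows only on the other, and shared rows with a per-row match flag. It uses plain dictionaries keyed by a hypothetical `id` join column rather than DataFrames, so it illustrates the idea, not the library's implementation.

```python
# id -> value, standing in for df1 and df2 joined on "id"
rows1 = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}
rows2 = {1: 10.1, 2: 20.0, 3: 30.0, 5: 50.0}

# Keys present on one side only (analogous to df1_unq_rows / df2_unq_rows)
only_in_1 = sorted(rows1.keys() - rows2.keys())
only_in_2 = sorted(rows2.keys() - rows1.keys())

# Keys present on both sides (analogous to intersect_rows)
shared = sorted(rows1.keys() & rows2.keys())

# Per-row match indicator for the shared keys, using exact equality
value_match = {key: rows1[key] == rows2[key] for key in shared}
```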


### Column Information Methods


Methods to analyze column structure and relationships between DataFrames.


```python { .api }
def df1_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df1."""

def df2_unq_columns(self) -> OrderedSet[str]:
    """Get columns that are unique to df2."""

def intersect_columns(self) -> OrderedSet[str]:
    """Get columns that are shared between the two dataframes."""

def all_columns_match(self) -> bool:
    """Check if all columns match between DataFrames."""
```
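
The relationship between these four methods can be sketched with ordinary lists. In the example below, `split_columns` is a hypothetical helper and the column names are made up; lists stand in for the `OrderedSet` return type so that column order is preserved.

```python
def split_columns(cols1: list[str], cols2: list[str]):
    """Split two column lists into (only in first, only in second, shared)."""
    set1, set2 = set(cols1), set(cols2)
    only1 = [c for c in cols1 if c not in set2]
    only2 = [c for c in cols2 if c not in set1]
    shared = [c for c in cols1 if c in set2]
    return only1, only2, shared


only1, only2, shared = split_columns(
    ["id", "value", "status"],
    ["id", "value", "region"],
)

# Columns "match" in this sketch when neither side has a column of its own.
columns_match = not only1 and not only2
```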


### Row Comparison Methods


Methods to analyze row-level differences and overlaps.


```python { .api }
def all_rows_overlap(self) -> bool:
    """Check if all rows are present in both DataFrames."""

def count_matching_rows(self) -> int:
    """Count the number of matching rows."""

def intersect_rows_match(self) -> bool:
    """Check if rows that exist in both DataFrames have matching values."""
```


### Matching and Validation Methods


High-level methods to determine if DataFrames match according to various criteria.


```python { .api }
def matches(self, ignore_extra_columns: bool = False) -> bool:
    """
    Check if DataFrames match completely.

    Parameters:
    - ignore_extra_columns: If True, ignore columns that exist in only one DataFrame

    Returns:
    True if DataFrames match, False otherwise
    """

def subset(self) -> bool:
    """
    Check if df2 is a subset of df1.

    Returns:
    True if df2 is a subset of df1, False otherwise
    """
```
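
One plausible way to read `matches()` and `subset()` is as combinations of the column and row checks documented earlier. The sketch below is an assumption made for illustration: the `Summary` container and both functions are hypothetical, and the exact conditions DataComPy applies are not stated in the signatures above.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical container for the results of the lower-level checks."""
    all_columns_match: bool
    all_rows_overlap: bool
    intersect_rows_match: bool
    df2_has_unique_columns: bool
    df2_has_unique_rows: bool


def matches(s: Summary, ignore_extra_columns: bool = False) -> bool:
    # Assumed rule: columns, row overlap, and shared-row values must all agree,
    # with the column check skipped when ignore_extra_columns is True.
    columns_ok = ignore_extra_columns or s.all_columns_match
    return columns_ok and s.all_rows_overlap and s.intersect_rows_match


def subset(s: Summary) -> bool:
    # Assumed rule: df2 adds no columns or rows of its own, and the rows it
    # shares with df1 have matching values.
    return (not s.df2_has_unique_columns
            and not s.df2_has_unique_rows
            and s.intersect_rows_match)


# df1 carries one extra column; every row overlaps and shared values agree.
summary = Summary(
    all_columns_match=False,
    all_rows_overlap=True,
    intersect_rows_match=True,
    df2_has_unique_columns=False,
    df2_has_unique_rows=False,
)
strict_result = matches(summary)
relaxed_result = matches(summary, ignore_extra_columns=True)
subset_result = subset(summary)
```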


### Mismatch Analysis Methods


Methods to identify and analyze specific differences between DataFrames.


```python { .api }
def sample_mismatch(
    self,
    column: str,
    sample_count: int = 10,
    for_display: bool = False
) -> pd.DataFrame | None:
    """
    Get a sample of mismatched values for a specific column.

    Parameters:
    - column: Name of column to sample
    - sample_count: Number of mismatched rows to return
    - for_display: Format output for display purposes

    Returns:
    DataFrame with sample of mismatched rows, or None if no mismatches
    """

def all_mismatch(self, ignore_matching_cols: bool = False) -> pd.DataFrame:
    """
    Get all mismatched rows.

    Parameters:
    - ignore_matching_cols: If True, exclude columns that match completely

    Returns:
    DataFrame containing all rows with mismatches
    """
```


### Report Generation


Comprehensive reporting functionality with customizable output formats.


```python { .api }
def report(
    self,
    sample_count: int = 10,
    column_count: int = 10,
    html_file: str | None = None,
    template_path: str | None = None
) -> str:
    """
    Generate comprehensive comparison report.

    Parameters:
    - sample_count: Number of sample mismatches to include
    - column_count: Number of columns to include in detailed stats
    - html_file: Path to save HTML report (optional)
    - template_path: Custom template path (optional)

    Returns:
    String containing the formatted comparison report
    """
```


## Usage Examples


### Basic Comparison


```python
import pandas as pd
import datacompy

# Create test data
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [10.0, 20.0, 30.0, 40.0],
    'status': ['active', 'active', 'inactive', 'active']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'value': [10.1, 20.0, 30.0, 50.0],
    'status': ['active', 'active', 'inactive', 'pending']
})

# Perform comparison
compare = datacompy.Compare(df1, df2, join_columns=['id'])

# Check results
print(f"DataFrames match: {compare.matches()}")
print(f"Rows only in df1: {len(compare.df1_unq_rows)}")
print(f"Rows only in df2: {len(compare.df2_unq_rows)}")
```


### Tolerance-Based Comparison


```python
# Reuses df1, df2, and the datacompy import from the Basic Comparison example

# Compare with tolerance for numeric columns
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol=0.1,  # Allow 0.1 absolute difference
    rel_tol=0.05  # Allow 5% relative difference
)

# Per-column tolerance ('default' applies to columns not listed)
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    abs_tol={'value': 0.2, 'default': 0.1},
    rel_tol={'value': 0.1, 'default': 0.05}
)
```


### String Comparison Options


```python
# Reuses df1, df2, and the datacompy import from the Basic Comparison example

# Ignore case and whitespace in string comparisons
compare = datacompy.Compare(
    df1, df2,
    join_columns=['id'],
    ignore_case=True,
    ignore_spaces=True
)
```
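
The effect of these two flags can be pictured as normalizing both strings before comparing them. The sketch below is illustrative only: `normalize` is a hypothetical helper, and it assumes that `ignore_spaces` trims leading and trailing whitespace rather than removing spaces inside the value.

```python
def normalize(text: str, ignore_spaces: bool = False,
              ignore_case: bool = False) -> str:
    """Apply the optional string normalizations before comparison."""
    if ignore_spaces:
        text = text.strip()   # assumed: leading/trailing whitespace only
    if ignore_case:
        text = text.upper()   # compare in a single case
    return text


left, right = " Active ", "active"

exact_equal = left == right
both_flags_equal = (
    normalize(left, ignore_spaces=True, ignore_case=True)
    == normalize(right, ignore_spaces=True, ignore_case=True)
)
case_only_equal = (
    normalize(left, ignore_case=True) == normalize(right, ignore_case=True)
)
```

With only `ignore_case` set, the padded value still differs because the surrounding spaces remain.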


### Detailed Reporting


```python
# Reuses the compare object created in the examples above

# Generate detailed report
report = compare.report(sample_count=20, column_count=15)
print(report)

# Save HTML report
compare.report(html_file='comparison_report.html')

# Get specific mismatch samples (None if the column has no mismatches)
value_mismatches = compare.sample_mismatch('value', sample_count=5)
print(value_mismatches)
```