# Pandas Integration

Direct integration with pandas DataFrames through monkey patching: importing ydata_profiling adds a `.profile_report()` method to every DataFrame, so profiling reports can be generated seamlessly within pandas workflows.

## Capabilities

### Profile Report Method

The `.profile_report()` method is automatically added to all pandas DataFrame instances when ydata_profiling is imported. This method provides direct access to profiling functionality without needing to explicitly create a ProfileReport instance.

```python { .api }
def profile_report(
    self,
    minimal: bool = False,
    tsmode: bool = False,
    sortby: Optional[str] = None,
    sensitive: bool = False,
    explorative: bool = False,
    sample: Optional[dict] = None,
    config_file: Optional[Union[Path, str]] = None,
    lazy: bool = True,
    typeset: Optional[VisionsTypeset] = None,
    summarizer: Optional[BaseSummarizer] = None,
    config: Optional[Settings] = None,
    type_schema: Optional[dict] = None,
    **kwargs
) -> ProfileReport:
    """
    Generate a comprehensive profiling report for this DataFrame.

    This method is automatically added to pandas DataFrame instances
    when ydata_profiling is imported via monkey patching.

    Parameters:
    - minimal: use minimal computation mode for faster processing
    - tsmode: enable time-series analysis for numerical variables
    - sortby: column to sort by for time-series analysis
    - sensitive: enable privacy mode hiding sensitive values
    - explorative: enable additional exploratory features
    - sample: sampling configuration dictionary
    - config_file: path to a YAML configuration file
    - lazy: defer computation until needed
    - typeset: custom type inference system
    - summarizer: custom statistical summarizer
    - config: Settings object for configuration
    - type_schema: manual type specifications
    - **kwargs: additional configuration parameters

    Returns:
        ProfileReport instance containing comprehensive analysis
    """
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('data.csv')

# Generate a report via the DataFrame method
report = df.profile_report(title="My Dataset Report")

# Export report
report.to_file("report.html")

# Generate with custom configuration
report = df.profile_report(
    title="Detailed Analysis",
    explorative=True,
    minimal=False
)
```

### Automatic Method Addition

When ydata_profiling is imported, the `profile_report()` method is automatically added to all pandas DataFrame instances.

**Usage Example:**

```python
import pandas as pd

# This will NOT work - the profile_report method is not available yet
# df = pd.read_csv('data.csv')
# report = df.profile_report()  # AttributeError

# Importing ydata_profiling adds the method
from ydata_profiling import ProfileReport

# Now the method is available on all DataFrames
df = pd.read_csv('data.csv')
report = df.profile_report()  # Works!

# The method is available on any DataFrame
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
report2 = df2.profile_report(title="Simple DataFrame")
```

### Integration with Pandas Workflows

Seamless integration with common pandas data analysis workflows.

**Data Cleaning Workflow:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load and explore data
df = pd.read_csv('messy_data.csv')

# Initial profiling
initial_report = df.profile_report(title="Initial Data Assessment")
initial_report.to_file("initial_analysis.html")

# Clean data based on profiling insights
df_cleaned = df.dropna(subset=['important_column'])
df_cleaned = df_cleaned[df_cleaned['age'] >= 0]  # Remove negative ages
df_cleaned = df_cleaned.drop_duplicates()

# Profile cleaned data
cleaned_report = df_cleaned.profile_report(title="Cleaned Data")
cleaned_report.to_file("cleaned_analysis.html")

# Compare before and after
comparison = initial_report.compare(cleaned_report)
comparison.to_file("cleaning_impact.html")
```

**Exploratory Data Analysis Workflow:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('customer_data.csv')

# Quick exploration with minimal mode for large datasets
quick_profile = df.profile_report(
    title="Quick Customer Data Overview",
    minimal=True
)

# Detailed analysis after initial insights
detailed_profile = df.profile_report(
    title="Comprehensive Customer Analysis",
    explorative=True,
    tsmode=('timestamp' in df.columns),
    sortby='timestamp' if 'timestamp' in df.columns else None
)

detailed_profile.to_file("customer_analysis.html")

# Access specific insights
duplicates = detailed_profile.get_duplicates()
print(f"Found {len(duplicates)} duplicate rows")
```

### Method Chaining Support

The pandas integration supports method chaining for fluent data analysis workflows.

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Method chaining with profiling
report = (pd.read_csv('data.csv')
          .dropna()
          .reset_index(drop=True)
          .profile_report(title="Processed Data Analysis"))

# Chain with other pandas operations
df = pd.read_csv('data.csv')
processed_report = (df
                    .query('age >= 18')
                    .groupby('category')
                    .first()
                    .reset_index()
                    .profile_report(title="Adult Customers by Category"))

# Export results
report.to_file("processed_analysis.html")
processed_report.to_file("category_analysis.html")
```

### Jupyter Notebook Integration

Enhanced integration with Jupyter notebooks through the monkey-patched DataFrame method.

195

196

**Usage Example:**

197

198

```python

199

import pandas as pd

200

from ydata_profiling import ProfileReport

201

202

# Load data in notebook

203

df = pd.read_csv('analysis_data.csv')

204

205

# Generate and display report inline

206

report = df.profile_report(title="Notebook Analysis")

207

208

# Display directly in notebook cell

209

report.to_notebook_iframe()

210

211

# Or use widgets for interactive exploration

212

report.to_widgets()

213

214

# Quick minimal analysis for fast iteration

215

df.profile_report(minimal=True).to_notebook_iframe()

216

```

### Integration with Data Pipelines

Using the pandas integration in data processing pipelines.

221

222

**Usage Example:**

223

224

```python

225

import pandas as pd

226

from ydata_profiling import ProfileReport

227

228

def analyze_dataset(file_path: str, output_dir: str) -> dict:

229

"""

230

Analyze dataset and return summary metrics.

231

"""

232

# Load data

233

df = pd.read_csv(file_path)

234

235

# Generate profile

236

report = df.profile_report(

237

title=f"Analysis of {file_path}",

238

explorative=True

239

)

240

241

# Save report

242

report_path = f"{output_dir}/analysis.html"

243

report.to_file(report_path)

244

245

# Extract key metrics

246

description = report.get_description()

247

248

return {

249

'rows': description.table['n'],

250

'columns': description.table['p'],

251

'missing_cells': description.table['n_cells_missing'],

252

'duplicates': description.table['n_duplicates'],

253

'report_path': report_path

254

}

255

256

# Use in pipeline

257

metrics = analyze_dataset('input/data.csv', 'output/')

258

print(f"Dataset has {metrics['rows']} rows and {metrics['columns']} columns")

259

```

### Memory-Efficient Processing

Optimized usage patterns for large datasets using the pandas integration.

264

265

**Usage Example:**

266

267

```python

268

import pandas as pd

269

from ydata_profiling import ProfileReport

270

271

# For large datasets, use minimal mode initially

272

large_df = pd.read_csv('large_dataset.csv')

273

274

# Quick assessment with minimal resources

275

quick_report = large_df.profile_report(

276

minimal=True,

277

title="Large Dataset - Quick Assessment"

278

)

279

280

# Sample subset for detailed analysis if needed

281

sample_df = large_df.sample(n=10000, random_state=42)

282

detailed_report = sample_df.profile_report(

283

title="Detailed Analysis - Sample",

284

explorative=True

285

)

286

287

# Process in chunks for very large datasets

288

chunk_reports = []

289

for chunk in pd.read_csv('very_large_dataset.csv', chunksize=5000):

290

chunk_report = chunk.profile_report(minimal=True)

291

chunk_reports.append(chunk_report)

292

293

# Compare chunks to identify data consistency

294

if len(chunk_reports) >= 2:

295

chunk_comparison = chunk_reports[0].compare(chunk_reports[1])

296

chunk_comparison.to_file("chunk_consistency.html")

297

```