# Pandas Integration
ydata_profiling integrates with pandas through monkey patching: importing the package adds a `.profile_report()` method to pandas DataFrames, so profiling reports can be generated from any DataFrame without leaving a pandas workflow.
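The mechanism behind this is ordinary attribute assignment on the `DataFrame` class. A minimal sketch of the idea (hypothetical names; the real ydata_profiling hook differs in detail):

```python
import pandas as pd

# Hypothetical illustration of the monkey-patching mechanism:
# assigning a function to the DataFrame class makes it available
# as a method on every DataFrame instance.
def _profile_report_demo(self, **kwargs):
    # The real library returns a ProfileReport; here we just echo
    # the DataFrame shape to show how the binding works.
    return {"rows": len(self), "columns": len(self.columns), "kwargs": kwargs}

pd.DataFrame.profile_report_demo = _profile_report_demo

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = df.profile_report_demo(title="demo")
```

Because the function is attached to the class rather than to one instance, every DataFrame created afterwards (and before) gains the method.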
## Capabilities
### Profile Report Method
The `.profile_report()` method is added to the pandas `DataFrame` class when ydata_profiling is imported, making it available on every DataFrame instance. It provides direct access to profiling functionality without explicitly constructing a `ProfileReport`.
```python { .api }
def profile_report(
    self,
    minimal: bool = False,
    tsmode: bool = False,
    sortby: Optional[str] = None,
    sensitive: bool = False,
    explorative: bool = False,
    sample: Optional[dict] = None,
    config_file: Optional[Union[Path, str]] = None,
    lazy: bool = True,
    typeset: Optional[VisionsTypeset] = None,
    summarizer: Optional[BaseSummarizer] = None,
    config: Optional[Settings] = None,
    type_schema: Optional[dict] = None,
    **kwargs
) -> ProfileReport:
    """
    Generate a comprehensive profiling report for this DataFrame.

    This method is automatically added to pandas DataFrame instances
    when ydata_profiling is imported via monkey patching.

    Parameters:
    - minimal: use minimal computation mode for faster processing
    - tsmode: enable time-series analysis for numerical variables
    - sortby: column to sort by for time-series analysis
    - sensitive: enable privacy mode hiding sensitive values
    - explorative: enable additional exploratory features
    - sample: sampling configuration dictionary
    - config_file: path to a YAML configuration file
    - lazy: defer computation until needed
    - typeset: custom type inference system
    - summarizer: custom statistical summarizer
    - config: Settings object for configuration
    - type_schema: manual type specifications
    - **kwargs: additional configuration parameters

    Returns:
        ProfileReport instance containing the comprehensive analysis
    """
```
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('data.csv')

# Generate a report using the DataFrame method
report = df.profile_report(title="My Dataset Report")

# Export the report
report.to_file("report.html")

# Generate with custom configuration
report = df.profile_report(
    title="Detailed Analysis",
    explorative=True,
    minimal=False
)
```
### Automatic Method Addition
When ydata_profiling is imported, the `profile_report()` method is automatically added to all pandas DataFrame instances.
**Usage Example:**
```python
import pandas as pd

# This will NOT work - profile_report is not available yet
# df = pd.read_csv('data.csv')
# report = df.profile_report()  # AttributeError

# Importing ydata_profiling adds the method
from ydata_profiling import ProfileReport

# Now the method is available on all DataFrames
df = pd.read_csv('data.csv')
report = df.profile_report()  # Works!

# The method is available on any DataFrame
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
report2 = df2.profile_report(title="Simple DataFrame")
```
### Integration with Pandas Workflows
Seamless integration with common pandas data analysis workflows.
**Data Cleaning Workflow:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load and explore data
df = pd.read_csv('messy_data.csv')

# Initial profiling
initial_report = df.profile_report(title="Initial Data Assessment")
initial_report.to_file("initial_analysis.html")

# Clean data based on profiling insights
df_cleaned = df.dropna(subset=['important_column'])
df_cleaned = df_cleaned[df_cleaned['age'] >= 0]  # Remove negative ages
df_cleaned = df_cleaned.drop_duplicates()

# Profile cleaned data
cleaned_report = df_cleaned.profile_report(title="Cleaned Data")
cleaned_report.to_file("cleaned_analysis.html")

# Compare before and after
comparison = initial_report.compare(cleaned_report)
comparison.to_file("cleaning_impact.html")
```
**Exploratory Data Analysis Workflow:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('customer_data.csv')

# Quick exploration with minimal mode for large datasets
quick_profile = df.profile_report(
    title="Quick Customer Data Overview",
    minimal=True
)

# Detailed analysis after initial insights
has_timestamp = 'timestamp' in df.columns
detailed_profile = df.profile_report(
    title="Comprehensive Customer Analysis",
    explorative=True,
    tsmode=has_timestamp,
    sortby='timestamp' if has_timestamp else None
)

detailed_profile.to_file("customer_analysis.html")

# Access specific insights
duplicates = detailed_profile.get_duplicates()
print(f"Found {len(duplicates)} duplicate customers")
```
### Method Chaining Support
The pandas integration supports method chaining for fluid data analysis workflows.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Method chaining with profiling
report = (pd.read_csv('data.csv')
          .dropna()
          .reset_index(drop=True)
          .profile_report(title="Processed Data Analysis"))

# Chain with other pandas operations
# (df is an existing DataFrame with 'age' and 'category' columns)
processed_report = (df
                    .query('age >= 18')
                    .groupby('category')
                    .first()
                    .reset_index()
                    .profile_report(title="Adult Customers by Category"))

# Export results
report.to_file("processed_analysis.html")
processed_report.to_file("category_analysis.html")
```
### Jupyter Notebook Integration
Enhanced integration with Jupyter notebooks through the pandas decorator.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data in the notebook
df = pd.read_csv('analysis_data.csv')

# Generate the report
report = df.profile_report(title="Notebook Analysis")

# Display directly in a notebook cell
report.to_notebook_iframe()

# Or use widgets for interactive exploration
report.to_widgets()

# Quick minimal analysis for fast iteration
df.profile_report(minimal=True).to_notebook_iframe()
```
### Integration with Data Pipeline
Using pandas integration in data processing pipelines.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

def analyze_dataset(file_path: str, output_dir: str) -> dict:
    """Analyze a dataset and return summary metrics."""
    # Load data
    df = pd.read_csv(file_path)

    # Generate profile
    report = df.profile_report(
        title=f"Analysis of {file_path}",
        explorative=True
    )

    # Save report
    report_path = f"{output_dir}/analysis.html"
    report.to_file(report_path)

    # Extract key metrics from the dataset statistics table
    description = report.get_description()

    return {
        'rows': description.table['n'],
        'columns': description.table['n_var'],
        'missing_cells': description.table['n_cells_missing'],
        'duplicates': description.table['n_duplicates'],
        'report_path': report_path
    }

# Use in a pipeline
metrics = analyze_dataset('input/data.csv', 'output/')
print(f"Dataset has {metrics['rows']} rows and {metrics['columns']} columns")
```
### Memory-Efficient Processing
Optimized usage patterns for large datasets using pandas integration.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# For large datasets, use minimal mode initially
large_df = pd.read_csv('large_dataset.csv')

# Quick assessment with minimal resources
quick_report = large_df.profile_report(
    minimal=True,
    title="Large Dataset - Quick Assessment"
)

# Sample a subset for detailed analysis if needed
sample_df = large_df.sample(n=10000, random_state=42)
detailed_report = sample_df.profile_report(
    title="Detailed Analysis - Sample",
    explorative=True
)

# Process in chunks for very large datasets
chunk_reports = []
for chunk in pd.read_csv('very_large_dataset.csv', chunksize=5000):
    chunk_reports.append(chunk.profile_report(minimal=True))

# Compare chunks to check data consistency
if len(chunk_reports) >= 2:
    chunk_comparison = chunk_reports[0].compare(chunk_reports[1])
    chunk_comparison.to_file("chunk_consistency.html")
```
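A complementary memory-saving step, plain pandas rather than anything specific to ydata_profiling, is downcasting numeric columns before profiling. A small sketch:

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits their values."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
small = downcast_numeric(df)
# int64 -> int8 and float64 -> float32 for these values
```

The downcasted frame can then be profiled as usual, e.g. `downcast_numeric(large_df).profile_report(minimal=True)`, with a correspondingly smaller in-memory footprint.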