# Pandas Integration
ydata_profiling integrates with pandas through monkey patching: importing the package adds a `.profile_report()` method to pandas DataFrames, so profiling reports can be generated from any DataFrame without leaving a pandas workflow.
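The mechanism behind this is ordinary attribute assignment on the `DataFrame` class. A minimal sketch of the idea (hypothetical names; the real ydata_profiling hook differs in detail):

```python
import pandas as pd

# Hypothetical illustration of the monkey-patching mechanism:
# assigning a function to the DataFrame class makes it available
# as a method on every DataFrame instance.
def _profile_report_demo(self, **kwargs):
    # The real library returns a ProfileReport; here we just echo
    # the DataFrame shape to show how the binding works.
    return {"rows": len(self), "columns": len(self.columns), "kwargs": kwargs}

pd.DataFrame.profile_report_demo = _profile_report_demo

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = df.profile_report_demo(title="demo")
```

Because the function is attached to the class rather than to one instance, every DataFrame created afterwards (and before) gains the method.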
## Capabilities
### Profile Report Method
The `.profile_report()` method is added to the pandas `DataFrame` class when ydata_profiling is imported, making it available on every DataFrame instance. It provides direct access to profiling functionality without explicitly constructing a `ProfileReport`.
```python { .api }
def profile_report(
    self,
    minimal: bool = False,
    tsmode: bool = False,
    sortby: Optional[str] = None,
    sensitive: bool = False,
    explorative: bool = False,
    sample: Optional[dict] = None,
    config_file: Optional[Union[Path, str]] = None,
    lazy: bool = True,
    typeset: Optional[VisionsTypeset] = None,
    summarizer: Optional[BaseSummarizer] = None,
    config: Optional[Settings] = None,
    type_schema: Optional[dict] = None,
    **kwargs
) -> ProfileReport:
    """
    Generate a comprehensive profiling report for this DataFrame.

    This method is automatically added to pandas DataFrame instances
    when ydata_profiling is imported via monkey patching.

    Parameters:
    - minimal: use minimal computation mode for faster processing
    - tsmode: enable time-series analysis for numerical variables
    - sortby: column to sort by for time-series analysis
    - sensitive: enable privacy mode hiding sensitive values
    - explorative: enable additional exploratory features
    - sample: sampling configuration dictionary
    - config_file: path to a YAML configuration file
    - lazy: defer computation until needed
    - typeset: custom type inference system
    - summarizer: custom statistical summarizer
    - config: Settings object for configuration
    - type_schema: manual type specifications
    - **kwargs: additional configuration parameters

    Returns:
        ProfileReport instance containing the comprehensive analysis
    """
```
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('data.csv')

# Generate a report using the DataFrame method
report = df.profile_report(title="My Dataset Report")

# Export the report
report.to_file("report.html")

# Generate with custom configuration
report = df.profile_report(
    title="Detailed Analysis",
    explorative=True,
    minimal=False
)
```
### Automatic Method Addition
When ydata_profiling is imported, the `profile_report()` method is automatically added to all pandas DataFrame instances.
**Usage Example:**
```python
import pandas as pd

# This will NOT work - profile_report is not available yet
# df = pd.read_csv('data.csv')
# report = df.profile_report()  # AttributeError

# Importing ydata_profiling adds the method
from ydata_profiling import ProfileReport

# Now the method is available on all DataFrames
df = pd.read_csv('data.csv')
report = df.profile_report()  # Works!

# The method is available on any DataFrame
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
report2 = df2.profile_report(title="Simple DataFrame")
```
### Integration with Pandas Workflows
Seamless integration with common pandas data analysis workflows.
**Data Cleaning Workflow:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load and explore data
df = pd.read_csv('messy_data.csv')

# Initial profiling
initial_report = df.profile_report(title="Initial Data Assessment")
initial_report.to_file("initial_analysis.html")

# Clean data based on profiling insights
df_cleaned = df.dropna(subset=['important_column'])
df_cleaned = df_cleaned[df_cleaned['age'] >= 0]  # Remove negative ages
df_cleaned = df_cleaned.drop_duplicates()

# Profile cleaned data
cleaned_report = df_cleaned.profile_report(title="Cleaned Data")
cleaned_report.to_file("cleaned_analysis.html")

# Compare before and after
comparison = initial_report.compare(cleaned_report)
comparison.to_file("cleaning_impact.html")
```
**Exploratory Data Analysis Workflow:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv('customer_data.csv')

# Quick exploration with minimal mode for large datasets
quick_profile = df.profile_report(
    title="Quick Customer Data Overview",
    minimal=True
)

# Detailed analysis after initial insights
has_timestamp = 'timestamp' in df.columns
detailed_profile = df.profile_report(
    title="Comprehensive Customer Analysis",
    explorative=True,
    tsmode=has_timestamp,
    sortby='timestamp' if has_timestamp else None
)

detailed_profile.to_file("customer_analysis.html")

# Access specific insights
duplicates = detailed_profile.get_duplicates()
print(f"Found {len(duplicates)} duplicate customers")
```
### Method Chaining Support
The pandas integration supports method chaining for fluid data analysis workflows.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Method chaining with profiling
report = (pd.read_csv('data.csv')
          .dropna()
          .reset_index(drop=True)
          .profile_report(title="Processed Data Analysis"))

# Chain with other pandas operations
# (df is an existing DataFrame with 'age' and 'category' columns)
processed_report = (df
                    .query('age >= 18')
                    .groupby('category')
                    .first()
                    .reset_index()
                    .profile_report(title="Adult Customers by Category"))

# Export results
report.to_file("processed_analysis.html")
processed_report.to_file("category_analysis.html")
```
### Jupyter Notebook Integration
Enhanced integration with Jupyter notebooks through the pandas decorator.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load data in the notebook
df = pd.read_csv('analysis_data.csv')

# Generate the report
report = df.profile_report(title="Notebook Analysis")

# Display directly in a notebook cell
report.to_notebook_iframe()

# Or use widgets for interactive exploration
report.to_widgets()

# Quick minimal analysis for fast iteration
df.profile_report(minimal=True).to_notebook_iframe()
```
### Integration with Data Pipeline
Using pandas integration in data processing pipelines.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

def analyze_dataset(file_path: str, output_dir: str) -> dict:
    """Analyze a dataset and return summary metrics."""
    # Load data
    df = pd.read_csv(file_path)

    # Generate profile
    report = df.profile_report(
        title=f"Analysis of {file_path}",
        explorative=True
    )

    # Save report
    report_path = f"{output_dir}/analysis.html"
    report.to_file(report_path)

    # Extract key metrics from the dataset statistics table
    description = report.get_description()

    return {
        'rows': description.table['n'],
        'columns': description.table['n_var'],
        'missing_cells': description.table['n_cells_missing'],
        'duplicates': description.table['n_duplicates'],
        'report_path': report_path
    }

# Use in a pipeline
metrics = analyze_dataset('input/data.csv', 'output/')
print(f"Dataset has {metrics['rows']} rows and {metrics['columns']} columns")
```
### Memory-Efficient Processing
Optimized usage patterns for large datasets using pandas integration.
**Usage Example:**
```python
import pandas as pd
from ydata_profiling import ProfileReport

# For large datasets, use minimal mode initially
large_df = pd.read_csv('large_dataset.csv')

# Quick assessment with minimal resources
quick_report = large_df.profile_report(
    minimal=True,
    title="Large Dataset - Quick Assessment"
)

# Sample a subset for detailed analysis if needed
sample_df = large_df.sample(n=10000, random_state=42)
detailed_report = sample_df.profile_report(
    title="Detailed Analysis - Sample",
    explorative=True
)

# Process in chunks for very large datasets
chunk_reports = []
for chunk in pd.read_csv('very_large_dataset.csv', chunksize=5000):
    chunk_reports.append(chunk.profile_report(minimal=True))

# Compare chunks to check data consistency
if len(chunk_reports) >= 2:
    chunk_comparison = chunk_reports[0].compare(chunk_reports[1])
    chunk_comparison.to_file("chunk_consistency.html")
```
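A complementary memory-saving step, plain pandas rather than anything specific to ydata_profiling, is downcasting numeric columns before profiling. A small sketch:

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits their values."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
small = downcast_numeric(df)
# int64 -> int8 and float64 -> float32 for these values
```

The downcasted frame can then be profiled as usual, e.g. `downcast_numeric(large_df).profile_report(minimal=True)`, with a correspondingly smaller in-memory footprint.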