# Core Profiling

Primary functionality for generating comprehensive data profile reports from DataFrames, including statistical analysis, data quality assessment, and automated report generation with customizable analysis depth and output formats.

## Capabilities

### ProfileReport Class

Main class for creating comprehensive data profiling reports from pandas or Spark DataFrames with extensive customization options.

```python { .api }
class ProfileReport:
    def __init__(
        self,
        df: Optional[Union[pd.DataFrame, sDataFrame]] = None,
        minimal: bool = False,
        tsmode: bool = False,
        sortby: Optional[str] = None,
        sensitive: bool = False,
        explorative: bool = False,
        sample: Optional[dict] = None,
        config_file: Optional[Union[Path, str]] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        type_schema: Optional[dict] = None,
        **kwargs
    ):
        """
        Generate a ProfileReport based on a pandas or spark.sql DataFrame.

        Parameters:
        - df: pandas or spark.sql DataFrame to analyze
        - minimal: use minimal computation mode for faster processing
        - tsmode: activate time-series analysis for numerical variables
        - sortby: column name to sort the dataset by (for time-series mode)
        - sensitive: hide values for categorical/text variables for privacy
        - explorative: enable additional analysis features
        - sample: sampling configuration dictionary
        - config_file: path to a YAML configuration file
        - lazy: defer computation until report generation
        - typeset: custom visions typeset for type inference
        - summarizer: custom statistical summarizer
        - config: Settings object for configuration
        - type_schema: manual type specification dictionary
        - **kwargs: additional configuration parameters
        """
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Basic usage
df = pd.read_csv('data.csv')
report = ProfileReport(df, title="My Dataset Report")

# Minimal mode for large datasets
report = ProfileReport(df, minimal=True)

# Time-series analysis
report = ProfileReport(df, tsmode=True, sortby='timestamp')

# Custom configuration
report = ProfileReport(
    df,
    explorative=True,
    sensitive=False,
    title="Detailed Analysis",
    pool_size=4
)
```
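
The `type_schema` parameter overrides automatic type inference for selected columns by mapping column names to semantic type names. A minimal sketch (the column names below are hypothetical, chosen only for illustration):

```python
# Hypothetical columns mapped to the semantic type to enforce,
# bypassing automatic inference for just these columns.
type_schema = {
    "education": "categorical",
    "income": "numeric",
}

# Passed at construction time:
# report = ProfileReport(df, type_schema=type_schema)
```

Columns not listed in the schema are still typed by the normal inference pipeline.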

### Report Generation Methods

Methods for generating and exporting profiling reports in various formats.

```python { .api }
def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
    """
    Save the report to an HTML file.

    Parameters:
    - output_file: path where to save the report
    - silent: suppress progress information
    """

def to_html(self) -> str:
    """
    Generate HTML report content as a string.

    Returns:
        Complete HTML report as a string
    """

def to_json(self) -> str:
    """
    Generate a JSON representation of the report.

    Returns:
        JSON string containing all analysis results
    """

def to_notebook_iframe(self) -> None:
    """
    Display the report in a Jupyter notebook iframe.
    """

def to_widgets(self) -> Any:
    """
    Generate interactive Jupyter widgets for the report.

    Returns:
        Widget object for interactive exploration
    """
```

**Usage Example:**

```python
# Generate report
report = ProfileReport(df)

# Export to HTML file
report.to_file("my_report.html")

# Get HTML content as string
html_content = report.to_html()

# Get JSON representation
json_data = report.to_json()

# Display in Jupyter notebook
report.to_notebook_iframe()

# Create interactive widgets
widgets = report.to_widgets()
```
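
When reports are regenerated on a schedule, writing each export to a timestamped file keeps earlier runs around for comparison. A small sketch of the pattern (the helper name is ours, not part of the library):

```python
from datetime import datetime
from pathlib import Path

def timestamped_report_path(stem: str, directory: str = ".") -> Path:
    """Build a unique HTML output path like 'stem_20240101T120000.html'."""
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    return Path(directory) / f"{stem}_{stamp}.html"

# Each run then lands in its own file:
# report.to_file(timestamped_report_path("my_report"))
```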

### Data Access Methods

Methods for accessing underlying data and analysis results.

```python { .api }
def get_description(self) -> BaseDescription:
    """
    Get the complete dataset description with all analysis results.

    Returns:
        BaseDescription object containing statistical summaries,
        correlations, missing data patterns, and data quality alerts
    """

def get_duplicates(self) -> Optional[pd.DataFrame]:
    """
    Get duplicate rows from the dataset.

    Returns:
        DataFrame containing all duplicate rows, or None if no duplicates
    """

def get_sample(self) -> dict:
    """
    Get data samples from the dataset.

    Returns:
        Dictionary containing head, tail, and random samples
    """

def get_rejected_variables(self) -> set:
    """
    Get variables that were rejected during analysis.

    Returns:
        Set of column names that were rejected
    """
```

**Usage Example:**

```python
report = ProfileReport(df)

# Get complete analysis description
description = report.get_description()

# Access duplicate rows (may be None when there are no duplicates)
duplicates = report.get_duplicates()
if duplicates is not None:
    print(f"Found {len(duplicates)} duplicate rows")

# Get data samples
samples = report.get_sample()
print("Sample data:", samples['head'])

# Check rejected variables
rejected = report.get_rejected_variables()
if rejected:
    print(f"Rejected variables: {rejected}")
```
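
The duplicate count surfaced by `get_duplicates` can be cross-checked directly against the source DataFrame with pandas. A small sketch (the helper name is ours):

```python
import pandas as pd

def count_duplicate_rows(df: pd.DataFrame) -> int:
    """Number of rows that are exact duplicates of an earlier row."""
    return int(df.duplicated().sum())

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
# Rows 0 and 1 are identical, so one row counts as a duplicate.
print(count_duplicate_rows(df))  # → 1
```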

### Report Management Methods

Methods for managing report state and comparisons.

```python { .api }
def invalidate_cache(self, subset: Optional[str] = None) -> None:
    """
    Clear cached analysis results to force recomputation.

    Parameters:
    - subset: cache subset to invalidate ("rendering", "report", or None for all)
    """

def compare(self, other: 'ProfileReport', config: Optional[Settings] = None) -> 'ProfileReport':
    """
    Compare this report with another ProfileReport.

    Parameters:
    - other: another ProfileReport to compare against
    - config: configuration for comparison analysis

    Returns:
        New ProfileReport containing comparison results
    """
```

**Usage Example:**

```python
# Create reports for two datasets
report1 = ProfileReport(df1, title="Dataset 1")
report2 = ProfileReport(df2, title="Dataset 2")

# Compare reports
comparison = report1.compare(report2)
comparison.to_file("comparison_report.html")

# Force recomputation
report1.invalidate_cache()
updated_html = report1.to_html()
```

### Properties

Key properties for accessing report components and metadata.

```python { .api }
@property
def typeset(self) -> VisionsTypeset:
    """Get the typeset used for data type inference."""

@property
def summarizer(self) -> BaseSummarizer:
    """Get the statistical summarizer used for analysis."""

@property
def description_set(self) -> BaseDescription:
    """Get the complete dataset description."""

@property
def df_hash(self) -> str:
    """Get hash of the source DataFrame."""

@property
def report(self) -> Root:
    """Get the report structure object."""

@property
def html(self) -> str:
    """Get HTML report content."""

@property
def json(self) -> str:
    """Get JSON report content."""

@property
def widgets(self) -> Any:
    """Get report widgets."""
```

**Usage Example:**

```python
report = ProfileReport(df)

# Access report properties
print(f"Report title: {report.config.title}")
print(f"DataFrame hash: {report.df_hash}")

# Access analysis components
typeset = report.typeset
summarizer = report.summarizer
description = report.description_set

# Get report content
html_report = report.html
json_report = report.json
```

### Serialization Methods

Methods for serializing and deserializing ProfileReport objects for storage and transmission. Note that `loads` and `load` are instance methods: they are called on a (possibly empty) ProfileReport instance.

```python { .api }
def dumps(self) -> bytes:
    """
    Serialize the ProfileReport to bytes.

    Returns:
        Serialized ProfileReport as bytes
    """

def loads(self, data: bytes) -> Union['ProfileReport', 'SerializeReport']:
    """
    Deserialize a ProfileReport from bytes.

    Parameters:
    - data: serialized ProfileReport bytes

    Returns:
        Deserialized ProfileReport instance
    """

def dump(self, output_file: Union[Path, str]) -> None:
    """
    Save the serialized ProfileReport to a file.

    Parameters:
    - output_file: path where to save the serialized report
    """

def load(self, load_file: Union[Path, str]) -> Union['ProfileReport', 'SerializeReport']:
    """
    Load a ProfileReport from a serialized file.

    Parameters:
    - load_file: path to serialized report file

    Returns:
        Loaded ProfileReport instance
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# Create and serialize report
report = ProfileReport(df, title="My Dataset")

# Serialize to bytes
serialized_bytes = report.dumps()

# Save to file
report.dump("my_report.pkl")

# Load from file (load and loads are instance methods)
loaded_report = ProfileReport().load("my_report.pkl")

# Deserialize from bytes
restored_report = ProfileReport().loads(serialized_bytes)

# Use loaded report
restored_report.to_file("restored_report.html")
```
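
The `dump`/`load` pair makes a simple cache possible: recompute the profile only when no serialized copy exists yet. A sketch of the pattern using stdlib pickle on an arbitrary object (with a real report you would call `report.dump(...)` and `ProfileReport().load(...)` instead):

```python
import pickle
import tempfile
from pathlib import Path

def cached(path: Path, compute):
    """Load a pickled result from path, or compute it and cache it."""
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = compute()
    path.write_bytes(pickle.dumps(result))
    return result

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "stats.pkl"
    first = cached(cache_file, lambda: {"rows": 100})   # computed and cached
    second = cached(cache_file, lambda: {"rows": -1})   # served from cache
    print(second["rows"])  # → 100
```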

### Great Expectations Integration

Integration with Great Expectations for automated data validation and expectation suite generation.

```python { .api }
def to_expectation_suite(
    self,
    suite_name: Optional[str] = None,
    data_context: Optional[Any] = None,
    save_suite: bool = True,
    run_validation: bool = True,
    build_data_docs: bool = True,
    handler: Optional[Handler] = None
) -> Any:
    """
    Generate a Great Expectations expectation suite from profiling results.

    Parameters:
    - suite_name: name for the expectation suite
    - data_context: Great Expectations data context
    - save_suite: whether to save the suite to the data context
    - run_validation: whether to run validation after creating the suite
    - build_data_docs: whether to build data docs after suite creation
    - handler: custom handler for expectation generation

    Returns:
        Great Expectations expectation suite object
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# Requires the great_expectations package to be installed

# Create ProfileReport
report = ProfileReport(df, title="Data Validation")

# Generate Great Expectations suite
suite = report.to_expectation_suite(
    suite_name="my_dataset_expectations",
    save_suite=True,
    run_validation=True
)

# The suite can now be used for ongoing data validation
print(f"Created expectation suite with {len(suite.expectations)} expectations")
```