# Analysis Components

Detailed statistical analysis components including correlation analysis, missing data patterns, duplicate detection, and specialized analysis for different data types. These components form the analytical engine behind YData Profiling's comprehensive data understanding capabilities.

## Capabilities

### Base Description

Core data structure containing complete dataset analysis results and statistical summaries.

```python { .api }
class BaseDescription:
    """
    Complete dataset description containing all analysis results.

    Contains statistical summaries, data quality metrics, correlations,
    missing data patterns, duplicate analysis, and variable-specific insights.
    """

    # Core properties
    analysis: BaseAnalysis
    table: dict
    variables: dict
    correlations: dict
    missing: dict
    alerts: List[Alert]
    package: dict

    def __init__(self, analysis: BaseAnalysis, table: dict, variables: dict, **kwargs):
        """Initialize BaseDescription with analysis results."""
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access analysis components (table-level statistics)
print(f"Dataset shape: {description.table['n']} rows, {description.table['n_var']} columns")
print(f"Missing cells: {description.table['n_cells_missing']}")
print(f"Duplicate rows: {description.table['n_duplicates']}")

# Access variable-specific analysis
for var_name, var_data in description.variables.items():
    print(f"Variable {var_name}: {var_data['type']}")
```
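To make the table-level keys in the example above more concrete, the following sketch recomputes three of them (`n`, `n_cells_missing`, `n_duplicates`) for a tiny dataset using only the standard library. The helper `table_stats` and the row layout are hypothetical and not part of YData Profiling; the library's own definitions (for example, exactly how duplicates are counted) may differ.

```python
from collections import Counter
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def table_stats(rows: Sequence[Row]) -> dict:
    """Hypothetical helper: recompute a few table-level counts by hand."""
    # Assumption: a cell counts as missing when it is None.
    n_cells_missing = sum(cell is None for row in rows for cell in row)
    # Assumption: every repeat of an already-seen row counts as one duplicate.
    counts = Counter(tuple(row) for row in rows)
    n_duplicates = sum(c - 1 for c in counts.values())
    return {
        "n": len(rows),
        "n_cells_missing": n_cells_missing,
        "n_duplicates": n_duplicates,
    }


rows = [
    (1, "a", 3.0),
    (2, None, 4.5),
    (1, "a", 3.0),  # exact repeat of the first row
    (3, "c", None),
]
stats = table_stats(rows)
print(stats)
```

Working through a small case like this can help when sanity-checking the numbers a real report shows for a dataset.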

### Statistical Summarizers

Statistical computation engines that perform the actual analysis of datasets.

```python { .api }
class BaseSummarizer:
    """
    Base interface for statistical summarizers.

    Defines the contract for implementing custom analysis engines
    for different data backends (pandas, Spark, etc.).
    """

    def summarize(self, config: Settings, df: Union[pd.DataFrame, Any]) -> BaseDescription:
        """
        Perform statistical analysis on the dataset.

        Parameters:
        - config: configuration settings for analysis
        - df: dataset to analyze

        Returns:
        BaseDescription containing complete analysis results
        """

class ProfilingSummarizer(BaseSummarizer):
    """
    Default profiling summarizer with comprehensive statistical analysis.

    Implements univariate analysis, correlation analysis, missing data
    patterns, duplicate detection, and data quality assessment.
    """

    def __init__(self, typeset: Optional[VisionsTypeset] = None):
        """
        Initialize ProfilingSummarizer.

        Parameters:
        - typeset: custom type system for variable classification
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.summarizer import ProfilingSummarizer
from ydata_profiling.config import Settings
from ydata_profiling.model.typeset import ProfilingTypeSet

# Create custom summarizer
typeset = ProfilingTypeSet()
summarizer = ProfilingSummarizer(typeset=typeset)

# Use with ProfileReport (df is assumed to be an existing pandas DataFrame)
config = Settings()
report = ProfileReport(df, summarizer=summarizer, config=config)

# Access summarizer results
description = report.get_description()
```

### Summary Formatting

Functions for formatting and processing analysis results.

```python { .api }
def format_summary(description: BaseDescription) -> dict:
    """
    Format analysis summary for display and export.

    Parameters:
    - description: BaseDescription containing analysis results

    Returns:
    Formatted dictionary with human-readable summaries
    """

def redact_summary(description_dict: dict, config: Settings) -> dict:
    """
    Redact sensitive information from analysis summary.

    Parameters:
    - description_dict: dictionary containing analysis results
    - config: configuration specifying redaction rules

    Returns:
    Dictionary with sensitive information redacted
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.summarizer import format_summary, redact_summary

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Format summary for display
formatted = format_summary(description)
print(formatted['table'])

# Redact sensitive information (per-variable settings live under config.vars)
config = Settings()
config.vars.text.redact = True
redacted = redact_summary(description.__dict__, config)
```
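The idea behind redaction is to hide raw values while keeping the statistics attached to them. A minimal standard-library sketch of that idea follows; `redact_value_counts` and the `REDACTED_<i>` placeholder naming are assumptions made for illustration and are not necessarily what `redact_summary` produces.

```python
from typing import Dict


def redact_value_counts(value_counts: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical helper: swap each distinct value for a positional
    placeholder while keeping its count unchanged."""
    return {
        f"REDACTED_{index}": count
        for index, count in enumerate(value_counts.values())
    }


# Raw frequency table for a sensitive column (made-up data).
counts = {"alice@example.com": 3, "bob@example.com": 1}
redacted_counts = redact_value_counts(counts)
print(redacted_counts)
```

The frequencies survive, so distribution plots and cardinality figures remain meaningful, but the original strings no longer appear in the output.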

### Alert System

Data quality alert system for identifying potential issues and anomalies in datasets.

```python { .api }
from enum import Enum

class AlertType(Enum):
    """
    Types of data quality alerts that can be generated.
    """
    CONSTANT = "CONSTANT"
    ZEROS = "ZEROS"
    HIGH_CORRELATION = "HIGH_CORRELATION"
    HIGH_CARDINALITY = "HIGH_CARDINALITY"
    IMBALANCE = "IMBALANCE"
    MISSING = "MISSING"
    INFINITE = "INFINITE"
    SKEWED = "SKEWED"
    UNIQUE = "UNIQUE"
    UNIFORM = "UNIFORM"
    DUPLICATES = "DUPLICATES"

class Alert:
    """
    Individual data quality alert with details and recommendations.
    """

    def __init__(
        self,
        alert_type: AlertType,
        column_name: str,
        description: str,
        **kwargs
    ):
        """
        Create a data quality alert.

        Parameters:
        - alert_type: type of alert from AlertType enum
        - column_name: name of column triggering alert
        - description: human-readable description of issue
        - **kwargs: additional alert metadata
        """

    alert_type: AlertType
    column_name: str
    description: str
    values: dict
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.alerts import Alert, AlertType

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access all alerts
alerts = description.alerts
print(f"Found {len(alerts)} data quality alerts")

# Filter alerts by type
missing_alerts = [a for a in alerts if a.alert_type == AlertType.MISSING]
correlation_alerts = [a for a in alerts if a.alert_type == AlertType.HIGH_CORRELATION]

# Examine specific alerts
for alert in alerts:
    print(f"Alert: {alert.alert_type.value}")
    print(f"Column: {alert.column_name}")
    print(f"Description: {alert.description}")
```
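When a report raises many alerts, grouping them by column is often more useful than a flat list. The sketch below shows one way to do that. It uses a hypothetical stand-in class, `DemoAlert`, that mirrors the `alert_type`, `column_name`, and `description` attributes documented above, so it runs without the library; the alert data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class DemoAlert:
    """Hypothetical stand-in for an alert object (not the library class)."""
    alert_type: str
    column_name: str
    description: str


def group_by_column(alerts: Iterable[DemoAlert]) -> Dict[str, List[str]]:
    """Map each column name to the alert types raised for it, in order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.column_name].append(alert.alert_type)
    return dict(grouped)


demo_alerts = [
    DemoAlert("MISSING", "age", "age has missing values"),
    DemoAlert("ZEROS", "income", "income has many zeros"),
    DemoAlert("SKEWED", "income", "income is highly skewed"),
]
by_column = group_by_column(demo_alerts)
for column, types in by_column.items():
    print(f"{column}: {', '.join(types)}")
```

The same grouping function should work on real alert objects as long as they expose `column_name` and `alert_type`.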

### Correlation Analysis

Comprehensive correlation analysis supporting multiple correlation methods and backends.

```python { .api }
class CorrelationBackend:
    """Base class for correlation computation backends."""

    def compute(self, df: pd.DataFrame, config: Settings) -> dict:
        """
        Compute correlations for the dataset.

        Parameters:
        - df: dataset to analyze
        - config: correlation configuration

        Returns:
        Dictionary containing correlation matrices and metadata
        """

class Correlation:
    """Base correlation analysis class."""
    pass

class Auto(Correlation):
    """Automatic correlation method selection based on data types."""
    pass

class Spearman(Correlation):
    """Spearman rank correlation analysis."""
    pass

class Pearson(Correlation):
    """Pearson product-moment correlation analysis."""
    pass

class Kendall(Correlation):
    """Kendall tau correlation analysis."""
    pass

class Cramers(Correlation):
    """Cramer's V correlation for categorical variables."""
    pass

class PhiK(Correlation):
    """PhiK correlation analysis for mixed data types."""
    pass
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.correlations import Pearson, Spearman, PhiK

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access correlation results
correlations = description.correlations

# Check available correlation methods
for method, results in correlations.items():
    if results is not None:
        print(f"{method} correlation matrix shape: {results['matrix'].shape}")

# Access specific correlation matrix
if 'pearson' in correlations:
    pearson_matrix = correlations['pearson']['matrix']
    print("Pearson correlation matrix:")
    print(pearson_matrix.head())
```
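The difference between the Pearson and Spearman methods listed above is easiest to see on a relationship that is monotonic but not linear. The following standard-library sketch computes both coefficients by hand; the function names (`pearson`, `ranks`, `spearman`) are illustrative and are not the library's implementation.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but non-linear
print(f"pearson:  {pearson(x, y):.4f}")
print(f"spearman: {spearman(x, y):.4f}")
```

Because Spearman only looks at ordering, it reports a perfect association here, while Pearson falls below 1 since the points do not lie on a straight line.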

### Type System Integration

Custom type system for intelligent data type inference and variable classification.

```python { .api }
class ProfilingTypeSet:
    """
    Custom visions typeset optimized for data profiling.

    Extends base visions typeset with profiling-specific type
    inference rules and variable classification logic.
    """

    def __init__(self):
        """Initialize ProfilingTypeSet with profiling-specific types."""

    def infer_type(self, series: pd.Series) -> str:
        """
        Infer the profiling type of a pandas Series.

        Parameters:
        - series: pandas Series to analyze

        Returns:
        String representing the inferred profiling type
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.typeset import ProfilingTypeSet
import pandas as pd

# Create custom typeset
typeset = ProfilingTypeSet()

# Use with ProfileReport (df is assumed to be an existing pandas DataFrame)
report = ProfileReport(df, typeset=typeset)

# Access type inference results
description = report.get_description()
for var_name, var_info in description.variables.items():
    print(f"{var_name}: {var_info['type']}")
```
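Type inference of this kind can be pictured as a chain of rules tried in order against the non-missing values of a column. The sketch below is a deliberately simplified, standard-library illustration of that idea. The function `infer_demo_type`, its rules, and the type labels it returns are assumptions for illustration only and do not reproduce the visions-based logic used by the library.

```python
from typing import Dict, List, Optional, Sequence


def infer_demo_type(values: Sequence[Optional[object]]) -> str:
    """Hypothetical rule-based inference over a column's values."""
    present = [v for v in values if v is not None]
    if not present:
        return "Unsupported"
    # bool is a subclass of int in Python, so check it before the numeric rule.
    if all(isinstance(v, bool) for v in present):
        return "Boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "Numeric"
    if all(isinstance(v, str) for v in present):
        return "Categorical"
    return "Unsupported"


columns: Dict[str, List[Optional[object]]] = {
    "age": [31, None, 45],
    "is_member": [True, False, True],
    "city": ["Lisbon", "Porto", None],
    "mixed": [1, "two", 3.0],
}
inferred = {name: infer_demo_type(vals) for name, vals in columns.items()}
for name, type_name in inferred.items():
    print(f"{name}: {type_name}")
```

Rule order matters in a scheme like this: the more specific check (boolean) has to run before the more general one (numeric).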

### Sample Management

Data sampling functionality for handling large datasets and providing representative samples.

```python { .api }
class Sample:
    """
    Data sampling functionality for report generation.

    Provides head, tail, and random sampling strategies
    for including representative data in reports.
    """

    def __init__(self, sample_config: dict):
        """
        Initialize Sample with configuration.

        Parameters:
        - sample_config: dictionary containing sampling parameters
        """

    def get_sample(self, df: pd.DataFrame) -> dict:
        """
        Generate samples from the dataset.

        Parameters:
        - df: dataset to sample

        Returns:
        Dictionary containing different sample types
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# Configure sampling in ProfileReport
sample_config = {
    "head": 10,
    "tail": 10,
    "random": 10
}

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df, sample=sample_config)

# Access samples
samples = report.get_sample()
print("Head sample:")
print(samples['head'])
print("\nRandom sample:")
print(samples['random'])
```
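The three strategies named above (head, tail, random) can be sketched with the standard library alone. The helper `take_samples` and its parameters are hypothetical and only illustrate the idea; a fixed seed is used here so that the random sample is reproducible, which is an assumption rather than documented library behavior.

```python
import random
from typing import Dict, List, Sequence


def take_samples(
    rows: Sequence[object],
    head: int,
    tail: int,
    n_random: int,
    seed: int = 0,
) -> Dict[str, List[object]]:
    """Hypothetical helper returning head, tail, and random samples."""
    rng = random.Random(seed)  # local generator, global state untouched
    return {
        "head": list(rows[:head]),
        # rows[-0:] would return everything, so guard the zero case.
        "tail": list(rows[-tail:]) if tail > 0 else [],
        # Sample without replacement, capped at the number of rows.
        "random": rng.sample(list(rows), min(n_random, len(rows))),
    }


rows = list(range(100))
demo_samples = take_samples(rows, head=3, tail=3, n_random=5)
print("head:", demo_samples["head"])
print("tail:", demo_samples["tail"])
print("random:", demo_samples["random"])
```

Head and tail samples are useful for spotting loading problems at file boundaries, while the random sample gives a less biased impression of typical rows.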

### Analysis Metadata

Base analysis metadata containing dataset-level information and processing details.

```python { .api }
class BaseAnalysis:
    """
    Base analysis metadata containing dataset-level information.

    Stores metadata about the analysis process, data source,
    and processing configuration.
    """

    def __init__(self, df: pd.DataFrame, sample: dict):
        """
        Initialize BaseAnalysis with dataset metadata.

        Parameters:
        - df: source dataset
        - sample: sampling configuration
        """

    # Analysis metadata
    title: str
    date_start: datetime
    date_end: datetime
    duration: float

class TimeIndexAnalysis(BaseAnalysis):
    """
    Time series analysis metadata for time-indexed datasets.

    Extends BaseAnalysis with time series specific metadata
    including temporal patterns and seasonality detection.
    """

    def __init__(self, df: pd.DataFrame, sample: dict, time_index: str):
        """
        Initialize TimeIndexAnalysis.

        Parameters:
        - df: time-indexed dataset
        - sample: sampling configuration
        - time_index: name of time index column
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df is assumed to be an existing pandas DataFrame with a 'timestamp' column
report = ProfileReport(df, tsmode=True, sortby='timestamp')
description = report.get_description()

# Access analysis metadata
analysis = description.analysis
print(f"Analysis duration: {analysis.duration}s")
print(f"Analysis started: {analysis.date_start}")

# For time series analysis
if hasattr(analysis, 'time_index'):
    print(f"Time index column: {analysis.time_index}")
```
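The relationship between the `date_start`, `date_end`, and `duration` fields can be sketched with a small stand-in class. `DemoAnalysis` below is hypothetical and not the library's `BaseAnalysis`; it assumes that the duration is simply the elapsed time between the two timestamps, expressed in seconds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DemoAnalysis:
    """Hypothetical stand-in holding run metadata for an analysis."""
    title: str
    date_start: datetime
    date_end: datetime

    @property
    def duration(self) -> float:
        """Elapsed time between start and end, in seconds (assumed unit)."""
        return (self.date_end - self.date_start).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
meta = DemoAnalysis(
    title="Demo report",
    date_start=start,
    date_end=start + timedelta(seconds=2.5),
)
print(f"{meta.title}: took {meta.duration}s")
```

Deriving the duration from the two timestamps, rather than storing it separately, keeps the three values consistent by construction.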