Tessl Tile for pypi/ydata-profiling@4.16.0

# Analysis Components

Statistical analysis components of YData Profiling: correlation analysis, missing data patterns, duplicate detection, and specialized analysis for different data types. Together these components form the analytical engine behind a profile report.

## Capabilities

### Base Description

Core data structure containing complete dataset analysis results and statistical summaries.

```python { .api }
class BaseDescription:
    """
    Complete dataset description containing all analysis results.

    Contains statistical summaries, data quality metrics, correlations,
    missing data patterns, duplicate analysis, and variable-specific insights.
    """

    # Core properties
    analysis: BaseAnalysis
    table: dict
    variables: dict
    correlations: dict
    missing: dict
    alerts: List[Alert]
    package: dict

    def __init__(self, analysis: BaseAnalysis, table: dict, variables: dict, **kwargs):
        """Initialize BaseDescription with analysis results."""
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Any pandas DataFrame works; this small one is only for illustration
df = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "z"]})

report = ProfileReport(df)
description = report.get_description()

# Access dataset-level statistics: number of rows and number of variables
print(f"Dataset shape: {description.table['n']}, {description.table['n_var']}")
print(f"Missing cells: {description.table['n_cells_missing']}")
print(f"Duplicate rows: {description.table['n_duplicates']}")

# Access variable-specific analysis
for var_name, var_data in description.variables.items():
    print(f"Variable {var_name}: {var_data['type']}")
```
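To make the dataset-level counts more concrete, the following is a minimal sketch of how such counts could be derived from plain Python rows. It is not the library's implementation: `table_stats` and the `n_columns` key are hypothetical names chosen for illustration, and the library's exact counting rules (for example, how duplicates are counted) may differ.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def table_stats(rows: List[Row]) -> Dict[str, int]:
    """Compute simple dataset-level counts for a list of row dictionaries."""
    columns = sorted({name for row in rows for name in row})

    # A cell is treated as missing when its value is None or the key is absent.
    n_cells_missing = sum(
        1 for row in rows for name in columns if row.get(name) is None
    )

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    n_duplicates = 0
    for row in rows:
        key = tuple(row.get(name) for name in columns)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)

    return {
        "n": len(rows),
        "n_columns": len(columns),  # hypothetical key name
        "n_cells_missing": n_cells_missing,
        "n_duplicates": n_duplicates,
    }


rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
    {"a": 2, "b": "y"},
    {"a": None, "b": "z"},
]
stats = table_stats(rows)
print(stats)
```

The sketch keeps only the first occurrence of each row as "original", so two identical rows yield one duplicate.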

### Statistical Summarizers

Statistical computation engines that perform the actual analysis of datasets.

```python { .api }
class BaseSummarizer:
    """
    Base interface for statistical summarizers.

    Defines the contract for implementing custom analysis engines
    for different data backends (pandas, Spark, etc.).
    """

    def summarize(self, config: Settings, df: Union[pd.DataFrame, Any]) -> BaseDescription:
        """
        Perform statistical analysis on the dataset.

        Parameters:
        - config: configuration settings for analysis
        - df: dataset to analyze

        Returns:
        BaseDescription containing complete analysis results
        """

class ProfilingSummarizer(BaseSummarizer):
    """
    Default profiling summarizer with comprehensive statistical analysis.

    Implements univariate analysis, correlation analysis, missing data
    patterns, duplicate detection, and data quality assessment.
    """

    def __init__(self, typeset: Optional[VisionsTypeset] = None):
        """
        Initialize ProfilingSummarizer.

        Parameters:
        - typeset: custom type system for variable classification
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.summarizer import ProfilingSummarizer
from ydata_profiling.model.typeset import ProfilingTypeSet

# df: an existing pandas DataFrame
config = Settings()

# Create custom summarizer; the typeset is built from the configuration
typeset = ProfilingTypeSet(config)
summarizer = ProfilingSummarizer(typeset=typeset)

# Use with ProfileReport
report = ProfileReport(df, summarizer=summarizer, config=config)

# Access summarizer results
description = report.get_description()
```
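The idea of a summarizer that picks an analysis routine per variable type can be sketched in plain Python. This is only an illustration of the dispatch pattern, not the library's code: `DemoSummarizer`, `summarize_numeric`, and `summarize_text` are hypothetical names, and the type labels are passed in by hand rather than inferred.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

Summary = Dict[str, Any]


def summarize_numeric(values: List[float]) -> Summary:
    """Basic numeric statistics."""
    return {"n": len(values), "mean": mean(values), "min": min(values), "max": max(values)}


def summarize_text(values: List[str]) -> Summary:
    """Basic text statistics."""
    return {"n": len(values), "n_distinct": len(set(values))}


class DemoSummarizer:
    """Hypothetical stand-in: maps a type label to a summary function."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[list], Summary]] = {
            "Numeric": summarize_numeric,
            "Text": summarize_text,
        }

    def summarize(self, columns: Dict[str, Tuple[str, list]]) -> Dict[str, Summary]:
        """Summarize each column with the handler registered for its type."""
        result: Dict[str, Summary] = {}
        for name, (type_label, values) in columns.items():
            summary = self.handlers[type_label](values)
            summary["type"] = type_label
            result[name] = summary
        return result


demo_summary = DemoSummarizer().summarize(
    {
        "age": ("Numeric", [30, 40, 50]),
        "city": ("Text", ["Oslo", "Lima", "Oslo"]),
    }
)
print(demo_summary)
```

Registering handlers in a dictionary keeps the per-type logic separate, which is one way a different backend could swap in its own routines.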

### Summary Formatting

Functions for formatting and processing analysis results.

```python { .api }
def format_summary(description: BaseDescription) -> dict:
    """
    Format analysis summary for display and export.

    Parameters:
    - description: BaseDescription containing analysis results

    Returns:
    Formatted dictionary with human-readable summaries
    """

def redact_summary(description_dict: dict, config: Settings) -> dict:
    """
    Redact sensitive information from analysis summary.

    Parameters:
    - description_dict: dictionary containing analysis results
    - config: configuration specifying redaction rules

    Returns:
    Dictionary with sensitive information redacted
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.summarizer import format_summary, redact_summary

# df: an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Format summary for display
formatted = format_summary(description)
print(formatted['table'])

# Redact sensitive information (per-type variable settings live under `vars`)
config = Settings()
config.vars.text.redact = True
redacted = redact_summary(description.__dict__, config)
```
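As a rough picture of what redaction means for a summary, the sketch below replaces the labels of a value-count mapping with placeholders while keeping the counts. It illustrates the idea only: `redact_value_counts` and the placeholder format are assumptions, not the behavior of `redact_summary`.

```python
from typing import Dict


def redact_value_counts(value_counts: Dict[str, int]) -> Dict[str, int]:
    """Replace each label with a numbered placeholder, keeping its count."""
    return {
        f"REDACTED_{index}": count
        for index, count in enumerate(value_counts.values())
    }


original = {"alice@example.com": 3, "bob@example.com": 1}
redacted_counts = redact_value_counts(original)
print(redacted_counts)
```

The distribution of counts stays available for analysis, while the raw values no longer appear in the exported summary.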

### Alert System

Data quality alert system for identifying potential issues and anomalies in datasets.

```python { .api }
from enum import Enum

class AlertType(Enum):
    """
    Types of data quality alerts that can be generated.
    """
    CONSTANT = "CONSTANT"
    ZEROS = "ZEROS"
    HIGH_CORRELATION = "HIGH_CORRELATION"
    HIGH_CARDINALITY = "HIGH_CARDINALITY"
    IMBALANCE = "IMBALANCE"
    MISSING = "MISSING"
    INFINITE = "INFINITE"
    SKEWED = "SKEWED"
    UNIQUE = "UNIQUE"
    UNIFORM = "UNIFORM"
    DUPLICATES = "DUPLICATES"

class Alert:
    """
    Individual data quality alert with details and recommendations.
    """

    def __init__(
        self,
        alert_type: AlertType,
        column_name: str,
        description: str,
        **kwargs
    ):
        """
        Create a data quality alert.

        Parameters:
        - alert_type: type of alert from AlertType enum
        - column_name: name of column triggering alert
        - description: human-readable description of issue
        - **kwargs: additional alert metadata
        """

    alert_type: AlertType
    column_name: str
    description: str
    values: dict
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.alerts import AlertType

# df: an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access all alerts
alerts = description.alerts
print(f"Found {len(alerts)} data quality alerts")

# Filter alerts by type
missing_alerts = [a for a in alerts if a.alert_type == AlertType.MISSING]
correlation_alerts = [a for a in alerts if a.alert_type == AlertType.HIGH_CORRELATION]

# Examine specific alerts
for alert in alerts:
    print(f"Alert: {alert.alert_type.value}")
    print(f"Column: {alert.column_name}")
    print(f"Description: {alert.description}")
```
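When a report produces many alerts, grouping them by type is often easier to read than a flat list. The sketch below shows that grouping with stand-in classes so it runs without the library: `DemoAlertType` and `DemoAlert` are hypothetical and carry only the `alert_type` and `column_name` attributes described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DemoAlertType(Enum):
    """Hypothetical stand-in for the alert type enum."""

    MISSING = "MISSING"
    HIGH_CORRELATION = "HIGH_CORRELATION"
    CONSTANT = "CONSTANT"


@dataclass
class DemoAlert:
    """Hypothetical stand-in for a single alert."""

    alert_type: DemoAlertType
    column_name: str


def columns_by_alert_type(alerts: List[DemoAlert]) -> Dict[str, List[str]]:
    """Group affected column names under the name of each alert type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.alert_type.value].append(alert.column_name)
    return dict(grouped)


demo_alerts = [
    DemoAlert(DemoAlertType.MISSING, "age"),
    DemoAlert(DemoAlertType.CONSTANT, "country"),
    DemoAlert(DemoAlertType.MISSING, "income"),
]
grouped_alerts = columns_by_alert_type(demo_alerts)
print(grouped_alerts)
```

The same function body should work on real alert objects as long as they expose the two attributes used here.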

### Correlation Analysis

Comprehensive correlation analysis supporting multiple correlation methods and backends.

```python { .api }
class CorrelationBackend:
    """Base class for correlation computation backends."""

    def compute(self, df: pd.DataFrame, config: Settings) -> dict:
        """
        Compute correlations for the dataset.

        Parameters:
        - df: dataset to analyze
        - config: correlation configuration

        Returns:
        Dictionary containing correlation matrices and metadata
        """

class Correlation:
    """Base correlation analysis class."""
    pass

class Auto(Correlation):
    """Automatic correlation method selection based on data types."""
    pass

class Spearman(Correlation):
    """Spearman rank correlation analysis."""
    pass

class Pearson(Correlation):
    """Pearson product-moment correlation analysis."""
    pass

class Kendall(Correlation):
    """Kendall tau correlation analysis."""
    pass

class Cramers(Correlation):
    """Cramer's V correlation for categorical variables."""
    pass

class PhiK(Correlation):
    """PhiK correlation analysis for mixed data types."""
    pass
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df: an existing pandas DataFrame
# Request Pearson explicitly; it may not be computed under the default settings
report = ProfileReport(df, correlations={"pearson": {"calculate": True}})
description = report.get_description()

# Access correlation results
correlations = description.correlations

# Check available correlation methods; each computed entry is a pandas DataFrame
for method, matrix in correlations.items():
    if matrix is not None:
        print(f"{method} correlation matrix shape: {matrix.shape}")

# Access specific correlation matrix
if 'pearson' in correlations:
    pearson_matrix = correlations['pearson']
    print("Pearson correlation matrix:")
    print(pearson_matrix.head())
```
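The difference between the Pearson and Spearman methods listed above is easiest to see on a relationship that is monotonic but not linear. The following standard-library sketch computes both coefficients from their textbook definitions; the helper names are illustrative and this is not how the library computes them internally.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]  # monotonic but non-linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.4f}")
print(f"Spearman: {r_spearman:.4f}")
```

Because Spearman works on ranks, it reports a perfect association for any strictly increasing relationship, while Pearson falls below 1 when the relationship is not a straight line.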

### Type System Integration

Custom type system for intelligent data type inference and variable classification.

```python { .api }
class ProfilingTypeSet:
    """
    Custom visions typeset optimized for data profiling.

    Extends base visions typeset with profiling-specific type
    inference rules and variable classification logic.
    """

    def __init__(self, config: Settings, type_schema: Optional[dict] = None):
        """
        Initialize ProfilingTypeSet with profiling-specific types.

        Parameters:
        - config: configuration settings used to build the type set
        - type_schema: optional mapping of column names to predefined types
        """

    def infer_type(self, series: pd.Series) -> str:
        """
        Infer the profiling type of a pandas Series.

        Parameters:
        - series: pandas Series to analyze

        Returns:
        String representing the inferred profiling type
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.typeset import ProfilingTypeSet

# df: an existing pandas DataFrame
# Create custom typeset from the configuration
config = Settings()
typeset = ProfilingTypeSet(config)

# Use with ProfileReport
report = ProfileReport(df, typeset=typeset, config=config)

# Access type inference results
description = report.get_description()
for var_name, var_info in description.variables.items():
    print(f"{var_name}: {var_info['type']}")
```
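Type inference of this kind boils down to checking which type all non-missing values of a column are consistent with. The sketch below is a deliberately naive version over plain Python values; `infer_simple_type` and its labels are illustrative assumptions and do not reproduce the visions-based rules the library uses.

```python
from datetime import datetime
from typing import Any, Iterable


def infer_simple_type(values: Iterable[Any]) -> str:
    """Return a coarse type label for a column of Python values."""
    present = [v for v in values if v is not None]
    if not present:
        return "Unsupported"
    # bool is tested first because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in present):
        return "Boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "Numeric"
    if all(isinstance(v, datetime) for v in present):
        return "DateTime"
    if all(isinstance(v, str) for v in present):
        return "Text"
    return "Unsupported"


columns = {
    "age": [31, None, 47.5],
    "active": [True, False, None],
    "signup": [datetime(2024, 1, 5), datetime(2024, 2, 1)],
    "name": ["Ada", "Grace"],
    "mixed": [1, "one"],
}
inferred = {name: infer_simple_type(values) for name, values in columns.items()}
print(inferred)
```

Missing values are ignored during inference, so a column with gaps still receives the type of its remaining values.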

### Sample Management

Data sampling functionality for handling large datasets and providing representative samples.

```python { .api }
class Sample:
    """
    Data sampling functionality for report generation.

    Provides head, tail, and random sampling strategies
    for including representative data in reports.
    """

    def __init__(self, sample_config: dict):
        """
        Initialize Sample with configuration.

        Parameters:
        - sample_config: dictionary containing sampling parameters
        """

    def get_sample(self, df: pd.DataFrame) -> dict:
        """
        Generate samples from the dataset.

        Parameters:
        - df: dataset to sample

        Returns:
        Dictionary containing different sample types
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df: an existing pandas DataFrame
# Configure how many head, tail, and random rows the report samples
samples_config = {
    "head": 10,
    "tail": 10,
    "random": 10
}

report = ProfileReport(df, samples=samples_config)

# Access samples; each entry carries a display name and the sampled rows
for sample in report.get_sample():
    print(f"{sample.name}:")
    print(sample.data)
```
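The head, tail, and random strategies can be illustrated on a plain list of rows. This is a minimal sketch under the assumption that each strategy simply picks a fixed number of rows; `take_samples` and its parameters are hypothetical and not part of the library.

```python
import random
from typing import Any, Dict, List, Sequence


def take_samples(
    rows: Sequence[Any],
    head: int = 10,
    tail: int = 10,
    n_random: int = 10,
    seed: int = 0,
) -> Dict[str, List[Any]]:
    """Return the first rows, the last rows, and a random selection of rows."""
    rng = random.Random(seed)  # fixed seed keeps the random sample reproducible
    return {
        "head": list(rows[:head]),
        # rows[-0:] would return everything, so a tail of 0 is handled explicitly
        "tail": list(rows[-tail:]) if tail > 0 else [],
        "random": rng.sample(list(rows), min(n_random, len(rows))),
    }


rows = list(range(100))
samples = take_samples(rows, head=3, tail=2, n_random=5)
print(samples["head"])
print(samples["tail"])
print(len(samples["random"]))
```

Capping the random sample at the number of available rows avoids an error on small datasets.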

### Analysis Metadata

Base analysis metadata containing dataset-level information and processing details.

```python { .api }
class BaseAnalysis:
    """
    Base analysis metadata containing dataset-level information.

    Stores metadata about the analysis process, data source,
    and processing configuration.
    """

    def __init__(self, df: pd.DataFrame, sample: dict):
        """
        Initialize BaseAnalysis with dataset metadata.

        Parameters:
        - df: source dataset
        - sample: sampling configuration
        """

    # Analysis metadata
    title: str
    date_start: datetime
    date_end: datetime
    duration: float

class TimeIndexAnalysis(BaseAnalysis):
    """
    Time series analysis metadata for time-indexed datasets.

    Extends BaseAnalysis with time series specific metadata
    including temporal patterns and seasonality detection.
    """

    def __init__(self, df: pd.DataFrame, sample: dict, time_index: str):
        """
        Initialize TimeIndexAnalysis.

        Parameters:
        - df: time-indexed dataset
        - sample: sampling configuration
        - time_index: name of time index column
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df: an existing pandas DataFrame with a 'timestamp' column
report = ProfileReport(df, tsmode=True, sortby='timestamp')
description = report.get_description()

# Access analysis metadata
analysis = description.analysis
print(f"Analysis duration: {analysis.duration}s")
print(f"Analysis started: {analysis.date_start}")

# For time series analysis
if hasattr(analysis, 'time_index'):
    print(f"Time index column: {analysis.time_index}")
```
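The relationship between the start time, end time, and duration fields can be sketched with a small stand-in class. `DemoAnalysis` is hypothetical and only mirrors the field names listed above; it assumes the duration is the elapsed time in seconds between the two timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DemoAnalysis:
    """Hypothetical stand-in holding analysis timing metadata."""

    title: str
    date_start: datetime
    date_end: datetime

    @property
    def duration(self) -> float:
        """Elapsed time between start and end, in seconds."""
        return (self.date_end - self.date_start).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
demo_analysis = DemoAnalysis(
    title="Demo report",
    date_start=start,
    date_end=start + timedelta(seconds=2.5),
)
print(f"{demo_analysis.title} took {demo_analysis.duration}s")
```

Deriving the duration from the two timestamps keeps the three values consistent with each other.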
