# Analysis Components

This section covers the statistical analysis components of YData Profiling: correlation analysis, missing data patterns, duplicate detection, and specialized analysis for different data types. These components form the analytical engine behind the reports YData Profiling generates.

## Capabilities

### Base Description

Core data structure containing complete dataset analysis results and statistical summaries.

```python { .api }
class BaseDescription:
    """
    Complete dataset description containing all analysis results.

    Contains statistical summaries, data quality metrics, correlations,
    missing data patterns, duplicate analysis, and variable-specific insights.
    """

    # Core properties
    analysis: BaseAnalysis
    table: dict
    variables: dict
    correlations: dict
    missing: dict
    alerts: List[Alert]
    package: dict

    def __init__(self, analysis: BaseAnalysis, table: dict, variables: dict, **kwargs):
        """Initialize BaseDescription with analysis results."""
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Small illustrative DataFrame; any pandas DataFrame works here
df = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "z"]})

report = ProfileReport(df)
description = report.get_description()

# Access analysis components (rows, columns, missing cells, duplicates)
print(f"Dataset shape: {description.table['n']}, {description.table['n_var']}")
print(f"Missing cells: {description.table['n_cells_missing']}")
print(f"Duplicate rows: {description.table['n_duplicates']}")

# Access variable-specific analysis
for var_name, var_data in description.variables.items():
    print(f"Variable {var_name}: {var_data['type']}")
```
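
To make the table-level counts above more concrete, the following is a minimal sketch of how row, missing-cell, and duplicate-row counts could be derived from plain Python rows. It does not use YData Profiling; `table_stats` and the `n_columns` key are hypothetical names chosen for illustration, and the library's own definitions may differ in detail.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def table_stats(rows: List[Row]) -> Dict[str, int]:
    """Compute a few table-level counts for a list of dict rows (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    # A cell counts as missing when its value is None or its key is absent.
    n_cells_missing = sum(
        1 for row in rows for col in columns if row.get(col) is None
    )
    # A row counts as a duplicate when an identical row appeared earlier.
    seen = set()
    n_duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)
    return {
        "n": len(rows),
        "n_columns": len(columns),
        "n_cells_missing": n_cells_missing,
        "n_duplicates": n_duplicates,
    }


rows = [
    {"id": 1, "city": "Lisbon"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Lisbon"},
]
stats = table_stats(rows)
print(stats)
```

The sketch treats every repeated occurrence after the first as a duplicate, which is one of several reasonable conventions.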

### Statistical Summarizers

Statistical computation engines that perform the actual analysis of datasets.

```python { .api }
class BaseSummarizer:
    """
    Base interface for statistical summarizers.

    Defines the contract for implementing custom analysis engines
    for different data backends (pandas, Spark, etc.).
    """

    def summarize(self, config: Settings, df: Union[pd.DataFrame, Any]) -> BaseDescription:
        """
        Perform statistical analysis on the dataset.

        Parameters:
        - config: configuration settings for analysis
        - df: dataset to analyze

        Returns:
        BaseDescription containing complete analysis results
        """

class ProfilingSummarizer(BaseSummarizer):
    """
    Default profiling summarizer with comprehensive statistical analysis.

    Implements univariate analysis, correlation analysis, missing data
    patterns, duplicate detection, and data quality assessment.
    """

    def __init__(self, typeset: Optional[VisionsTypeset] = None):
        """
        Initialize ProfilingSummarizer.

        Parameters:
        - typeset: custom type system for variable classification
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.summarizer import ProfilingSummarizer
from ydata_profiling.model.typeset import ProfilingTypeSet

# Create custom summarizer
typeset = ProfilingTypeSet()
summarizer = ProfilingSummarizer(typeset=typeset)

# Use with ProfileReport (df is assumed to be an existing pandas DataFrame)
config = Settings()
report = ProfileReport(df, summarizer=summarizer, config=config)

# Access summarizer results
description = report.get_description()
```
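
The idea of a summarizer contract with interchangeable backends can be sketched without the library. The example below is an illustration under stated assumptions: `DemoSummarizer`, `ListOfRowsSummarizer`, and `profile` are hypothetical names, and the returned dictionary is a simplified stand-in for a full description object.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class DemoSummarizer(ABC):
    """Hypothetical contract: turn a dataset plus config into a summary dict."""

    @abstractmethod
    def summarize(self, config: Dict[str, Any], data: Any) -> Dict[str, Any]:
        ...


class ListOfRowsSummarizer(DemoSummarizer):
    """Hypothetical backend for data stored as a list of dict rows."""

    def summarize(self, config: Dict[str, Any], data: List[Dict[str, Any]]) -> Dict[str, Any]:
        columns = sorted({key for row in data for key in row})
        return {"table": {"n": len(data), "columns": columns}}


def profile(
    data: Any,
    summarizer: DemoSummarizer,
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Delegate the analysis to whichever summarizer the caller supplies."""
    return summarizer.summarize(config or {}, data)


result = profile([{"a": 1}, {"a": 2, "b": 3}], ListOfRowsSummarizer())
print(result)
```

Because `profile` depends only on the abstract interface, a backend for another data container could be swapped in without changing the calling code.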

### Summary Formatting

Functions for formatting and processing analysis results.

```python { .api }
def format_summary(description: BaseDescription) -> dict:
    """
    Format analysis summary for display and export.

    Parameters:
    - description: BaseDescription containing analysis results

    Returns:
    Formatted dictionary with human-readable summaries
    """

def redact_summary(description_dict: dict, config: Settings) -> dict:
    """
    Redact sensitive information from analysis summary.

    Parameters:
    - description_dict: dictionary containing analysis results
    - config: configuration specifying redaction rules

    Returns:
    Dictionary with sensitive information redacted
    """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings
from ydata_profiling.model.summarizer import format_summary, redact_summary

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Format summary for display
formatted = format_summary(description)
print(formatted['table'])

# Redact sensitive information (per-type settings live under config.vars)
config = Settings()
config.vars.text.redact = True
redacted = redact_summary(description.__dict__, config)
```
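
As a rough illustration of what redaction can mean for a summary, the sketch below replaces observed values in a value-count mapping with positional placeholders while keeping the counts. `redact_value_counts` is a hypothetical helper written for this example and is not claimed to match what `redact_summary` does internally.

```python
from typing import Dict


def redact_value_counts(value_counts: Dict[str, int]) -> Dict[str, int]:
    """Replace each observed value with a positional placeholder, keeping its count."""
    return {
        f"REDACTED_{index}": count
        for index, count in enumerate(value_counts.values())
    }


counts = {"alice@example.com": 3, "bob@example.com": 1}
redacted_counts = redact_value_counts(counts)
print(redacted_counts)
```

The frequencies remain usable for charts and statistics, while the raw values no longer appear in the output.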

### Alert System

Data quality alert system for identifying potential issues and anomalies in datasets.

```python { .api }
from enum import Enum

class AlertType(Enum):
    """
    Types of data quality alerts that can be generated.
    """
    CONSTANT = "CONSTANT"
    ZEROS = "ZEROS"
    HIGH_CORRELATION = "HIGH_CORRELATION"
    HIGH_CARDINALITY = "HIGH_CARDINALITY"
    IMBALANCE = "IMBALANCE"
    MISSING = "MISSING"
    INFINITE = "INFINITE"
    SKEWED = "SKEWED"
    UNIQUE = "UNIQUE"
    UNIFORM = "UNIFORM"
    DUPLICATES = "DUPLICATES"

class Alert:
    """
    Individual data quality alert with details and recommendations.
    """

    def __init__(
        self,
        alert_type: AlertType,
        column_name: str,
        description: str,
        **kwargs
    ):
        """
        Create a data quality alert.

        Parameters:
        - alert_type: type of alert from AlertType enum
        - column_name: name of column triggering alert
        - description: human-readable description of issue
        - **kwargs: additional alert metadata
        """

    alert_type: AlertType
    column_name: str
    description: str
    values: dict
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.alerts import Alert, AlertType

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access all alerts
alerts = description.alerts
print(f"Found {len(alerts)} data quality alerts")

# Filter alerts by type
missing_alerts = [a for a in alerts if a.alert_type == AlertType.MISSING]
correlation_alerts = [a for a in alerts if a.alert_type == AlertType.HIGH_CORRELATION]

# Examine specific alerts
for alert in alerts:
    print(f"Alert: {alert.alert_type.value}")
    print(f"Column: {alert.column_name}")
    print(f"Description: {alert.description}")
```
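
When a dataset produces many alerts, grouping affected columns by alert type is often easier to read than a flat list. The sketch below shows one way to do that, using stand-in classes (`DemoAlertType`, `DemoAlert`) that are hypothetical and only mimic the `alert_type` and `column_name` attributes described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DemoAlertType(Enum):
    """Stand-in enum with a few of the alert names listed above."""
    MISSING = "MISSING"
    HIGH_CORRELATION = "HIGH_CORRELATION"
    CONSTANT = "CONSTANT"


@dataclass
class DemoAlert:
    """Stand-in alert carrying only the fields needed for grouping."""
    alert_type: DemoAlertType
    column_name: str


def columns_by_alert_type(alerts: List[DemoAlert]) -> Dict[str, List[str]]:
    """Map each alert type name to the columns that triggered it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.alert_type.value].append(alert.column_name)
    return dict(grouped)


demo_alerts = [
    DemoAlert(DemoAlertType.MISSING, "age"),
    DemoAlert(DemoAlertType.CONSTANT, "country"),
    DemoAlert(DemoAlertType.MISSING, "income"),
]
grouped_alerts = columns_by_alert_type(demo_alerts)
print(grouped_alerts)
```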

### Correlation Analysis

Comprehensive correlation analysis supporting multiple correlation methods and backends.

```python { .api }
class CorrelationBackend:
    """Base class for correlation computation backends."""

    def compute(self, df: pd.DataFrame, config: Settings) -> dict:
        """
        Compute correlations for the dataset.

        Parameters:
        - df: dataset to analyze
        - config: correlation configuration

        Returns:
        Dictionary containing correlation matrices and metadata
        """

class Correlation:
    """Base correlation analysis class."""
    pass

class Auto(Correlation):
    """Automatic correlation method selection based on data types."""
    pass

class Spearman(Correlation):
    """Spearman rank correlation analysis."""
    pass

class Pearson(Correlation):
    """Pearson product-moment correlation analysis."""
    pass

class Kendall(Correlation):
    """Kendall tau correlation analysis."""
    pass

class Cramers(Correlation):
    """Cramer's V correlation for categorical variables."""
    pass

class PhiK(Correlation):
    """PhiK correlation analysis for mixed data types."""
    pass
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.correlations import Pearson, Spearman, PhiK

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df)
description = report.get_description()

# Access correlation results
correlations = description.correlations

# Check available correlation methods (each entry is a pandas DataFrame)
for method, matrix in correlations.items():
    if matrix is not None:
        print(f"{method} correlation matrix shape: {matrix.shape}")

# Access specific correlation matrix (present only if that method was enabled)
if 'pearson' in correlations:
    pearson_matrix = correlations['pearson']
    print("Pearson correlation matrix:")
    print(pearson_matrix.head())
```
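
To show why more than one correlation method is useful, the sketch below computes Pearson and Spearman coefficients from first principles on a monotonic but non-linear relationship. It is a plain-Python illustration with hypothetical helper names (`pearson`, `average_ranks`, `spearman`), not the library's implementation, and it assumes neither input is constant.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson coefficient: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined (division by zero) for constant input


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson applied to the ranks of the data."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [value ** 3 for value in x]  # monotonic but not linear

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

Spearman reaches its maximum here because the ordering of `x` and `y` agrees perfectly, while Pearson stays below it because the relationship is not a straight line.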

### Type System Integration

Custom type system for intelligent data type inference and variable classification.

```python { .api }
class ProfilingTypeSet:
    """
    Custom visions typeset optimized for data profiling.

    Extends base visions typeset with profiling-specific type
    inference rules and variable classification logic.
    """

    def __init__(self):
        """Initialize ProfilingTypeSet with profiling-specific types."""

    def infer_type(self, series: pd.Series) -> str:
        """
        Infer the profiling type of a pandas Series.

        Parameters:
        - series: pandas Series to analyze

        Returns:
        String representing the inferred profiling type
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport
from ydata_profiling.model.typeset import ProfilingTypeSet
import pandas as pd

# Create custom typeset
typeset = ProfilingTypeSet()

# Use with ProfileReport (df is assumed to be an existing pandas DataFrame)
report = ProfileReport(df, typeset=typeset)

# Access type inference results
description = report.get_description()
for var_name, var_info in description.variables.items():
    print(f"{var_name}: {var_info['type']}")
```
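
Type inference of this kind can be pictured as a chain of checks applied to the non-missing values of a column. The sketch below is a deliberately simplified, hypothetical version: the function name `infer_demo_type`, the `max_categories` threshold, and the rule order are assumptions for illustration and do not reproduce the rules of the visions-based typeset.

```python
from typing import Any, Iterable


def infer_demo_type(values: Iterable[Any], max_categories: int = 50) -> str:
    """Classify a column using simple rules applied to its non-missing values."""
    present = [value for value in values if value is not None]
    if not present:
        return "Unsupported"
    # bool is checked first because Python treats bool as a subclass of int.
    if all(isinstance(value, bool) for value in present):
        return "Boolean"
    if all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in present
    ):
        return "Numeric"
    if all(isinstance(value, str) for value in present):
        # Few distinct strings suggest categories; many suggest free text.
        return "Categorical" if len(set(present)) <= max_categories else "Text"
    return "Unsupported"


print(infer_demo_type([1, 2.5, None]))
print(infer_demo_type(["red", "blue", "red"]))
```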

### Sample Management

Data sampling functionality for handling large datasets and providing representative samples.

```python { .api }
class Sample:
    """
    Data sampling functionality for report generation.

    Provides head, tail, and random sampling strategies
    for including representative data in reports.
    """

    def __init__(self, sample_config: dict):
        """
        Initialize Sample with configuration.

        Parameters:
        - sample_config: dictionary containing sampling parameters
        """

    def get_sample(self, df: pd.DataFrame) -> dict:
        """
        Generate samples from the dataset.

        Parameters:
        - df: dataset to sample

        Returns:
        Dictionary containing different sample types
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# Configure sampling in ProfileReport
sample_config = {
    "head": 10,
    "tail": 10,
    "random": 10
}

# df is assumed to be an existing pandas DataFrame
report = ProfileReport(df, sample=sample_config)

# Access samples
samples = report.get_sample()
print("Head sample:")
print(samples['head'])
print("\nRandom sample:")
print(samples['random'])
```
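
The head, tail, and random strategies described above can be sketched on a plain sequence. This is an illustrative stand-in, not the library's sampling code: `take_samples` and its parameters are hypothetical, and a fixed seed is assumed so that the random sample is reproducible.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def take_samples(
    rows: Sequence[T], head: int, tail: int, n_random: int, seed: int = 0
) -> Dict[str, List[T]]:
    """Return the first rows, the last rows, and a reproducible random subset."""
    rng = random.Random(seed)
    return {
        "head": list(rows[:head]),
        # rows[-0:] would return everything, so zero is handled explicitly.
        "tail": list(rows[-tail:]) if tail > 0 else [],
        # Sampling without replacement, capped at the number of rows available.
        "random": rng.sample(list(rows), min(n_random, len(rows))),
    }


rows = list(range(100))
demo_samples = take_samples(rows, head=3, tail=3, n_random=5)
print(demo_samples["head"], demo_samples["tail"], len(demo_samples["random"]))
```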

### Analysis Metadata

Base analysis metadata containing dataset-level information and processing details.

```python { .api }
class BaseAnalysis:
    """
    Base analysis metadata containing dataset-level information.

    Stores metadata about the analysis process, data source,
    and processing configuration.
    """

    def __init__(self, df: pd.DataFrame, sample: dict):
        """
        Initialize BaseAnalysis with dataset metadata.

        Parameters:
        - df: source dataset
        - sample: sampling configuration
        """

    # Analysis metadata
    title: str
    date_start: datetime
    date_end: datetime
    duration: float

class TimeIndexAnalysis(BaseAnalysis):
    """
    Time series analysis metadata for time-indexed datasets.

    Extends BaseAnalysis with time series specific metadata
    including temporal patterns and seasonality detection.
    """

    def __init__(self, df: pd.DataFrame, sample: dict, time_index: str):
        """
        Initialize TimeIndexAnalysis.

        Parameters:
        - df: time-indexed dataset
        - sample: sampling configuration
        - time_index: name of time index column
        """
```

**Usage Example:**

```python
from ydata_profiling import ProfileReport

# df is assumed to be a pandas DataFrame with a 'timestamp' column
report = ProfileReport(df, tsmode=True, sortby='timestamp')
description = report.get_description()

# Access analysis metadata
analysis = description.analysis
print(f"Analysis duration: {analysis.duration}s")
print(f"Analysis started: {analysis.date_start}")

# For time series analysis
if hasattr(analysis, 'time_index'):
    print(f"Time index column: {analysis.time_index}")
```
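
The relationship between `date_start`, `date_end`, and `duration` can be sketched with a small wrapper that timestamps any task. `RunMetadata` and `run_with_metadata` are hypothetical names used only for this illustration, and the duration is expressed in seconds as an assumption that follows the `float` annotation above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class RunMetadata:
    """Hypothetical record of when a task started and finished."""
    title: str
    date_start: datetime
    date_end: datetime

    @property
    def duration(self) -> float:
        """Elapsed wall-clock time in seconds."""
        return (self.date_end - self.date_start).total_seconds()


def run_with_metadata(title: str, task: Callable[[], T]) -> Tuple[T, RunMetadata]:
    """Run the task and record timestamps immediately before and after it."""
    date_start = datetime.now()
    result = task()
    date_end = datetime.now()
    return result, RunMetadata(title, date_start, date_end)


total, metadata = run_with_metadata("demo", lambda: sum(range(1000)))
print(f"{metadata.title}: result={total}, duration={metadata.duration:.6f}s")
```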