
# Core Profiling

Primary functionality for generating comprehensive data profile reports from DataFrames, including statistical analysis, data quality assessment, and automated report generation with customizable analysis depth and output formats.

## Capabilities

### ProfileReport Class

Main class for creating comprehensive data profiling reports from pandas or Spark DataFrames with extensive customization options.

```python { .api }
class ProfileReport:
    def __init__(
        self,
        df: Optional[Union[pd.DataFrame, sDataFrame]] = None,
        minimal: bool = False,
        tsmode: bool = False,
        sortby: Optional[str] = None,
        sensitive: bool = False,
        explorative: bool = False,
        sample: Optional[dict] = None,
        config_file: Optional[Union[Path, str]] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        type_schema: Optional[dict] = None,
        **kwargs
    ):
        """
        Generate a ProfileReport based on a pandas or spark.sql DataFrame.

        Parameters:
        - df: pandas or spark.sql DataFrame to analyze
        - minimal: use minimal computation mode for faster processing
        - tsmode: activate time-series analysis for numerical variables
        - sortby: column name to sort dataset by (for time-series mode)
        - sensitive: hide values for categorical/text variables for privacy
        - explorative: enable additional analysis features
        - sample: sampling configuration dictionary
        - config_file: path to YAML configuration file
        - lazy: defer computation until report generation
        - typeset: custom visions typeset for type inference
        - summarizer: custom statistical summarizer
        - config: Settings object for configuration
        - type_schema: manual type specification dictionary
        - **kwargs: additional configuration parameters
        """
```

**Usage Example:**

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Basic usage
df = pd.read_csv('data.csv')
report = ProfileReport(df, title="My Dataset Report")

# Minimal mode for large datasets
report = ProfileReport(df, minimal=True)

# Time-series analysis
report = ProfileReport(df, tsmode=True, sortby='timestamp')

# Custom configuration
report = ProfileReport(
    df,
    explorative=True,
    sensitive=False,
    title="Detailed Analysis",
    pool_size=4
)
```
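
When one pipeline profiles datasets of very different sizes, the mode flags shown above can be chosen programmatically. The following is a minimal sketch, not part of the library: `choose_profile_kwargs` and its `max_cells` threshold are hypothetical names and values picked for illustration. Only `minimal`, `explorative`, and `title` come from the signature and example above.

```python
from typing import Any, Dict


def choose_profile_kwargs(
    n_rows: int,
    n_cols: int,
    title: str,
    max_cells: int = 5_000_000,  # hypothetical threshold, tune for your hardware
) -> Dict[str, Any]:
    """Build keyword arguments for ProfileReport based on dataset size."""
    large = n_rows * n_cols > max_cells
    return {
        "title": title,
        # Large datasets: fall back to the cheaper minimal mode.
        "minimal": large,
        # Small datasets: the extra explorative analyses are affordable.
        "explorative": not large,
    }


kwargs = choose_profile_kwargs(n_rows=1_000, n_cols=20, title="Small dataset")
print(kwargs)
```

The resulting dictionary can be unpacked into the constructor, for example `ProfileReport(df, **choose_profile_kwargs(len(df), df.shape[1], "My Dataset Report"))`.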

### Report Generation Methods

Methods for generating and exporting profiling reports in various formats.

```python { .api }
def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
    """
    Save the report to an HTML file.

    Parameters:
    - output_file: path where to save the report
    - silent: suppress progress information
    """

def to_html(self) -> str:
    """
    Generate HTML report content as string.

    Returns:
    Complete HTML report as string
    """

def to_json(self) -> str:
    """
    Generate JSON representation of the report.

    Returns:
    JSON string containing all analysis results
    """

def to_notebook_iframe(self) -> None:
    """
    Display the report in a Jupyter notebook iframe.
    """

def to_widgets(self) -> Any:
    """
    Generate interactive Jupyter widgets for the report.

    Returns:
    Widget object for interactive exploration
    """
```

**Usage Example:**

```python
# Generate report
report = ProfileReport(df)

# Export to HTML file
report.to_file("my_report.html")

# Get HTML content as string
html_content = report.to_html()

# Get JSON representation
json_data = report.to_json()

# Display in Jupyter notebook
report.to_notebook_iframe()

# Create interactive widgets
widgets = report.to_widgets()
```
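
Because `to_json()` returns a string, its result can be inspected with the standard `json` module. The sketch below assumes only that the payload is a JSON object. The section names in the stand-in payload (`table`, `variables`) are illustrative assumptions rather than a documented schema, and `top_level_sections` is a hypothetical helper.

```python
import json
from typing import List


def top_level_sections(json_text: str) -> List[str]:
    """Return the sorted top-level keys of a JSON object given as text."""
    payload = json.loads(json_text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(payload)


# Stand-in for `report.to_json()`; the key names here are made up.
example_json = '{"table": {"n": 3}, "variables": {"age": {"type": "Numeric"}}}'
sections = top_level_sections(example_json)
print(sections)
```

With a real report, `top_level_sections(report.to_json())` would list whatever sections the installed version actually emits.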

### Data Access Methods

Methods for accessing underlying data and analysis results.

```python { .api }
def get_description(self) -> BaseDescription:
    """
    Get the complete dataset description with all analysis results.

    Returns:
    BaseDescription object containing statistical summaries,
    correlations, missing data patterns, and data quality alerts
    """

def get_duplicates(self) -> Optional[pd.DataFrame]:
    """
    Get duplicate rows from the dataset.

    Returns:
    DataFrame containing all duplicate rows, or None if no duplicates
    """

def get_sample(self) -> dict:
    """
    Get data samples from the dataset.

    Returns:
    Dictionary containing head, tail, and random samples
    """

def get_rejected_variables(self) -> set:
    """
    Get variables that were rejected during analysis.

    Returns:
    Set of column names that were rejected
    """
```

**Usage Example:**

```python
report = ProfileReport(df)

# Get complete analysis description
description = report.get_description()

# Access duplicate rows (get_duplicates() returns None when there are none)
duplicates = report.get_duplicates()
if duplicates is not None:
    print(f"Found {len(duplicates)} duplicate rows")

# Get data samples
samples = report.get_sample()
print("Sample data:", samples['head'])

# Check rejected variables
rejected = report.get_rejected_variables()
if rejected:
    print(f"Rejected variables: {rejected}")
```

### Report Management Methods

Methods for managing report state and comparisons.

```python { .api }
def invalidate_cache(self, subset: Optional[str] = None) -> None:
    """
    Clear cached analysis results to force recomputation.

    Parameters:
    - subset: cache subset to invalidate ("rendering", "report", or None for all)
    """

def compare(self, other: 'ProfileReport', config: Optional[Settings] = None) -> 'ProfileReport':
    """
    Compare this report with another ProfileReport.

    Parameters:
    - other: another ProfileReport to compare against
    - config: configuration for comparison analysis

    Returns:
    New ProfileReport containing comparison results
    """
```

**Usage Example:**

```python
# Create reports for two datasets
report1 = ProfileReport(df1, title="Dataset 1")
report2 = ProfileReport(df2, title="Dataset 2")

# Compare reports
comparison = report1.compare(report2)
comparison.to_file("comparison_report.html")

# Force recomputation
report1.invalidate_cache()
updated_html = report1.to_html()
```
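
The interaction between deferred computation (`lazy=True`) and `invalidate_cache()` can be pictured as a compute-on-first-access cache. The class below is a conceptual sketch of that pattern only. `LazyReport` and its attributes are hypothetical and do not reflect how the library is implemented internally.

```python
from typing import Callable, Optional


class LazyReport:
    """Toy model: compute a result on first access, cache it, allow invalidation."""

    def __init__(self, compute: Callable[[], str]) -> None:
        self._compute = compute
        self._cached: Optional[str] = None
        self.compute_calls = 0  # counts how often the expensive step ran

    def to_html(self) -> str:
        if self._cached is None:  # nothing cached yet: do the work now
            self.compute_calls += 1
            self._cached = self._compute()
        return self._cached

    def invalidate_cache(self) -> None:
        self._cached = None  # the next to_html() call recomputes


toy = LazyReport(lambda: "<html>report</html>")
toy.to_html()
toy.to_html()  # served from the cache
toy.invalidate_cache()
toy.to_html()  # recomputed after invalidation
print(toy.compute_calls)
```

In this toy model the expensive step runs once per invalidation rather than once per call, which is the behavior the "Force recomputation" step in the example relies on.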

### Properties

Key properties for accessing report components and metadata.

```python { .api }
@property
def typeset(self) -> VisionsTypeset:
    """Get the typeset used for data type inference."""

@property
def summarizer(self) -> BaseSummarizer:
    """Get the statistical summarizer used for analysis."""

@property
def description_set(self) -> BaseDescription:
    """Get the complete dataset description."""

@property
def df_hash(self) -> str:
    """Get hash of the source DataFrame."""

@property
def report(self) -> Root:
    """Get the report structure object."""

@property
def html(self) -> str:
    """Get HTML report content."""

@property
def json(self) -> str:
    """Get JSON report content."""

@property
def widgets(self) -> Any:
    """Get report widgets."""
```

**Usage Example:**

```python
report = ProfileReport(df)

# Access report properties
print(f"Report title: {report.config.title}")
print(f"DataFrame hash: {report.df_hash}")

# Access analysis components
typeset = report.typeset
summarizer = report.summarizer
description = report.description_set

# Get report content
html_report = report.html
json_report = report.json
```

### Serialization Methods

Methods for serializing and deserializing ProfileReport objects for storage and transmission.

```python { .api }
def dumps(self) -> bytes:
    """
    Serialize ProfileReport to bytes.

    Returns:
    Serialized ProfileReport as bytes
    """

def loads(data: bytes) -> Union['ProfileReport', 'SerializeReport']:
    """
    Deserialize ProfileReport from bytes.

    Parameters:
    - data: serialized ProfileReport bytes

    Returns:
    Deserialized ProfileReport instance
    """

def dump(self, output_file: Union[Path, str]) -> None:
    """
    Save serialized ProfileReport to file.

    Parameters:
    - output_file: path where to save the serialized report
    """

def load(load_file: Union[Path, str]) -> Union['ProfileReport', 'SerializeReport']:
    """
    Load ProfileReport from serialized file.

    Parameters:
    - load_file: path to serialized report file

    Returns:
    Loaded ProfileReport instance
    """
```

**Usage Example:**

```python
import pickle
from pathlib import Path

# Create and serialize report
report = ProfileReport(df, title="My Dataset")

# Serialize to bytes
serialized_bytes = report.dumps()

# Save to file
report.dump("my_report.pkl")

# Load from file
loaded_report = ProfileReport.load("my_report.pkl")

# Deserialize from bytes
restored_report = ProfileReport.loads(serialized_bytes)

# Use loaded report
restored_report.to_file("restored_report.html")
```
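
When serialized reports are stored or transmitted outside a trusted location, it is prudent to verify the bytes before deserializing them, particularly if the payload is pickle-based, as the `.pkl` extension in the example suggests (an assumption, not confirmed here). The sketch below signs and verifies an arbitrary byte payload with the standard `hmac` module. `sign`, `verify`, and `SECRET_KEY` are hypothetical names, and the payload is a stand-in for the output of `report.dumps()`.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for illustration


def sign(payload: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Return an HMAC-SHA256 tag for the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, tag: bytes, key: bytes = SECRET_KEY) -> bool:
    """Check the tag in constant time before trusting the payload."""
    return hmac.compare_digest(sign(payload, key), tag)


serialized_bytes = b"stand-in for report.dumps()"
tag = sign(serialized_bytes)

intact = verify(serialized_bytes, tag)
tampered = verify(serialized_bytes + b"x", tag)
print(intact, tampered)
```

Under this scheme the bytes would be passed to `loads` only after `verify` returns `True`.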

### Great Expectations Integration

Integration with Great Expectations for automated data validation and expectation suite generation.

```python { .api }
def to_expectation_suite(
    self,
    suite_name: Optional[str] = None,
    data_context: Optional[Any] = None,
    save_suite: bool = True,
    run_validation: bool = True,
    build_data_docs: bool = True,
    handler: Optional[Handler] = None
) -> Any:
    """
    Generate Great Expectations expectation suite from profiling results.

    Parameters:
    - suite_name: name for the expectation suite
    - data_context: Great Expectations data context
    - save_suite: whether to save the suite to the data context
    - run_validation: whether to run validation after creating suite
    - build_data_docs: whether to build data docs after suite creation
    - handler: custom handler for expectation generation

    Returns:
    Great Expectations expectation suite object
    """
```

**Usage Example:**

```python
import great_expectations as ge
from ydata_profiling import ProfileReport

# Create ProfileReport
report = ProfileReport(df, title="Data Validation")

# Generate Great Expectations suite
suite = report.to_expectation_suite(
    suite_name="my_dataset_expectations",
    save_suite=True,
    run_validation=True
)

# The suite can now be used for ongoing data validation
print(f"Created expectation suite with {len(suite.expectations)} expectations")
```