
# YData Profiling

A comprehensive Python library that provides one-line Exploratory Data Analysis (EDA) for pandas DataFrames. YData Profiling generates detailed profile reports with statistical analysis, data quality alerts, correlations, missing data patterns, and interactive visualizations, turning hours of manual exploration into automated, publication-ready reports.

## Package Information

- **Package Name**: ydata-profiling
- **Language**: Python
- **Installation**: `pip install ydata-profiling`
- **Backward Compatibility**: Available as `pandas-profiling` (deprecated)

## Core Imports

```python
from ydata_profiling import ProfileReport
```

Common imports for advanced usage:

```python
from ydata_profiling import ProfileReport, compare, __version__
from ydata_profiling.config import Settings, SparkSettings
```

## Basic Usage

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load your data
df = pd.read_csv('your_data.csv')

# Generate comprehensive report with one line
report = ProfileReport(df, title='Dataset Analysis Report')

# Export report
report.to_file('data_report.html')

# Display in Jupyter notebook
report.to_notebook_iframe()

# Get interactive widgets
report.to_widgets()
```

## Architecture

YData Profiling uses a modular architecture for extensible data analysis:

- **ProfileReport**: Main orchestrator class managing the analysis pipeline and report generation
- **Summarizers**: Statistical computation engines (pandas-based, Spark-compatible)
- **Type System**: Intelligent data type inference using visions library integration
- **Configuration System**: Comprehensive settings for customizing analysis depth and output
- **Report Generation**: Multi-format output system (HTML, JSON, widgets) with templating
- **Backend Support**: Pandas and Spark DataFrame compatibility for scalable analysis

This design enables automated EDA workflows, integration with data pipelines, and customization for domain-specific analysis requirements across data science and analytics teams.

## Capabilities

### Core Profiling

Primary functionality for generating comprehensive data profile reports from DataFrames, including statistical analysis, data quality assessment, and automated report generation.

```python { .api }
class ProfileReport:
    def __init__(
        self,
        df: Optional[Union[pd.DataFrame, sDataFrame]] = None,
        minimal: bool = False,
        tsmode: bool = False,
        sortby: Optional[str] = None,
        sensitive: bool = False,
        explorative: bool = False,
        sample: Optional[dict] = None,
        config_file: Optional[Union[Path, str]] = None,
        lazy: bool = True,
        typeset: Optional[VisionsTypeset] = None,
        summarizer: Optional[BaseSummarizer] = None,
        config: Optional[Settings] = None,
        type_schema: Optional[dict] = None,
        **kwargs
    ): ...

    def to_file(self, output_file: Union[str, Path], silent: bool = True): ...
    def to_html(self) -> str: ...
    def to_json(self) -> str: ...
    def to_notebook_iframe(self): ...
    def to_widgets(self): ...
```

[Core Profiling](./core-profiling.md)

### Report Comparison

Compare multiple data profiling reports to identify differences, changes over time, or variations between datasets.

```python { .api }
def compare(
    reports: Union[List[ProfileReport], List[BaseDescription]],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport: ...
```

[Report Comparison](./report-comparison.md)

### Configuration System

Comprehensive configuration system for customizing analysis depth, statistical computations, visualizations, and report output formats.

```python { .api }
class Settings:
    def __init__(self, **kwargs): ...

class SparkSettings:
    def __init__(self, **kwargs): ...

class Config:
    @staticmethod
    def get_config() -> Settings: ...
```

[Configuration](./configuration.md)

### Data Analysis Components

Detailed statistical analysis components including correlation analysis, missing data patterns, duplicate detection, and specialized analysis for different data types.

```python { .api }
class BaseDescription: ...
class BaseSummarizer: ...
class ProfilingSummarizer: ...

def format_summary(description: BaseDescription) -> dict: ...
```

[Analysis Components](./analysis-components.md)

### Pandas Integration

Direct integration with pandas: importing the package monkey patches pandas, adding a `profile_report()` method to every DataFrame.

```python { .api }
def profile_report(
    self,
    minimal: bool = False,
    tsmode: bool = False,
    sortby: Optional[str] = None,
    sensitive: bool = False,
    explorative: bool = False,
    **kwargs
) -> ProfileReport: ...
```

[Pandas Integration](./pandas-integration.md)

### Serialization and Persistence

Save and load ProfileReport objects for reuse, storage, and sharing across sessions.

```python { .api }
def dumps(self) -> bytes: ...
def loads(data: bytes) -> Union['ProfileReport', 'SerializeReport']: ...
def dump(self, output_file: Union[Path, str]) -> None: ...
def load(load_file: Union[Path, str]) -> Union['ProfileReport', 'SerializeReport']: ...
```

**Capabilities:** Report serialization, persistent storage, cross-session report sharing, and efficient report caching for large datasets.

### Great Expectations Integration

Generate data validation expectations directly from profiling results for ongoing data quality monitoring.

```python { .api }
def to_expectation_suite(
    self,
    suite_name: Optional[str] = None,
    data_context: Optional[Any] = None,
    save_suite: bool = True,
    run_validation: bool = True,
    build_data_docs: bool = True,
    handler: Optional[Handler] = None
) -> Any: ...
```

**Capabilities:** Automated expectation generation, data validation pipeline integration, and continuous data quality monitoring.

### Version and Package Information

Access package version and metadata for compatibility and debugging purposes.

```python { .api }
__version__: str  # Package version string
```

**Usage:** Version checking, compatibility validation, and debugging support.

### Console Interface

Command-line interface for generating profiling reports directly from CSV files without writing Python code.

```bash { .api }
ydata_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE
```

**Capabilities:** Direct CSV profiling, automated report generation, CI/CD pipeline integration, and shell script automation.

[Console Interface](./console-interface.md)

## Types

```python { .api }
from typing import Optional, Union, List, Dict, Any
from pathlib import Path
import pandas as pd

# Core DataFrame types
try:
    from pyspark.sql import DataFrame as sDataFrame
except ImportError:
    from typing import TypeVar
    sDataFrame = TypeVar("sDataFrame")

# Configuration types
class Settings:
    dataset: DatasetConfig
    variables: VariablesConfig
    correlations: CorrelationsConfig
    plot: PlotConfig
    html: HtmlConfig
    style: StyleConfig

class SparkSettings(Settings):
    """Specialized Settings for Spark DataFrames with performance optimizations"""
    pass

# Analysis result types
class BaseDescription:
    """Complete dataset description with analysis results"""
    pass

class BaseAnalysis:
    """Base analysis metadata"""
    pass

# Summarizer types
class BaseSummarizer:
    """Base statistical summarizer interface"""
    pass

class ProfilingSummarizer(BaseSummarizer):
    """Default profiling summarizer implementation"""
    pass

# Alert system types
from enum import Enum

class AlertType(Enum):
    """Types of data quality alerts"""
    pass

class Alert:
    """Individual data quality alert"""
    pass
```