# Core Analysis Functions

Primary functions for creating exploratory data analysis reports. These functions analyze pandas DataFrames and return DataframeReport objects containing comprehensive statistics, visualizations, and association matrices.

## Capabilities

### Single DataFrame Analysis

Analyzes a single DataFrame, generating comprehensive statistics, visualizations, and feature relationships. Optionally focuses analysis around a target feature to highlight correlations and associations.

```python { .api }
def analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
            target_feat: str = None,
            feat_cfg: FeatureConfig = None,
            pairwise_analysis: str = 'auto') -> DataframeReport:
    """
    Analyze a single DataFrame and generate a report.

    Parameters:
    - source: DataFrame to analyze, or tuple of [DataFrame, "Display Name"]
    - target_feat: Name of target feature for focused analysis (boolean/numerical only)
    - feat_cfg: FeatureConfig object for controlling feature processing
    - pairwise_analysis: Controls correlation analysis ('auto', 'on', 'off')

    Returns:
    DataframeReport object containing analysis results
    """
```

#### Usage Examples

```python
import sweetviz as sv
import pandas as pd

# Basic analysis
df = pd.read_csv('data.csv')
report = sv.analyze(df)

# Analysis with named dataset
report = sv.analyze([df, "My Dataset"])

# Target-focused analysis
report = sv.analyze(df, target_feat='outcome')

# With feature configuration
config = sv.FeatureConfig(skip=['id'], force_cat=['category'])
report = sv.analyze(df, target_feat='price', feat_cfg=config)

# Control pairwise analysis for large datasets
report = sv.analyze(df, pairwise_analysis='off')  # Skip correlation matrix
```

### Dataset Comparison

Compares two datasets side-by-side, highlighting differences in distributions, statistics, and feature relationships. Ideal for comparing training/test splits or different data versions.

```python { .api }
def compare(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
            compare: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
            target_feat: str = None,
            feat_cfg: FeatureConfig = None,
            pairwise_analysis: str = 'auto') -> DataframeReport:
    """
    Compare two DataFrames and generate a comparison report.

    Parameters:
    - source: Primary DataFrame or [DataFrame, "Display Name"]
    - compare: Comparison DataFrame or [DataFrame, "Display Name"]
    - target_feat: Name of target feature for focused analysis (boolean/numerical only)
    - feat_cfg: FeatureConfig object for controlling feature processing
    - pairwise_analysis: Controls correlation analysis ('auto', 'on', 'off')

    Returns:
    DataframeReport object containing comparison results
    """
```

#### Usage Examples

```python
import sweetviz as sv
import pandas as pd

# Compare training and test sets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

report = sv.compare([train_df, "Training"], [test_df, "Test"])

# Compare with target analysis
report = sv.compare([train_df, "Training"], [test_df, "Test"], target_feat='label')

# Compare two versions of the same dataset
old_data = pd.read_csv('old.csv')
new_data = pd.read_csv('new.csv')
report = sv.compare([old_data, "Previous Version"], [new_data, "Current Version"])
```

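Before comparing two frames, it can help to confirm that they share the columns you expect. The sketch below is a pre-check written with plain pandas and made-up data; `column_overlap` is a hypothetical helper, not part of the library, and it assumes nothing about how `compare` itself treats columns that exist in only one frame.

```python
import pandas as pd


def column_overlap(source_df: pd.DataFrame, compare_df: pd.DataFrame) -> dict:
    """Hypothetical helper: report which columns the two frames share."""
    source_cols = set(source_df.columns)
    compare_cols = set(compare_df.columns)
    return {
        "shared": sorted(source_cols & compare_cols),
        "only_in_source": sorted(source_cols - compare_cols),
        "only_in_compare": sorted(compare_cols - source_cols),
    }


# Tiny stand-ins for a training/test split (made-up values)
train_df = pd.DataFrame({"age": [25, 40], "income": [30_000, 52_000], "label": [0, 1]})
test_df = pd.DataFrame({"age": [31], "income": [41_000]})

overlap = column_overlap(train_df, test_df)
print(overlap)
```

Here the `label` column exists only in the training frame, which is worth knowing before passing it as `target_feat`.
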
### Intra-Dataset Comparison

Compares subsets within the same DataFrame based on a boolean condition. Useful for analyzing differences between groups (e.g., male vs female, treatment vs control).

```python { .api }
def compare_intra(source_df: pd.DataFrame,
                  condition_series: pd.Series,
                  names: Tuple[str, str],
                  target_feat: str = None,
                  feat_cfg: FeatureConfig = None,
                  pairwise_analysis: str = 'auto') -> DataframeReport:
    """
    Compare subsets within the same DataFrame based on a boolean condition.

    Parameters:
    - source_df: DataFrame to analyze
    - condition_series: Boolean Series for splitting data (same length as source_df)
    - names: Tuple of names for (True subset, False subset)
    - target_feat: Name of target feature for focused analysis (boolean/numerical only)
    - feat_cfg: FeatureConfig object for controlling feature processing
    - pairwise_analysis: Controls correlation analysis ('auto', 'on', 'off')

    Returns:
    DataframeReport object containing intra-dataset comparison

    Raises:
    ValueError: If condition_series length doesn't match source_df or isn't boolean type
    ValueError: If either subset is empty after splitting
    """
```

#### Usage Examples

```python
import sweetviz as sv
import pandas as pd

# Compare male vs female
df = pd.read_csv('data.csv')
report = sv.compare_intra(df, df["gender"] == "male", ["Male", "Female"])

# Compare with target feature
report = sv.compare_intra(df,
                          df["age"] > 30,
                          ["Over 30", "30 and Under"],
                          target_feat="income")

# Compare treatment groups
report = sv.compare_intra(df,
                          df["treatment"] == "A",
                          ["Treatment A", "Treatment B"],
                          target_feat="outcome")

# Condition computed from the data and stored in a variable
high_income = (df["income"] > df["income"].median())
report = sv.compare_intra(df, high_income, ["High Income", "Low Income"])
```

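Keep in mind that the second name labels every row where the condition is False, not one specific other category. The pandas-only sketch below uses made-up data to show that a mask such as `df["treatment"] == "A"` sends rows with any other value, and rows with a missing value, to the False side. `split_sizes` is a hypothetical helper for illustration, not part of the library.

```python
import pandas as pd


def split_sizes(condition: pd.Series) -> tuple:
    """Hypothetical helper: row counts for the (True, False) subsets."""
    true_count = int(condition.sum())
    return true_count, len(condition) - true_count


# Made-up data: three treatment arms plus one missing value
df = pd.DataFrame({"treatment": ["A", "B", "A", "C", None, "B"]})

mask = df["treatment"] == "A"
print(mask.dtype)         # bool
print(split_sizes(mask))  # (2, 4)

# The False side holds the "B" rows, the "C" row, and the missing value
print(df.loc[~mask, "treatment"].tolist())
```

If only two specific groups should be compared, one option is to filter the DataFrame down to those groups first and build the condition from the filtered frame, so that both still have the same length.
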
## Parameter Details

### target_feat Parameter

- **Supported Types**: Only boolean and numerical features can be targets
- **Effect**: Highlights correlations and associations with the target feature
- **Categorical Targets**: Not supported; if the feature can sensibly be treated as a number, use FeatureConfig (`force_num`) to force it to numerical
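
A quick way to see which columns are plausible targets is to check their pandas dtypes up front. This sketch uses `pandas.api.types` on made-up data; `candidate_targets` is a hypothetical helper, and the library's own type inference may classify a column differently (for example, a numeric column with only a few distinct values), so treat the result as a first filter only.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def candidate_targets(df: pd.DataFrame) -> list:
    """Hypothetical helper: columns whose dtype is boolean or numeric."""
    return [
        name for name in df.columns
        if is_bool_dtype(df[name]) or is_numeric_dtype(df[name])
    ]


# Made-up data with one numeric, one boolean, and one string column
df = pd.DataFrame({
    "price": [9.5, 12.0, 7.25],
    "sold": [True, False, True],
    "category": ["a", "b", "a"],
})

print(candidate_targets(df))  # ['price', 'sold']
```
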
158
159
### pairwise_analysis Parameter
160
161
- **'auto'** (default): Automatically decides based on dataset size (uses association_auto_threshold)
162
- **'on'**: Forces pairwise analysis regardless of dataset size
163
- **'off'**: Skips pairwise correlation/association analysis
164
- **Performance**: Correlation analysis is O(n²) in number of features
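
To get a feel for the quadratic growth, count the distinct feature pairs: n features give n(n-1)/2 pairs. The sketch below only illustrates that arithmetic; it does not reproduce how the library schedules or counts its association computations, and `feature_pairs` is a hypothetical name.

```python
from math import comb


def feature_pairs(n_features: int) -> int:
    """Number of distinct unordered feature pairs: n * (n - 1) / 2."""
    return comb(n_features, 2)


for n in (10, 50, 200):
    print(n, feature_pairs(n))
# 10 45
# 50 1225
# 200 19900
```

Going from 50 to 200 features multiplies the number of pairs by roughly 16, which is why `'off'` can be worth considering for wide datasets.
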

### feat_cfg Parameter

See [Configuration](./configuration.md) for detailed FeatureConfig usage.

## Error Handling

All analysis functions may raise:

- **ValueError**: Invalid parameters, unsupported target types, empty datasets
- **TypeError**: Invalid data types for parameters
- **KeyError**: Target feature not found in DataFrame
- **MemoryError**: Dataset too large for available memory

Common errors and solutions:

```python
# Assumes `sv` and `df` are defined as in the usage examples above

# Handle missing target feature
try:
    report = sv.analyze(df, target_feat='nonexistent')
except KeyError:
    print("Target feature not found in DataFrame")

# Handle categorical target
try:
    report = sv.analyze(df, target_feat='category')
except ValueError as e:
    if "CATEGORICAL" in str(e):
        # Force to numerical if appropriate
        config = sv.FeatureConfig(force_num=['category'])
        report = sv.analyze(df, target_feat='category', feat_cfg=config)
    else:
        # Re-raise unrelated ValueErrors instead of silently swallowing them
        raise
```