# Statistical Transformations

Statistical transformations (stats) transform data before it is drawn, through operations such as binning, density estimation, smoothing, and statistical summaries. Each stat computes new variables (for example `count` or `density`) that can be mapped to aesthetics with `after_stat`, so a plot can show derived quantities rather than only the raw data. The variables each stat provides are listed under "Computed aesthetics" in its entry.

## Capabilities

### Identity and Counting Stats

Basic transformations including pass-through data and counting operations.

```python { .api }
def stat_identity(mapping=None, data=None, **kwargs):
    """
    Identity transformation (no change to data).

    Use this when you want to plot data as-is without any statistical transformation.
    """

def stat_count(mapping=None, data=None, **kwargs):
    """
    Count the number of observations at each x position.

    Required aesthetics: x
    Optional aesthetics: weight

    Computed aesthetics:
    - count: number of observations
    - prop: proportion of total observations
    """

def stat_sum(mapping=None, data=None, **kwargs):
    """
    Sum overlapping points and map sum to size.

    Required aesthetics: x, y
    Optional aesthetics: size, weight

    Computed aesthetics:
    - n: sum of weights (or count if no weights)
    - prop: proportion of total
    """

def stat_unique(mapping=None, data=None, **kwargs):
    """
    Remove duplicate rows in data.

    Useful for preventing overplotting when you have duplicate points.
    """
```
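
To make the computed variables of `stat_count` concrete, the sketch below reproduces `count` and `prop` in plain Python. It illustrates the arithmetic only and is not the library's implementation: the helper name `count_stat` is hypothetical, and `prop` is taken relative to all rows passed in, which is an assumption about how grouping is handled.

```python
from collections import defaultdict


def count_stat(xs, weights=None):
    """Return one row per distinct x with its (weighted) count and proportion.

    Hypothetical helper for illustration; not part of any plotting library.
    """
    if weights is None:
        weights = [1] * len(xs)  # unweighted: every observation counts once
    totals = defaultdict(float)
    for x, w in zip(xs, weights):
        totals[x] += w
    grand_total = sum(totals.values())
    return [
        {"x": x, "count": c, "prop": c / grand_total}
        for x, c in sorted(totals.items())
    ]


rows = count_stat(["a", "b", "a", "c", "a"])
for row in rows:
    print(row)
```

Passing `weights` turns the count into a sum of weights, which is the same idea the `weight` aesthetic expresses.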

### Binning and Histograms

Transform continuous data into discrete bins for histograms and related visualizations.

```python { .api }
def stat_bin(mapping=None, data=None, bins=30, binwidth=None, center=None,
             boundary=None, closed='right', pad=False, **kwargs):
    """
    Bin data for histograms.

    Required aesthetics: x
    Optional aesthetics: weight

    Parameters:
    - bins: int, number of bins
    - binwidth: float, width of bins
    - center: float, center of one bin
    - boundary: float, boundary of one bin
    - closed: str, which side of interval is closed ('right', 'left')
    - pad: bool, whether to pad bins

    Computed aesthetics:
    - count: number of observations in bin
    - density: density of observations
    - ncount: normalized count
    - ndensity: normalized density
    - width: bin width
    """

def stat_bin_2d(mapping=None, data=None, bins=30, binwidth=None, drop=True,
                **kwargs):
    """
    2D binning for heatmaps.

    Required aesthetics: x, y
    Optional aesthetics: weight

    Parameters:
    - bins: int or tuple, number of bins in each direction
    - binwidth: float or tuple, width of bins
    - drop: bool, whether to drop empty bins

    Computed aesthetics:
    - count: number of observations in bin
    - density: density of observations
    """

def stat_bin2d(mapping=None, data=None, bins=30, binwidth=None, drop=True,
               **kwargs):
    """
    2D binning for heatmaps - alternative name for stat_bin_2d.

    Required aesthetics: x, y
    Optional aesthetics: weight

    Parameters:
    - bins: int or tuple, number of bins in each direction
    - binwidth: float or tuple, width of bins
    - drop: bool, whether to drop empty bins

    Computed aesthetics:
    - count: number of observations in bin
    - density: density of observations
    """

def stat_bindot(mapping=None, data=None, binaxis='x', method='dotdensity',
                binwidth=None, **kwargs):
    """
    Bin data for dot plots.

    Required aesthetics: x

    Parameters:
    - binaxis: str, axis to bin along ('x', 'y')
    - method: str, binning method ('dotdensity', 'histodot')
    - binwidth: float, width of bins

    Computed aesthetics:
    - count: number of observations in bin
    - binwidth: width of bin
    """
```
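
The relationship between the variables computed by `stat_bin` can be sketched in a few lines. This is a simplified illustration, not the library's algorithm: the helper `bin_stat` is hypothetical, it uses equal-width, left-closed bins spanning the data range, and it ignores `binwidth`, `center`, `boundary`, `closed`, and `pad`.

```python
def bin_stat(xs, bins=4):
    """Equal-width binning returning count, density, ncount, ndensity, width.

    Hypothetical helper; assumes max(xs) > min(xs).
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        # The maximum value is placed in the last bin instead of opening a new one.
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    total = sum(counts)
    # density integrates to 1: sum(density * width) == 1
    densities = [c / (total * width) for c in counts]
    max_count, max_density = max(counts), max(densities)
    return [
        {
            "x": lo + (i + 0.5) * width,          # bin center
            "count": counts[i],
            "density": densities[i],
            "ncount": counts[i] / max_count,      # count scaled to a maximum of 1
            "ndensity": densities[i] / max_density,
            "width": width,
        }
        for i in range(bins)
    ]


binned = bin_stat([0, 1, 1, 2, 3, 5, 8, 8], bins=4)
for row in binned:
    print(row)
```

Mapping `density` instead of `count` to `y` therefore rescales bar heights so the total bar area is 1, which makes histograms with different sample sizes comparable.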

### Density Estimation

Compute smooth density estimates for continuous distributions.

```python { .api }
def stat_density(mapping=None, data=None, bw='nrd0', adjust=1, kernel='gaussian',
                 n=512, trim=False, **kwargs):
    """
    Compute smooth density estimates.

    Required aesthetics: x
    Optional aesthetics: weight

    Parameters:
    - bw: str or float, bandwidth selection method or value
    - adjust: float, bandwidth adjustment factor
    - kernel: str, kernel function ('gaussian', 'epanechnikov', etc.)
    - n: int, number of evaluation points
    - trim: bool, whether to trim density to data range

    Computed aesthetics:
    - density: density estimate
    - count: density * number of observations
    - scaled: density scaled to maximum of 1
    """

def stat_density_2d(mapping=None, data=None, **kwargs):
    """
    2D density estimation for contour plots.

    Required aesthetics: x, y
    Optional aesthetics: weight

    Computed aesthetics:
    - level: contour level
    - piece: contour piece identifier
    """

def stat_ydensity(mapping=None, data=None, **kwargs):
    """
    Density estimates for violin plots.

    Required aesthetics: x, y

    Computed aesthetics:
    - density: density estimate
    - scaled: density scaled within groups
    - count: density * number of observations
    - violinwidth: density scaled for violin width
    """
```
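
As a rough illustration of what a density stat computes, the sketch below evaluates a Gaussian kernel density estimate on an evenly spaced grid and derives `count` and `scaled` from `density`. The helper `gaussian_kde_stat` is hypothetical; it takes a fixed numeric bandwidth (no `'nrd0'`-style selection), supports no weights, and evaluates only over the data range.

```python
import math


def gaussian_kde_stat(xs, bandwidth, n=5):
    """Gaussian KDE evaluated at n evenly spaced points over the data range.

    Hypothetical helper; assumes n >= 2 and bandwidth > 0.
    """
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Normalising constant of the averaged Gaussian kernels.
    norm = 1.0 / (len(xs) * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in xs)
        for g in grid
    ]
    peak = max(density)
    return [
        {
            "x": g,
            "density": d,
            "count": d * len(xs),   # density * number of observations
            "scaled": d / peak,     # density scaled to a maximum of 1
        }
        for g, d in zip(grid, density)
    ]


kde_rows = gaussian_kde_stat([-1.0, 0.0, 1.0], bandwidth=0.5, n=5)
for row in kde_rows:
    print(row)
```

A larger bandwidth (or, in the API above, a larger `adjust`) gives a smoother and flatter curve; a smaller one follows individual observations more closely.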

### Smoothing and Trend Lines

Fit smooth curves and trend lines to data.

```python { .api }
def stat_smooth(mapping=None, data=None, method='auto', formula=None, se=True,
                n=80, span=0.75, level=0.95, **kwargs):
    """
    Compute smoothed conditional means.

    Required aesthetics: x, y
    Optional aesthetics: weight

    Parameters:
    - method: str, smoothing method ('auto', 'lm', 'glm', 'gam', 'loess')
    - formula: str, model formula (for 'lm', 'glm', 'gam')
    - se: bool, whether to compute confidence interval
    - n: int, number of points to evaluate
    - span: float, smoothing span (for 'loess')
    - level: float, confidence level

    Computed aesthetics:
    - y: predicted values
    - ymin, ymax: confidence interval bounds (if se=True)
    - se: standard errors
    """

def stat_quantile(mapping=None, data=None, quantiles=None, formula=None,
                  **kwargs):
    """
    Compute quantile regression lines.

    Required aesthetics: x, y
    Optional aesthetics: weight

    Parameters:
    - quantiles: list, quantiles to compute (default: [0.25, 0.5, 0.75])
    - formula: str, model formula

    Computed aesthetics:
    - quantile: quantile level
    """
```
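
For the simplest case, a straight-line fit like `method='lm'`, the smoothing step can be sketched as an ordinary least-squares fit followed by prediction at `n` evenly spaced x values. The helper `linear_smooth` is hypothetical and deliberately minimal: it computes only the predicted `y` and omits weights, the `formula` argument, and the confidence band (`ymin`, `ymax`, `se`).

```python
def linear_smooth(xs, ys, n=5):
    """Fit y = intercept + slope * x by least squares and predict on a grid.

    Hypothetical helper; assumes n >= 2 and at least two distinct x values.
    """
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # One output row per evaluation point, not per input observation.
    return [{"x": g, "y": intercept + slope * g} for g in grid]


fit = linear_smooth([0, 1, 2, 3], [1, 3, 5, 7], n=4)
for row in fit:
    print(row)
```

The point to notice is that the output has `n` rows regardless of how many observations went in, which is why the drawn line looks equally smooth for small and large datasets.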

### Box Plot and Summary Statistics

Compute statistical summaries for box plots and related visualizations.

```python { .api }
def stat_boxplot(mapping=None, data=None, coef=1.5, **kwargs):
    """
    Compute box plot statistics.

    Required aesthetics: x or y (one discrete, one continuous)
    Optional aesthetics: weight

    Parameters:
    - coef: float, multiplier of the interquartile range used to set whisker
      length; points beyond the whiskers are treated as outliers

    Computed aesthetics:
    - lower: lower hinge (25th percentile)
    - upper: upper hinge (75th percentile)
    - middle: median (50th percentile)
    - ymin: lower whisker
    - ymax: upper whisker
    - outliers: outlier values
    """

def stat_summary(mapping=None, data=None, fun_data=None, fun_y=None,
                 fun_ymax=None, fun_ymin=None, **kwargs):
    """
    Summarize y values at each x.

    Required aesthetics: x, y

    Parameters:
    - fun_data: function, takes the y values and returns a one-row DataFrame
      with columns y, ymin, ymax
    - fun_y: function, compute y summary
    - fun_ymax, fun_ymin: functions, compute y range

    Computed aesthetics depend on functions used:
    - y: summary statistic
    - ymin, ymax: range statistics (if computed)
    """

def stat_summary_bin(mapping=None, data=None, bins=30, **kwargs):
    """
    Summarize y values in bins of x.

    Required aesthetics: x, y

    Parameters:
    - bins: int, number of bins
    - fun_data, fun_y, fun_ymax, fun_ymin: summary functions

    Computed aesthetics:
    - x: bin centers
    - y: summary statistic
    - ymin, ymax: range statistics (if computed)
    """
```
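
The box plot variables and the role of `coef` can be illustrated with the standard library. This is a sketch under stated assumptions rather than the library's code: the helper `boxplot_stat` is hypothetical, and the hinges use the "inclusive" quartile method of `statistics.quantiles`, whereas plotting libraries may define hinges slightly differently.

```python
import statistics


def boxplot_stat(ys, coef=1.5):
    """Five-number box plot summary plus outliers.

    Hypothetical helper; whiskers extend to the most extreme values that lie
    within coef * IQR of the hinges.
    """
    lower, middle, upper = statistics.quantiles(ys, n=4, method="inclusive")
    iqr = upper - lower
    low_fence = lower - coef * iqr
    high_fence = upper + coef * iqr
    inside = [y for y in ys if low_fence <= y <= high_fence]
    return {
        "lower": lower,
        "middle": middle,
        "upper": upper,
        "ymin": min(inside),   # end of the lower whisker
        "ymax": max(inside),   # end of the upper whisker
        "outliers": [y for y in ys if y < low_fence or y > high_fence],
    }


box = boxplot_stat([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(box)
```

In this sample the value 30 lies above the upper fence (7.75 + 1.5 × 4.5 = 14.5), so it is reported as an outlier and the upper whisker stops at 9. Raising `coef` widens the fences and reclassifies such points as part of the whisker.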

### Geometric and Spatial Stats

Compute geometric transformations and spatial statistics.

```python { .api }
def stat_hull(mapping=None, data=None, **kwargs):
    """
    Compute convex hull of points.

    Required aesthetics: x, y
    Optional aesthetics: group

    Returns hull vertices in order for drawing polygon.
    """

def stat_ellipse(mapping=None, data=None, type='t', level=0.95, segments=51,
                 **kwargs):
    """
    Compute confidence ellipses.

    Required aesthetics: x, y

    Parameters:
    - type: str, ellipse type ('t', 'norm', 'euclid')
    - level: float, confidence level
    - segments: int, number of points in ellipse

    Computed aesthetics:
    - x, y: ellipse boundary points
    """

def stat_sina(mapping=None, data=None, **kwargs):
    """
    Compute sina plot positions (jittered violin).

    Required aesthetics: x, y

    Positions points based on local density to create violin-like shape
    with individual points visible.
    """
```
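
To show what "hull vertices in order" means, here is a small convex hull sketch using Andrew's monotone chain algorithm. It is one common way to compute a hull and is not claimed to be the method the library uses; the function name `convex_hull` is hypothetical.

```python
def convex_hull(points):
    """Return convex hull vertices in counter-clockwise order.

    Hypothetical helper using Andrew's monotone chain; points are (x, y) tuples.
    Interior and collinear points are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 when o -> a -> b makes a counter-clockwise (left) turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
print(hull)
```

Because the vertices come back in traversal order, they can be handed directly to a polygon or path geom without re-sorting.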

### Distribution and Probability Stats

Work with probability distributions and cumulative distributions.

```python { .api }
def stat_ecdf(mapping=None, data=None, n=None, pad=True, **kwargs):
    """
    Compute empirical cumulative distribution function.

    Required aesthetics: x

    Parameters:
    - n: int, number of points to evaluate (default: use all data points)
    - pad: bool, whether to pad with additional points

    Computed aesthetics:
    - y: cumulative probability
    """

def stat_qq(mapping=None, data=None, distribution='norm', dparams=None, **kwargs):
    """
    Compute quantile-quantile plot statistics.

    Required aesthetics: sample

    Parameters:
    - distribution: str or scipy distribution, theoretical distribution
    - dparams: tuple, distribution parameters

    Computed aesthetics:
    - theoretical: theoretical quantiles
    - sample: sample quantiles
    """

def stat_qq_line(mapping=None, data=None, distribution='norm', dparams=None,
                 **kwargs):
    """
    Compute reference line for Q-Q plots.

    Required aesthetics: sample

    Parameters:
    - distribution: str or scipy distribution, theoretical distribution
    - dparams: tuple, distribution parameters

    Computed aesthetics:
    - slope, intercept: line parameters
    """
```
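
The quantities behind an ECDF and a normal Q-Q plot can be sketched with the standard library. Both helpers (`ecdf_stat`, `qq_stat`) are hypothetical. The Q-Q sketch assumes a standard normal reference distribution and plotting positions of `(i + 0.5) / n`; real implementations may use a different plotting-position formula, and the ECDF sketch adds no padding points.

```python
from statistics import NormalDist


def ecdf_stat(xs):
    """Empirical CDF: y is the fraction of observations <= each sorted x."""
    n = len(xs)
    return [{"x": x, "y": (i + 1) / n} for i, x in enumerate(sorted(xs))]


def qq_stat(sample):
    """Pair sorted sample values with standard normal quantiles.

    Assumes plotting positions (i + 0.5) / n, which is one common convention.
    """
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        {"theoretical": std_normal.inv_cdf((i + 0.5) / n), "sample": s}
        for i, s in enumerate(sorted(sample))
    ]


ecdf_rows = ecdf_stat([3, 1, 2])
qq_rows = qq_stat([5, 1, 3])
print(ecdf_rows)
print(qq_rows)
```

If the sample follows the reference distribution, the `(theoretical, sample)` pairs fall close to a straight line, which is the line that `stat_qq_line` describes through `slope` and `intercept`.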

### Function and Point Density Stats

Evaluate functions and compute point densities.

```python { .api }
def stat_function(mapping=None, data=None, fun=None, xlim=None, n=101,
                  args=None, **kwargs):
    """
    Evaluate and plot functions.

    Parameters:
    - fun: function, function to evaluate
    - xlim: tuple, x range to evaluate over
    - n: int, number of points to evaluate
    - args: tuple, additional arguments to function

    Computed aesthetics:
    - x: evaluation points
    - y: function values
    """

def stat_pointdensity(mapping=None, data=None, **kwargs):
    """
    Compute local point density.

    Required aesthetics: x, y

    Computed aesthetics:
    - density: local point density
    - ndensity: normalized density
    """
```
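
The way `fun`, `xlim`, `n`, and `args` interact in a function stat can be sketched as follows. The helper `function_stat` is hypothetical and only mirrors the parameter names documented above; it assumes `args` are passed positionally after `x`.

```python
def function_stat(fun, xlim, n=101, args=()):
    """Evaluate fun at n evenly spaced points across xlim.

    Hypothetical helper; assumes n >= 2 and that fun accepts (x, *args).
    """
    lo, hi = xlim
    step = (hi - lo) / (n - 1)
    rows = []
    for i in range(n):
        x = lo + i * step
        rows.append({"x": x, "y": fun(x, *args)})
    return rows


def scaled_square(x, k):
    """Example function: k * x^2."""
    return k * x * x


curve = function_stat(scaled_square, xlim=(0, 2), n=5, args=(3,))
for row in curve:
    print(row)
```

Increasing `n` gives a smoother drawn curve at the cost of more evaluations; for strongly curved functions the default of 101 points may need to be raised.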

## Usage Patterns

### Using Computed Aesthetics

```python
from plotnine import ggplot, aes, after_stat, geom_histogram, geom_point

# `data` is assumed to be a pandas DataFrame with the columns referenced below.

# Map fill to computed count in histogram
ggplot(data, aes(x='value')) + \
    geom_histogram(aes(fill=after_stat('count')), stat='bin', bins=20)

# Use density instead of count for histogram
ggplot(data, aes(x='value')) + \
    geom_histogram(aes(y=after_stat('density')), stat='bin', bins=20)

# Color points by local density
ggplot(data, aes(x='x', y='y')) + \
    geom_point(aes(color=after_stat('density')), stat='pointdensity')
```

### Custom Statistical Summaries

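```python
import numpy as np
import pandas as pd
from plotnine import ggplot, aes, stat_summary

# Custom summary function: mean plus/minus one standard error.
# Returns a one-row DataFrame with y, ymin, ymax columns.
def mean_se(x):
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(len(x))  # sample standard deviation
    return pd.DataFrame({'y': [m], 'ymin': [m - se], 'ymax': [m + se]})

ggplot(data, aes(x='group', y='value')) + \
    stat_summary(fun_data=mean_se, geom='pointrange')
```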

### Combining Stats with Geoms

```python
from plotnine import ggplot, aes, geom_point, geom_rug, stat_density, stat_smooth

# Density curve with rug plot
ggplot(data, aes(x='value')) + \
    stat_density(geom='line') + \
    geom_rug(sides='b')

# Smooth with confidence band
ggplot(data, aes(x='x', y='y')) + \
    geom_point(alpha=0.5) + \
    stat_smooth(method='lm', se=True)
```