# Sample Datasets

Plotnine includes a collection of sample datasets, exposed through the `plotnine.data` module, that are commonly used in data visualization examples, tutorials, and exploration. They contain real-world data from domains including economics, biology, automotive engineering, and demographics.

## Import Patterns

```python
# Import specific datasets
from plotnine.data import mtcars, diamonds, economics

# Import all datasets (not recommended for production)
from plotnine.data import *

# Access via module reference
import plotnine.data as data
df = data.mtcars
```

## Capabilities

### Automotive Data

```python { .api }
# Motor Trend Car Road Tests - 32 automobiles (1973-74 models)
mtcars: pandas.DataFrame
# Columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

# Fuel economy data from 1999 and 2008 for 38 popular car models
mpg: pandas.DataFrame
# Columns: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class
```

**Usage Example:**
```python
from plotnine import ggplot, aes, geom_point
from plotnine.data import mtcars

# Scatter plot of weight vs mpg
plot = (ggplot(mtcars, aes(x='wt', y='mpg')) +
        geom_point())
```

### Jewelry and Precious Stones

```python { .api }
# Prices and attributes of ~54,000 diamonds
diamonds: pandas.DataFrame
# Columns: price, carat, cut, color, clarity, x, y, z, depth, table
# cut: Factor with levels Fair, Good, Very Good, Premium, Ideal
# color: Factor with levels D, E, F, G, H, I, J (D=best, J=worst)
# clarity: Factor with levels I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF (I1=worst, IF=best)
```

### Economic Data

```python { .api }
# US economic time series from FRED database
economics: pandas.DataFrame
# Columns: date, psavert, pce, unemploy, uempmed, pop

# US economic data in long format for easier visualization
economics_long: pandas.DataFrame
# Same data as economics but in tidy/long format
```

### Biological Data

```python { .api }
# Palmer Penguins - 3 species from Antarctica
penguins: pandas.DataFrame
# Columns: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year
# species: Factor with levels Adelie, Chinstrap, Gentoo

# Updated mammals sleep dataset
msleep: pandas.DataFrame
# Columns: name, genus, vore, order, conservation, sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt
```

### Geographic and Demographic Data

```python { .api }
# Midwest demographics by county
midwest: pandas.DataFrame
# Columns: PID, county, state, area, poptotal, popdensity, popwhite, popblack, etc.

# Texas housing market data from TAMU real estate center
txhousing: pandas.DataFrame
# Columns: city, year, month, sales, volume, median, listings, inventory, date
```

### Natural Phenomena

```python { .api }
# Old Faithful Geyser eruption data
faithful: pandas.DataFrame
# Columns: eruptions, waiting

# Old Faithful data with density estimates (grid format)
faithfuld: pandas.DataFrame
# Columns: eruptions, waiting, density

# Lake Huron water levels 1875-1972
huron: pandas.DataFrame
# Columns: year, level, decade

# Vector field of seal movements
seals: pandas.DataFrame
# Columns: lat, long, delta_long, delta_lat
```

### Food Production and Web Data

```python { .api }
# US meat production by month (millions of lbs)
meat: pandas.DataFrame
# Columns: date, beef, veal, pork, lamb_and_mutton, broilers, other_chicken, turkey

# Website pageview data
pageviews: pandas.DataFrame
# Columns: date, pageviews
```

### Political Data

```python { .api }
# Terms of 11 US presidents from Eisenhower to Obama
presidential: pandas.DataFrame
# Columns: name, start, end, party
```

### Statistical Datasets

```python { .api }
# Anscombe's Quartet - 4 datasets with identical statistical properties
anscombe_quartet: pandas.DataFrame
# Columns: dataset, x, y

# Colors in Luv color space
luv_colours: pandas.DataFrame
# Columns: L, u, v, col
```

## Common Usage Patterns

### Quick Data Exploration
```python
from plotnine import ggplot, aes, geom_histogram, geom_point, facet_wrap
from plotnine.data import diamonds, penguins

# Explore diamond prices
price_dist = (ggplot(diamonds, aes(x='price')) +
              geom_histogram(bins=30) +
              facet_wrap('cut'))

# Penguin species comparison
penguin_plot = (ggplot(penguins, aes(x='bill_length_mm', y='bill_depth_mm', color='species')) +
                geom_point())
```

### Time Series Analysis
```python
from plotnine import ggplot, aes, geom_line
from plotnine.data import economics

# Economic trends over time
econ_plot = (ggplot(economics, aes(x='date', y='unemploy')) +
             geom_line())
```

### Statistical Examples
```python
from plotnine import ggplot, aes, geom_point, stat_smooth
from plotnine.data import mtcars

# Regression analysis
regression_plot = (ggplot(mtcars, aes(x='wt', y='mpg')) +
                   geom_point() +
                   stat_smooth(method='lm'))
```

## Dataset Categories

| Category | Datasets | Use Cases |
|----------|----------|-----------|
| **Automotive** | mtcars, mpg | Regression, clustering, factor analysis |
| **Economics** | economics, economics_long, txhousing | Time series, trend analysis |
| **Biology** | penguins, msleep | Species comparison, behavioral analysis |
| **Natural Phenomena** | faithful, faithfuld | Bivariate distributions, density estimates |
| **Geography** | midwest, seals, huron | Spatial analysis, movement patterns |
| **Retail** | diamonds | Price modeling, categorical analysis |
| **Food** | meat | Production trends, seasonal patterns |
| **Web** | pageviews | Traffic time series |
| **Politics** | presidential | Timeline analysis, categorical data |
| **Statistics** | anscombe_quartet, luv_colours | Statistical education, color analysis |

All datasets are provided as pandas DataFrames with appropriate data types. Where relevant, columns are stored as categorical variables, which lets plots present their levels in a meaningful order.