# Built-in Datasets

Sample datasets for learning and experimentation with plotly visualizations. The data module provides 10+ commonly used datasets in data science, returned as pandas DataFrames (or other backends if configured).

## Capabilities

### Classification and Clustering Datasets

Classic datasets for machine learning and statistical analysis.

```python { .api }
def iris():
    """
    Load the Iris flower dataset.

    Contains measurements of iris flowers from three species: setosa, versicolor, and virginica.
    Each sample has four features: sepal length, sepal width, petal length, and petal width.

    Returns:
        DataFrame: 150 rows × 6 columns
        - sepal_length: float, sepal length in cm
        - sepal_width: float, sepal width in cm
        - petal_length: float, petal length in cm
        - petal_width: float, petal width in cm
        - species: str, flower species ('setosa', 'versicolor', 'virginica')
        - species_id: int, numeric species identifier (1, 2, 3)
    """

def tips():
    """
    Load restaurant tips dataset.

    Contains information about restaurant bills, tips, and customer characteristics.
    Useful for exploring relationships between categorical and continuous variables.

    Returns:
        DataFrame: 244 rows × 7 columns
        - total_bill: float, total bill amount in dollars
        - tip: float, tip amount in dollars
        - sex: str, customer gender ('Male', 'Female')
        - smoker: str, smoking status ('Yes', 'No')
        - day: str, day of week ('Thur', 'Fri', 'Sat', 'Sun')
        - time: str, meal time ('Lunch', 'Dinner')
        - size: int, party size (number of people)
    """
```
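
Most plotting calls refer to columns by name, so it can help to compare a loaded frame against the documented column list before using it. The following is a minimal sketch rather than part of plotly: `missing_columns` is a hypothetical helper, and the small hand-built frame (with illustrative values) stands in for the result of `plotly.data.iris()` so the snippet runs with pandas alone.

```python
import pandas as pd

# Column names documented above for the iris dataset.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length",
                "petal_width", "species", "species_id"]


def missing_columns(df, expected):
    """Return the expected column names absent from df (hypothetical helper)."""
    return [name for name in expected if name not in df.columns]


# Stand-in rows with illustrative values; in practice, pass plotly.data.iris().
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.7],
    "petal_width": [0.2, 1.4],
    "species": ["setosa", "versicolor"],
})

# The stand-in frame deliberately omits species_id, so the helper reports it.
print(missing_columns(sample, IRIS_COLUMNS))
```

An empty list means every documented column is present.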

### Economic and Demographic Data

Datasets containing economic indicators and demographic information over time.

```python { .api }
def gapminder():
    """
    Load Gapminder world development dataset.

    Contains country-level data on life expectancy, GDP per capita, and population
    from 1952 to 2007. Excellent for demonstrating animated visualizations and
    geographic mapping.

    Returns:
        DataFrame: 1704 rows × 8 columns
        - country: str, country name
        - continent: str, continent name ('Africa', 'Americas', 'Asia', 'Europe', 'Oceania')
        - year: int, year (1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007)
        - lifeExp: float, life expectancy in years
        - pop: int, population count
        - gdpPercap: float, GDP per capita in US dollars
        - iso_alpha: str, 3-letter ISO country code
        - iso_num: int, numeric ISO country code
    """

def medals_wide():
    """
    Load Olympic medals dataset in wide format.

    Contains the Olympic short track speed skating medal table for the top
    three nations as of 2020, with separate columns for each medal type.

    Returns:
        DataFrame: 3 rows × 4 columns
        - nation: str, country name
        - gold: int, number of gold medals
        - silver: int, number of silver medals
        - bronze: int, number of bronze medals
    """

def medals_long():
    """
    Load Olympic medals dataset in long format.

    Same data as medals_wide but in tidy/long format with medal type as a variable.

    Returns:
        DataFrame: 9 rows × 3 columns
        - nation: str, country name
        - medal: str, medal type ('gold', 'silver', 'bronze')
        - count: int, number of medals of that type
    """
```
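
The relationship between the wide and long medal layouts can be reproduced with `pandas.DataFrame.melt`. This is a minimal sketch on a hand-built frame: the nation names and counts are placeholders, not the actual dataset, and the row order produced by `melt` is not claimed to match what `medals_long()` returns.

```python
import pandas as pd

# Hand-built frame in the medals_wide layout; names and counts are placeholders.
wide = pd.DataFrame({
    "nation": ["A", "B", "C"],
    "gold": [3, 1, 0],
    "silver": [2, 2, 1],
    "bronze": [1, 0, 4],
})

# One row per (nation, medal) pair, i.e. the medals_long column layout.
long = wide.melt(id_vars="nation", var_name="medal", value_name="count")

print(long.shape)          # each nation contributes one row per medal type
print(list(long.columns))
```

Long form suits calls such as `px.bar(..., color="medal")`, while wide form lets several columns be passed as a list to `y`.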

### Time Series and Financial Data

Datasets with temporal components for time series analysis and visualization.

```python { .api }
def stocks():
    """
    Load stock price dataset.

    Contains closing prices for six technology stocks (GOOG, AAPL, AMZN, FB, NFLX, MSFT)
    across 2018 and 2019. Useful for financial charts and time series analysis.

    Returns:
        DataFrame: about 100 weekly rows × 7 columns
        - date: str, date of the observation
        - GOOG: float, Google stock price
        - AAPL: float, Apple stock price
        - AMZN: float, Amazon stock price
        - FB: float, Facebook stock price
        - NFLX: float, Netflix stock price
        - MSFT: float, Microsoft stock price
    """

def flights():
    """
    Load airline passenger flights dataset.

    Contains monthly passenger counts for different airlines and airports.
    Good for demonstrating time series patterns and seasonal trends.

    Returns:
        DataFrame: 5733 rows × 4 columns
        - year: int, year
        - month: int, month (1-12)
        - passengers: int, number of passengers
        - airline: str, airline identifier
    """
```
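
Time series plots generally behave best when the date column holds real datetime values. If `date` arrives as ISO-formatted strings, `pandas.to_datetime` converts it. A minimal sketch on a hand-built frame in the stocks layout follows; the rows and values are placeholders, not actual prices.

```python
import pandas as pd

# Placeholder rows in the stocks layout, deliberately out of order.
df = pd.DataFrame({
    "date": ["2018-01-15", "2018-01-01", "2018-01-08"],
    "AAPL": [1.03, 1.00, 1.01],
})

# Parse the strings into datetime values, then sort and index by date.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").set_index("date")

print(df)
```

With a datetime index in place, pandas operations such as `resample` and `rolling` become available on the frame.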

### Election and Political Data

Datasets containing electoral and political information.

```python { .api }
def election():
    """
    Load 2013 Montreal mayoral election results.

    Contains voting results by electoral district, with per-candidate vote
    counts that can be joined to district boundaries for choropleth mapping.

    Returns:
        DataFrame: 58 rows × 8 columns
        - district: str, electoral district name
        - Coderre: int, votes for Denis Coderre
        - Bergeron: int, votes for Richard Bergeron
        - Joly: int, votes for Mélanie Joly
        - total: int, total votes cast
        - winner: str, winning candidate name
        - result: str, result type ('plurality', 'majority')
        - district_id: int, district identifier for mapping
    """

def election_geojson():
    """
    Load GeoJSON data for Montreal election districts.

    Geographic boundary data corresponding to the election dataset,
    used for creating choropleth maps.

    Returns:
        dict: GeoJSON feature collection with district boundaries
    """
```
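
`px.choropleth` draws a region only when its `locations` value matches a key on some GeoJSON feature (the feature `id` by default, or the path given through `featureidkey`). Checking for unmatched values before plotting can explain a blank map. The sketch below uses a tiny hypothetical feature collection with made-up district names and geometry; `unmatched_locations` is an illustrative helper, not a plotly function.

```python
# Hypothetical miniature FeatureCollection; names and coordinates are made up.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "1",
            "properties": {"district": "1-North"},
            "geometry": {"type": "Polygon",
                         "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
        },
        {
            "type": "Feature",
            "id": "2",
            "properties": {"district": "2-South"},
            "geometry": {"type": "Polygon",
                         "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 0]]]},
        },
    ],
}


def unmatched_locations(locations, feature_collection, property_name):
    """Return location values with no feature carrying that property value."""
    known = {
        feature["properties"].get(property_name)
        for feature in feature_collection["features"]
    }
    return [loc for loc in locations if loc not in known]


# "3-East" has no matching feature, so it would not be drawn.
print(unmatched_locations(["1-North", "2-South", "3-East"], geojson, "district"))
```

The same check applies to real data by passing the DataFrame column used for `locations` and the property name referenced in `featureidkey`.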

### Scientific and Environmental Data

Datasets from scientific measurements and environmental monitoring.

```python { .api }
def wind():
    """
    Load wind measurement dataset.

    Contains wind speed and direction measurements, useful for polar plots,
    wind roses, and meteorological visualizations.

    Returns:
        DataFrame: 128 rows × 3 columns
        - direction: str, 16-point compass direction ('N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
          'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW')
        - strength: str, wind strength category ('0-1', '1-2', '2-3', '3-4', '4-4', '4-5', '5-6', '6+')
        - frequency: float, frequency of occurrence
    """

def carshare():
    """
    Load car sharing usage dataset.

    Contains the availability of car-sharing services near the centroid of
    each zone in Montreal over a month-long period.

    Returns:
        DataFrame: 249 rows × 4 columns
        - centroid_lat: float, latitude of zone centroid
        - centroid_lon: float, longitude of zone centroid
        - car_hours: float, total car availability hours
        - peak_hour: int, hour of day with peak availability
    """
```

### Experimental and A/B Testing Data

Datasets designed for statistical analysis and experimental design examples.

```python { .api }
def experiment():
    """
    Load A/B testing experiment dataset.

    Contains results for 100 simulated participants on three hypothetical
    experiments, along with each participant's gender and control/treatment group.
    Useful for demonstrating statistical analysis and hypothesis testing.

    Returns:
        DataFrame: 100 rows × 5 columns
        - experiment_1: float, first experiment result
        - experiment_2: float, second experiment result
        - experiment_3: float, third experiment result
        - gender: str, participant gender
        - group: str, experimental group ('control', 'treatment')
    """
```

## Usage Examples

```python
import plotly.express as px
import plotly.data as data

# Load and explore iris dataset
df_iris = data.iris()
print(df_iris.head())
df_iris.info()  # info() prints its summary directly and returns None

# Create scatter plot with iris data
fig1 = px.scatter(df_iris, x="sepal_width", y="sepal_length",
                  color="species", size="petal_length",
                  title="Iris Dataset Visualization")
fig1.show()

# Load gapminder for animated visualization
df_gap = data.gapminder()
fig2 = px.scatter(df_gap, x="gdpPercap", y="lifeExp",
                  animation_frame="year", animation_group="country",
                  size="pop", color="continent", hover_name="country",
                  log_x=True, size_max=55, range_x=[100, 100000],
                  range_y=[25, 90], title="Gapminder Animation")
fig2.show()

# Stock price time series (the Google column is named GOOG)
df_stocks = data.stocks()
fig3 = px.line(df_stocks, x="date", y=["AAPL", "GOOG", "AMZN"],
               title="Tech Stock Prices")
fig3.show()

# Tips dataset for statistical analysis
df_tips = data.tips()
fig4 = px.box(df_tips, x="day", y="total_bill", color="time",
              title="Restaurant Bills by Day and Time")
fig4.show()

# Wind data for polar visualization
df_wind = data.wind()
fig5 = px.bar_polar(df_wind, r="frequency", theta="direction",
                    color="strength", template="plotly_dark",
                    color_discrete_sequence=px.colors.sequential.Plasma_r,
                    title="Wind Pattern Analysis")
fig5.show()

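# Election data for choropleth mapping
df_election = data.election()
geojson = data.election_geojson()
fig6 = px.choropleth(df_election, geojson=geojson, locations="district",
                     featureidkey="properties.district",  # match rows to features by district name
                     color="winner",
                     hover_data=["Coderre", "Bergeron", "Joly"],
                     projection="mercator",
                     title="Montreal Election Results")
# Zoom to the districts and hide the base world map
fig6.update_geos(fitbounds="locations", visible=False)
fig6.show()
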
# Car sharing geographic analysis
df_cars = data.carshare()
fig7 = px.scatter_mapbox(df_cars, lat="centroid_lat", lon="centroid_lon",
                         size="car_hours", color="peak_hour",
                         hover_data=["car_hours"], zoom=10, height=600,
                         mapbox_style="open-street-map",
                         title="Car Sharing Usage Patterns")
fig7.show()

# Olympic medals comparison
df_medals = data.medals_long()
fig8 = px.bar(df_medals, x="nation", y="count", color="medal",
              title="Olympic Short Track Speed Skating Medal Count")
fig8.show()

# Flight passenger trends
df_flights = data.flights()
fig9 = px.line(df_flights, x="month", y="passengers", color="airline",
               title="Airline Passenger Trends")
fig9.show()

# A/B testing results
df_experiment = data.experiment()
fig10 = px.box(df_experiment, y=["experiment_1", "experiment_2", "experiment_3"],
               color="group", title="A/B Testing Results")
fig10.show()

# Dataset information summary
datasets = [
    ('iris', data.iris),
    ('tips', data.tips),
    ('gapminder', data.gapminder),
    ('stocks', data.stocks),
    ('flights', data.flights),
    ('wind', data.wind),
    ('election', data.election),
    ('carshare', data.carshare),
    ('medals_long', data.medals_long),
    ('experiment', data.experiment)
]

for name, func in datasets:
    df = func()
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
```