# Data Utilities

Utilities for creating hierarchical data structures from bottom-level time series data. These functions handle aggregation across multiple dimensions, create summing matrices, and prepare data in the format required by hierarchical reconciliation methods.

## Capabilities

### Cross-sectional Aggregation

Main function for creating hierarchical structures from bottom-level time series by aggregating across categorical dimensions.

```python { .api }
def aggregate(
    df: Frame,
    spec: list[list[str]],
    exog_vars: Optional[dict[str, Union[str, list[str]]]] = None,
    sparse_s: bool = False,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    id_time_col: Optional[str] = None,
    target_cols: Sequence[str] = ('y',)
) -> tuple[FrameT, FrameT, dict]:
    """
    Create hierarchical structure from bottom-level time series.

    Parameters:
    - df: DataFrame with bottom-level time series data
      Must contain time_col, target_cols, and the categorical columns referenced in spec
    - spec: list of aggregation specifications
      Each inner list names the columns that define the grouping for that level
      Example: [['region'], ['region', 'category']] creates two aggregation levels
    - exog_vars: dict mapping exogenous variable names to aggregation functions
      Example: {'price': 'mean', 'volume': 'sum'}
    - sparse_s: bool, whether to return a sparse summing matrix for memory efficiency
    - id_col: str, name of the unique identifier column
    - time_col: str, name of the time column
    - id_time_col: str, temporal hierarchy identifier (for temporal aggregation)
    - target_cols: tuple of target variable column names

    Returns:
    - Y_df: DataFrame with hierarchically structured series
    - S_df: DataFrame representation of the summing matrix (sparse if sparse_s=True)
    - tags: dict mapping hierarchy level names to the series identifiers in each level
    """
```

### Temporal Aggregation

Function for creating temporal hierarchies by aggregating time series at different frequencies.

```python { .api }
def aggregate_temporal(
    df: Frame,
    spec: dict[str, int],
    exog_vars: Optional[dict[str, Union[str, list[str]]]] = None,
    sparse_s: bool = False,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    id_time_col: str = 'temporal_id',
    target_cols: Sequence[str] = ('y',),
    aggregation_type: str = 'local'
) -> tuple[FrameT, FrameT, dict]:
    """
    Create temporal hierarchy from time series data.

    Parameters:
    - df: DataFrame with time series data at the base frequency
    - spec: dict mapping temporal level names to the number of base-frequency
      timesteps aggregated at that level
      Example (monthly base data): {'Annual': 12, 'Quarterly': 3, 'Monthly': 1}
    - exog_vars: dict of exogenous variables and their aggregation functions
    - sparse_s: bool, return sparse summing matrix
    - id_col: str, unique identifier column name
    - time_col: str, time column name
    - id_time_col: str, temporal hierarchy identifier column name
    - target_cols: tuple of target variable names
    - aggregation_type: str, type of temporal aggregation ('local' or 'global')

    Returns:
    - Y_df: DataFrame with temporal hierarchy
    - S_df: Temporal summing matrix
    - tags: dict mapping temporal levels to identifiers
    """
```

### Future Dataframe Creation

Utility for creating future timestamp dataframes for forecasting.

```python { .api }
def make_future_dataframe(
    df: Frame,
    freq: Union[str, int],
    h: int,
    id_col: str = 'unique_id',
    time_col: str = 'ds'
) -> FrameT:
    """
    Create dataframe with future timestamps for forecasting.

    Parameters:
    - df: DataFrame with historical time series data
    - freq: frequency string (e.g., 'D', 'W', 'M') or integer step for integer-indexed time
    - h: int, forecast horizon (number of periods ahead)
    - id_col: str, unique identifier column name
    - time_col: str, time column name

    Returns:
    DataFrame with future timestamps for each series
    """
```

### Cross-Temporal Tags

Function for generating tags that combine cross-sectional and temporal hierarchies.

```python { .api }
def get_cross_temporal_tags(
    df: pd.DataFrame,
    tags_cs: dict,
    tags_te: dict,
    sep: str = '//',
    id_col: str = 'unique_id',
    id_time_col: str = 'temporal_id',
    cross_temporal_id_col: str = 'cross_temporal_id'
) -> tuple[pd.DataFrame, dict]:
    """
    Generate cross-temporal hierarchy tags.

    Parameters:
    - df: DataFrame with cross-temporal data
    - tags_cs: dict with cross-sectional hierarchy tags
    - tags_te: dict with temporal hierarchy tags
    - sep: str, separator for combining cross-sectional and temporal identifiers
    - id_col: str, cross-sectional identifier column
    - id_time_col: str, temporal identifier column
    - cross_temporal_id_col: str, combined identifier column name

    Returns:
    - Updated DataFrame with cross-temporal identifiers
    - Combined tags dictionary for cross-temporal hierarchy
    """
```

### Hierarchy Structure Validation

Utility function to check if a hierarchy structure is strictly hierarchical.

```python { .api }
def is_strictly_hierarchical(S: pd.DataFrame, tags: dict) -> bool:
    """
    Check if hierarchy structure is strictly hierarchical.

    Parameters:
    - S: summing matrix DataFrame
    - tags: hierarchy tags dictionary

    Returns:
    bool indicating whether structure is strictly hierarchical
    """
```

## Usage Examples

### Basic Cross-sectional Aggregation

```python
import pandas as pd
from hierarchicalforecast.utils import aggregate

# Bottom-level data: the categorical columns place each series in the hierarchy
df = pd.DataFrame({
    'ds': pd.date_range('2020-01-01', periods=2, freq='D').tolist() * 4,
    'y': [100, 110, 200, 220, 150, 160, 180, 190],
    'region': ['North', 'North', 'North', 'North', 'South', 'South', 'South', 'South'],
    'category': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'item': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
})

# Define hierarchy specification: each inner list names the columns for one level,
# from the most aggregated to the most disaggregated (bottom) level
spec = [
    ['region'],                      # Aggregate by region
    ['region', 'category'],          # Aggregate by region and category
    ['region', 'category', 'item'],  # Bottom level
]

# Create hierarchical structure
Y_df, S_df, tags = aggregate(df, spec)

print("Hierarchical series:")
print(Y_df.head())
print("\nHierarchy tags:")
print(tags)
```
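
To see how the three outputs fit together, the sketch below checks the aggregation constraint Y_agg = S @ y_bottom on the example above. It assumes the common layout in which `S_df` stores the series names in a `unique_id` column with one column per bottom-level series; if your version lays out `S_df` differently, adjust the indexing accordingly.

```python
import numpy as np

# Summing matrix: aggregate series as rows, bottom-level series as columns
# (assumes S_df keeps series names in a 'unique_id' column)
S = S_df.set_index('unique_id')

# Pivot the hierarchical series to a (series x time) matrix
Y_wide = Y_df.pivot(index='unique_id', columns='ds', values='y')

# Every series should equal the summing matrix applied to the bottom-level series
lhs = Y_wide.loc[S.index].to_numpy()
rhs = S.to_numpy() @ Y_wide.loc[S.columns].to_numpy()
print(np.allclose(lhs, rhs))  # expected: True
```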

### Temporal Aggregation

```python
import numpy as np
import pandas as pd

from hierarchicalforecast.utils import aggregate_temporal

# Daily data to be aggregated temporally
daily_df = pd.DataFrame({
    'unique_id': ['series1'] * 365,
    'ds': pd.date_range('2020-01-01', periods=365, freq='D'),
    'y': np.random.randn(365).cumsum() + 100
})

# Define temporal aggregation specification: each value is the number of
# base (daily) timesteps aggregated at that level
temporal_spec = {
    'Daily': 1,       # Base frequency
    'Weekly': 7,      # Aggregate every 7 days
    'Monthly': 30,    # Aggregate every 30 days
    'Quarterly': 90   # Aggregate every 90 days
}

# Create temporal hierarchy
Y_temporal, S_temporal, tags_temporal = aggregate_temporal(
    daily_df,
    temporal_spec
)
```

### Aggregation with Exogenous Variables

```python
# Data with exogenous variables
df_with_exog = pd.DataFrame({
    'ds': pd.date_range('2020-01-01', periods=2, freq='D').tolist() * 2,
    'y': [100, 110, 200, 220],
    'region': ['North', 'North', 'North', 'North'],
    'item': ['A', 'A', 'B', 'B'],
    'price': [10.5, 10.8, 12.0, 12.3],
    'volume': [1000, 1100, 2000, 2200]
})

# Specify how to aggregate exogenous variables
exog_aggregation = {
    'price': 'mean',  # Average price across aggregated series
    'volume': 'sum'   # Sum volume across aggregated series
}

spec = [['region'], ['region', 'item']]  # Aggregate items A and B up to their region

Y_df, S_df, tags = aggregate(
    df_with_exog,
    spec,
    exog_vars=exog_aggregation
)
```

### Large Hierarchy with Sparse Matrix

```python
# For very large hierarchies, use a sparse summing matrix
Y_df_sparse, S_sparse, tags_sparse = aggregate(
    large_dataset,   # your bottom-level DataFrame
    complex_spec,    # your aggregation specification
    sparse_s=True    # store the summing matrix in a sparse representation
)

# S_sparse holds the summing matrix in a memory-efficient sparse representation
print(f"Sparse summing matrix shape: {S_sparse.shape}")
```
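
Downstream code that expects a dense array can convert whichever sparse representation is returned. This is a minimal sketch, not part of the library: the helper name `summing_matrix_to_array` is hypothetical, and it assumes the summing structure is either a scipy sparse matrix or a (possibly sparse-backed) DataFrame with a `unique_id` column, as in the examples above.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Hypothetical helper: densify the summing structure regardless of representation.
def summing_matrix_to_array(S):
    if sp.issparse(S):
        # scipy sparse matrix
        return S.toarray()
    if isinstance(S, pd.DataFrame):
        # DataFrame, possibly with sparse-backed columns; drop the id column if present
        return S.drop(columns='unique_id', errors='ignore').to_numpy()
    return np.asarray(S)

S_dense = summing_matrix_to_array(S_sparse)
print(S_dense.shape)
```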

### Creating Future Dataframes

```python
from hierarchicalforecast.utils import make_future_dataframe

# Create future timestamps for forecasting
future_df = make_future_dataframe(
    df=historical_data,
    freq='D',            # Daily frequency
    h=30,                # 30 days ahead
    id_col='unique_id',
    time_col='ds'
)

print("Future timestamps:")
print(future_df.head())
```

### Combined Cross-sectional and Temporal Hierarchies

```python
from hierarchicalforecast.utils import get_cross_temporal_tags

# First create cross-sectional hierarchy
Y_cs, S_cs, tags_cs = aggregate(df, cross_sectional_spec)

# Then create temporal hierarchy
Y_te, S_te, tags_te = aggregate_temporal(Y_cs, temporal_spec)

# Combine them
Y_cross_temp, tags_cross_temp = get_cross_temporal_tags(
    df=Y_te,
    tags_cs=tags_cs,
    tags_te=tags_te,
    sep='//'
)
```
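
A quick way to inspect the combined structure; this only assumes that `tags_cross_temp` is a plain dict mapping level names to the identifiers in each level, as documented above.

```python
# One entry per combined (cross-sectional level, temporal level) pair
for level, ids in tags_cross_temp.items():
    print(f"{level}: {len(ids)} series")
```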

### Validation

```python
from hierarchicalforecast.utils import is_strictly_hierarchical

# Check if hierarchy is strictly hierarchical
is_strict = is_strictly_hierarchical(S_df, tags)
print(f"Strictly hierarchical: {is_strict}")
```

## Output Utility Functions

Utility functions for converting prediction intervals and samples to different output formats.

```python { .api }
def level_to_outputs(level: list[int]) -> list[str]:
    """
    Convert confidence levels to output column names.

    Parameters:
    - level: list of confidence levels (e.g., [80, 95])

    Returns:
    List of column name strings for low and high bounds
    """

def quantiles_to_outputs(quantiles: list[float]) -> list[str]:
    """
    Convert quantiles to output column names.

    Parameters:
    - quantiles: list of quantile levels (e.g., [0.1, 0.5, 0.9])

    Returns:
    List of quantile column name strings
    """

def samples_to_quantiles_df(
    samples: np.ndarray,
    unique_ids: list,
    dates: list,
    quantiles: list[float],
    id_col: str = 'unique_id',
    time_col: str = 'ds'
) -> pd.DataFrame:
    """
    Transform samples array to quantile DataFrame.

    Parameters:
    - samples: array of forecast samples
    - unique_ids: list of series identifiers
    - dates: list of forecast dates
    - quantiles: list of quantile levels to compute
    - id_col: identifier column name
    - time_col: time column name

    Returns:
    DataFrame with quantile columns
    """
```
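
A short usage sketch for these helpers, following the signatures documented above. It assumes the functions are importable from `hierarchicalforecast.utils` like the other utilities on this page, and that the sample array is shaped (n_samples, n_series, horizon); exact return types and column naming can differ between versions, so treat this as illustrative rather than authoritative.

```python
import numpy as np
import pandas as pd
from hierarchicalforecast.utils import (
    level_to_outputs,
    quantiles_to_outputs,
    samples_to_quantiles_df,
)

# Column names for 80% / 95% prediction intervals and for explicit quantiles
print(level_to_outputs([80, 95]))
print(quantiles_to_outputs([0.1, 0.5, 0.9]))

# Convert an array of forecast samples into a quantile DataFrame
series = ['North/X/A', 'North/X/B']   # illustrative series identifiers
horizon = 3
samples = np.random.randn(1000, len(series), horizon)  # assumed sample layout
dates = pd.date_range('2020-01-03', periods=horizon, freq='D')

quantiles_df = samples_to_quantiles_df(
    samples=samples,
    unique_ids=series,
    dates=dates,
    quantiles=[0.1, 0.5, 0.9],
)
```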