# Hindsight Experience Replay (HER)

Implementation of Hindsight Experience Replay for goal-conditioned reinforcement learning. HER learns from failed attempts by relabeling them as successful attempts toward the goals that were actually reached, which can substantially improve sample efficiency in sparse-reward environments.
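
The relabeling idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the helper names `sparse_reward` and `relabel_with_final_goal`, the 0.1 success threshold, and the toy trajectory are all assumptions made for the example.

```python
import numpy as np


def sparse_reward(achieved_goal, desired_goal, threshold=0.1):
    """Return 0.0 where the goal is reached and -1.0 elsewhere."""
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return np.where(distance < threshold, 0.0, -1.0)


def relabel_with_final_goal(achieved_goals, threshold=0.1):
    """Pretend the last achieved goal was the desired goal all along.

    achieved_goals[t] is the goal achieved after step t (hypothetical layout).
    """
    hindsight_goal = achieved_goals[-1]
    rewards = sparse_reward(achieved_goals, hindsight_goal, threshold)
    return hindsight_goal, rewards


# A failed episode: the agent moves right but never reaches the original goal.
achieved = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
original_goal = np.array([3.0, 3.0])

original_rewards = sparse_reward(achieved, original_goal)
hindsight_goal, hindsight_rewards = relabel_with_final_goal(achieved)

print(original_rewards)   # every step is penalized
print(hindsight_rewards)  # the final step now counts as a success
```

Under the original goal the episode carries no learning signal beyond the constant penalty. After relabeling, the same transitions contain a rewarded step.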

## Capabilities

### HER Replay Buffer

Specialized replay buffer that implements the Hindsight Experience Replay algorithm by automatically generating additional training samples from failed episodes.

```python { .api }
class HerReplayBuffer(ReplayBuffer):
    """
    Replay buffer with Hindsight Experience Replay.

    Args:
        buffer_size: Maximum buffer capacity
        observation_space: Observation space (must include 'observation', 'achieved_goal', 'desired_goal')
        action_space: Action space
        env_info: Additional environment information
        device: PyTorch device placement
        n_envs: Number of parallel environments
        optimize_memory_usage: Enable memory optimizations
        handle_timeout_termination: Handle timeout terminations properly
        n_sampled_goal: Number of virtual transitions per real transition
        goal_selection_strategy: Strategy for selecting goals ("future", "final", "episode", "random")
        wrapped_env: Environment wrapper for HER
        online_sampling: Whether to sample goals online during training
        max_episode_length: Maximum episode length for buffer management
    """

    def __init__(
        self,
        buffer_size: int,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        env_info: Optional[Dict[str, Any]] = None,
        device: Union[torch.device, str] = "auto",
        n_envs: int = 1,
        optimize_memory_usage: bool = False,
        handle_timeout_termination: bool = True,
        n_sampled_goal: int = 4,
        goal_selection_strategy: Union[GoalSelectionStrategy, str] = "future",
        wrapped_env: Optional[VecEnv] = None,
        online_sampling: bool = True,
        max_episode_length: Optional[int] = None,
    ): ...

    def add(
        self,
        obs: np.ndarray,
        next_obs: np.ndarray,
        actions: np.ndarray,
        rewards: np.ndarray,
        dones: np.ndarray,
        infos: List[Dict[str, Any]],
    ) -> None:
        """
        Add transition to replay buffer with HER.

        Args:
            obs: Current observations (dict with 'observation', 'achieved_goal', 'desired_goal')
            next_obs: Next observations
            actions: Actions taken
            rewards: Rewards received
            dones: Episode termination flags
            infos: Additional information from environment
        """

    def sample(self, batch_size: int, env: Optional[VecEnv] = None) -> ReplayBufferSamples:
        """
        Sample batch of transitions with hindsight goals.

        Args:
            batch_size: Number of transitions to sample
            env: Environment for computing rewards (if None, uses wrapped_env)

        Returns:
            Batch of experience samples with original and hindsight transitions
        """

    def _sample_goals(
        self,
        episode_transitions: List[Dict[str, np.ndarray]],
        transition_idx: int,
        n_sampled_goal: int,
    ) -> np.ndarray:
        """
        Sample goals for hindsight experience replay.

        Args:
            episode_transitions: List of transitions from episode
            transition_idx: Index of current transition
            n_sampled_goal: Number of goals to sample

        Returns:
            Array of sampled goals
        """

    def _store_episode(
        self,
        episode_transitions: List[Dict[str, np.ndarray]],
        is_success: bool,
    ) -> None:
        """
        Store episode transitions and generate HER samples.

        Args:
            episode_transitions: List of transitions from completed episode
            is_success: Whether episode was successful
        """

    def truncate_last_trajectory(self) -> None:
        """Truncate last incomplete trajectory from buffer."""
```

### Goal Selection Strategies

Strategies for choosing the substitute goals used to relabel transitions. Each makes a different trade-off between goal diversity and relevance to the current transition.

```python { .api }
from enum import Enum

class GoalSelectionStrategy(Enum):
    """
    Enumeration of goal selection strategies for HER.

    Strategies:
        FUTURE: Sample goals from future states in the same episode
        FINAL: Use the final achieved goal from the episode
        EPISODE: Sample goals from any state in the episode
        RANDOM: Sample completely random goals
    """
    FUTURE = "future"
    FINAL = "final"
    EPISODE = "episode"
    RANDOM = "random"

KEY_TO_GOAL_STRATEGY: Dict[str, GoalSelectionStrategy] = {
    "future": GoalSelectionStrategy.FUTURE,
    "final": GoalSelectionStrategy.FINAL,
    "episode": GoalSelectionStrategy.EPISODE,
    "random": GoalSelectionStrategy.RANDOM,
}
```
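
The within-episode strategies differ only in which step of the episode supplies the hindsight goal. The sketch below illustrates that choice with a hypothetical helper, `sample_goal_index`. It assumes the goal at index `t` is the one achieved after step `t`, so "future" may select `t` itself. The "random" strategy is left out because it draws goals from outside the current episode.

```python
import numpy as np


def sample_goal_index(strategy, transition_idx, episode_length, rng):
    """Pick the episode step whose achieved goal becomes the hindsight goal."""
    if strategy == "future":
        # A step at or after the current one in the same episode.
        return int(rng.integers(transition_idx, episode_length))
    if strategy == "final":
        # Always the last step of the episode.
        return episode_length - 1
    if strategy == "episode":
        # Any step of the episode, before or after the current one.
        return int(rng.integers(0, episode_length))
    raise ValueError(f"unsupported strategy: {strategy!r}")


rng = np.random.default_rng(0)
for strategy in ("future", "final", "episode"):
    indices = [sample_goal_index(strategy, 3, 10, rng) for _ in range(5)]
    print(strategy, indices)
```

With `transition_idx=3` in a 10-step episode, "future" only yields indices from 3 to 9, "final" always yields 9, and "episode" can yield any index from 0 to 9.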

### Environment Requirements

HER requires specific environment structure and interfaces to function properly with goal-conditioned learning.

```python { .api }
# Required observation space structure for HER
HER_OBSERVATION_SPACE = gym.spaces.Dict({
    'observation': gym.spaces.Box,    # Environment state
    'achieved_goal': gym.spaces.Box,  # Currently achieved goal
    'desired_goal': gym.spaces.Box,   # Desired goal for this episode
})

# Required info dict keys from environment
REQUIRED_INFO_KEYS = [
    'is_success',  # Boolean indicating if goal was achieved
]

# Optional info dict keys
OPTIONAL_INFO_KEYS = [
    'TimeLimit.truncated',  # Boolean indicating timeout termination
]
```
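
A quick pre-flight check can catch a missing key before training starts. The helper below is a hypothetical sketch, not part of the library; it works on any collection of key names, such as an observation dict.

```python
REQUIRED_OBS_KEYS = ("observation", "achieved_goal", "desired_goal")


def missing_her_keys(obs_keys):
    """Return the required HER keys that are absent from obs_keys, in order."""
    present = set(obs_keys)
    return [key for key in REQUIRED_OBS_KEYS if key not in present]


good = {"observation": [0.0], "achieved_goal": [0.0], "desired_goal": [1.0]}
bad = {"observation": [0.0], "goal": [1.0]}

print(missing_her_keys(good))  # []
print(missing_her_keys(bad))   # ['achieved_goal', 'desired_goal']
```

For a Gymnasium environment, the same check could be applied to `env.observation_space.spaces`, which exposes the sub-spaces as a mapping.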

## Usage Examples

### Basic HER Setup with SAC

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.vec_env import DummyVecEnv

# Create goal-conditioned environment (e.g., FetchReach-v1)
base_env = gym.make("FetchReach-v1")

# Verify environment has proper goal-conditioned structure
assert isinstance(base_env.observation_space, gym.spaces.Dict)
assert "observation" in base_env.observation_space.spaces
assert "achieved_goal" in base_env.observation_space.spaces
assert "desired_goal" in base_env.observation_space.spaces

# Wrap in vectorized environment (a separate name keeps the lambda
# from closing over a variable that is rebound afterwards)
env = DummyVecEnv([lambda: base_env])

# Configure SAC with HER
model = SAC(
    "MultiInputPolicy",  # Required for dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=True,
        max_episode_length=50,
    ),
    verbose=1,
)

# Train the agent
model.learn(total_timesteps=100000)
```

### Advanced HER Configuration

```python
from stable_baselines3 import TD3
from stable_baselines3.her import GoalSelectionStrategy, HerReplayBuffer

# Custom HER buffer configuration
her_kwargs = dict(
    n_sampled_goal=8,  # More hindsight goals per transition
    goal_selection_strategy=GoalSelectionStrategy.FUTURE,
    online_sampling=True,
    max_episode_length=100,
    handle_timeout_termination=True,
    optimize_memory_usage=False,
)

# Use with TD3 (also works with DDPG, SAC).
# `env` is the vectorized environment from the basic setup.
model = TD3(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=her_kwargs,
    buffer_size=1000000,
    learning_starts=1000,
    batch_size=256,
    verbose=1,
)

model.learn(total_timesteps=500000)
```

### Custom Goal-Conditioned Environment

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.vec_env import DummyVecEnv


class SimpleGoalEnv(gym.Env):
    """Simple goal-conditioned environment for HER demonstration."""

    def __init__(self):
        super().__init__()

        # Define spaces
        self.action_space = gym.spaces.Box(-1, 1, (2,), dtype=np.float32)

        # Goal-conditioned observation space
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'achieved_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'desired_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
        })

        self.goal_threshold = 0.1
        self.max_steps = 50

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        # Random initial position, cast to match the float32 spaces
        self.position = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)

        # Random goal
        self.goal = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)

        self.step_count = 0

        obs = {
            'observation': self.position.copy(),
            'achieved_goal': self.position.copy(),
            'desired_goal': self.goal.copy(),
        }

        return obs, {}

    def step(self, action):
        # Move based on action and stay inside the observation bounds
        moved = self.position + 0.1 * np.asarray(action)
        self.position = np.clip(moved, -5, 5).astype(np.float32)

        # Check if goal is achieved
        distance = float(np.linalg.norm(self.position - self.goal))
        is_success = distance < self.goal_threshold

        # Sparse reward: 0 for success, -1 otherwise
        reward = 0.0 if is_success else -1.0

        self.step_count += 1
        terminated = is_success
        truncated = self.step_count >= self.max_steps

        obs = {
            'observation': self.position.copy(),
            'achieved_goal': self.position.copy(),
            'desired_goal': self.goal.copy(),
        }

        info = {
            'is_success': is_success,
            'distance': distance,
        }

        return obs, reward, terminated, truncated, info

    def compute_reward(self, achieved_goal, desired_goal, info):
        """Compute reward for HER, using the same convention as step()."""
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        # 0 for success, -1 otherwise; works on single goals and on batches
        return np.where(distance < self.goal_threshold, 0.0, -1.0).astype(np.float32)


# Use custom environment with HER
custom_env = SimpleGoalEnv()
vec_env = DummyVecEnv([lambda: custom_env])

model = SAC(
    "MultiInputPolicy",
    vec_env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
)

model.learn(total_timesteps=50000)
```

### HER with Different Goal Selection Strategies

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Compare different goal selection strategies
strategies = ["future", "final", "episode", "random"]
models = {}

for strategy in strategies:
    print(f"Training with {strategy} strategy...")

    env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])

    model = SAC(
        "MultiInputPolicy",
        env,
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(
            n_sampled_goal=4,
            goal_selection_strategy=strategy,
        ),
        verbose=0,
    )

    model.learn(total_timesteps=25000)
    models[strategy] = model

# Evaluate every model on the same environment instance
eval_env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])

for strategy, model in models.items():
    mean_reward, std_reward = evaluate_policy(
        model,
        eval_env,
        n_eval_episodes=20,
        deterministic=True,
    )
    print(f"{strategy}: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Monitoring HER Training

```python
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np


class HERMonitorCallback(BaseCallback):
    """Custom callback to monitor HER training progress."""

    def __init__(self, eval_env, verbose=0):
        super().__init__(verbose)
        self.eval_env = eval_env
        self.success_rates = []

    def _on_step(self) -> bool:
        # Log HER-specific metrics every 1000 steps
        if self.n_calls % 1000 == 0:
            # Evaluate success rate
            n_eval_episodes = 10
            successes = 0

            for _ in range(n_eval_episodes):
                # Gymnasium API: reset() returns (obs, info)
                obs, _ = self.eval_env.reset()
                done = False

                while not done:
                    action, _ = self.model.predict(obs, deterministic=True)
                    # Gymnasium API: step() returns five values
                    obs, reward, terminated, truncated, info = self.eval_env.step(action)
                    done = terminated or truncated

                    if info.get('is_success', False):
                        successes += 1
                        break

            success_rate = successes / n_eval_episodes
            self.success_rates.append(success_rate)

            # Log to tensorboard
            self.logger.record("eval/success_rate", success_rate)
            self.logger.record("eval/mean_success_rate", np.mean(self.success_rates[-10:]))

        return True


# Use monitoring callback
eval_env = gym.make("FetchReach-v1")
monitor_callback = HERMonitorCallback(eval_env, verbose=1)

model.learn(total_timesteps=100000, callback=monitor_callback)
```

## Implementation Notes

### Environment Compatibility

For an environment to work with HER, it should provide:

1. **Observation Space**: A `gym.spaces.Dict` with keys:
   - `'observation'`: The actual environment state
   - `'achieved_goal'`: Currently achieved goal
   - `'desired_goal'`: Target goal for the episode

2. **Info Dictionary**: A boolean `'is_success'` entry in the info dict

3. **Reward Function**: Ideally a `compute_reward(achieved_goal, desired_goal, info)` method, so rewards can be recomputed efficiently for relabeled goals

### Goal Selection Strategy Trade-offs

- **Future**: Most sample efficient, learns from states that could be achieved
- **Final**: Simple but less diverse, only uses episode end states
- **Episode**: More diverse but potentially less focused
- **Random**: Highest diversity but may include irrelevant goals

### Performance Considerations

- HER increases memory usage because episode transitions must be stored for relabeling
- `n_sampled_goal` controls the ratio of relabeled to real transitions; higher values can improve learning but increase computation
- `online_sampling=True` is more memory efficient but slightly slower
- Works best with off-policy algorithms (SAC, TD3, DDPG, DQN)
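
To make the effect of `n_sampled_goal` concrete, the following sketch computes how a sampled batch would split between relabeled and real transitions. It assumes `n_sampled_goal` relabeled transitions per real one, so the relabeled fraction is `n / (n + 1)`; the helper name `hindsight_split` is hypothetical.

```python
def hindsight_split(batch_size, n_sampled_goal):
    """Split a batch into (relabeled, real) transition counts.

    Assumes n_sampled_goal relabeled transitions per real transition,
    which gives a relabeled fraction of n / (n + 1).
    """
    her_ratio = 1.0 - 1.0 / (n_sampled_goal + 1)
    n_relabeled = int(her_ratio * batch_size)
    return n_relabeled, batch_size - n_relabeled


for n in (1, 4, 8):
    relabeled, real = hindsight_split(256, n)
    print(f"n_sampled_goal={n}: {relabeled} relabeled, {real} real")
```

Under this assumption, the default `n_sampled_goal=4` means roughly 80% of each batch uses hindsight goals, and raising it further gives diminishing increases in that share.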

## Types

```python { .api }
from typing import Union, Optional, Type, Callable, Dict, Any, List, Tuple
import numpy as np
import gymnasium as gym
from stable_baselines3.common.type_aliases import GymEnv, ReplayBufferSamples
from stable_baselines3.common.buffers import ReplayBuffer
from stable_baselines3.her.her_replay_buffer import HerReplayBuffer, GoalSelectionStrategy
from stable_baselines3.common.vec_env import VecEnv
from stable_baselines3.common.base_class import BaseAlgorithm
```