Tessl Tile for pypi/stable-baselines3@2.7.0

# Hindsight Experience Replay (HER)

Implementation of Hindsight Experience Replay (HER) for goal-conditioned reinforcement learning. HER lets an agent learn from failed attempts by replaying them as if the goal had been a state the agent actually reached, which can substantially improve sample efficiency in sparse-reward environments.

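The relabeling idea described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: the transition layout, the `sparse_reward` and `relabel` helpers, and the 0.1 success threshold are all assumptions made for the example.

```python
import math

def sparse_reward(achieved_goal, desired_goal, threshold=0.1):
    """Return 0.0 when the goal is reached and -1.0 otherwise (assumed convention)."""
    return 0.0 if math.dist(achieved_goal, desired_goal) < threshold else -1.0

def relabel(transition, new_goal):
    """Copy a transition, swap in a hindsight goal, and recompute its reward."""
    relabeled = dict(transition)
    relabeled["desired_goal"] = new_goal
    relabeled["reward"] = sparse_reward(transition["next_achieved_goal"], new_goal)
    return relabeled

# A failed step: the agent reached (1.0, 1.0) but was asked to reach (3.0, 3.0).
failed = {
    "next_achieved_goal": (1.0, 1.0),
    "desired_goal": (3.0, 3.0),
    "reward": -1.0,
}

# In hindsight, pretend the reached position was the goal all along.
hindsight = relabel(failed, new_goal=(1.0, 1.0))
print(failed["reward"], hindsight["reward"])
```

The original transition is left untouched, so both the real and the relabeled version can be used for training.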
## Capabilities

### HER Replay Buffer

Specialized replay buffer that implements HER: when a batch is sampled, part of the stored transitions are relabeled with goals achieved in the same episode, producing additional training samples from episodes that missed their original goal.

```python { .api }
class HerReplayBuffer(DictReplayBuffer):
    """
    Replay buffer with Hindsight Experience Replay.

    Args:
        buffer_size: Maximum buffer capacity
        observation_space: Dict observation space (must include 'observation', 'achieved_goal', 'desired_goal')
        action_space: Action space
        env: Training environment, used to recompute rewards for relabeled goals
        device: PyTorch device placement
        n_envs: Number of parallel environments
        optimize_memory_usage: Memory-optimized storage (not supported by dict buffers; leave False)
        handle_timeout_termination: Treat time-limit truncations separately from true terminations
        n_sampled_goal: Number of virtual transitions per real transition
        goal_selection_strategy: Strategy for selecting goals ("future", "final", "episode")
        copy_info_dict: Copy the info dict and pass it to compute_reward() (may slow sampling down)
    """
    def __init__(
        self,
        buffer_size: int,
        observation_space: gym.spaces.Dict,
        action_space: gym.spaces.Space,
        env: VecEnv,
        device: Union[torch.device, str] = "auto",
        n_envs: int = 1,
        optimize_memory_usage: bool = False,
        handle_timeout_termination: bool = True,
        n_sampled_goal: int = 4,
        goal_selection_strategy: Union[GoalSelectionStrategy, str] = "future",
        copy_info_dict: bool = False,
    ): ...

    def add(
        self,
        obs: Dict[str, np.ndarray],
        next_obs: Dict[str, np.ndarray],
        action: np.ndarray,
        reward: np.ndarray,
        done: np.ndarray,
        infos: List[Dict[str, Any]],
    ) -> None:
        """
        Add transition to replay buffer with HER.

        Args:
            obs: Current observations (dict with 'observation', 'achieved_goal', 'desired_goal')
            next_obs: Next observations (same dict structure)
            action: Actions taken
            reward: Rewards received
            done: Episode termination flags
            infos: Additional information from environment
        """

    def sample(self, batch_size: int, env: Optional[VecNormalize] = None) -> DictReplayBufferSamples:
        """
        Sample batch of transitions with hindsight goals.

        Args:
            batch_size: Number of transitions to sample
            env: Optional VecNormalize wrapper used to normalize sampled observations and rewards

        Returns:
            Batch of experience samples mixing original and hindsight transitions
        """

    def _sample_goals(
        self,
        batch_indices: np.ndarray,
        env_indices: np.ndarray,
    ) -> np.ndarray:
        """
        Sample goals for hindsight experience replay.

        Args:
            batch_indices: Buffer indices of the transitions to relabel
            env_indices: Indices of the environments those transitions came from

        Returns:
            Array of sampled goals
        """

    def _compute_episode_length(self, env_idx: int) -> None:
        """
        Compute and store the length of the episode that just ended.

        Args:
            env_idx: Index of the environment whose episode finished
        """

    def truncate_last_trajectory(self) -> None:
        """Mark the last, still unfinished trajectory in the buffer as ended (truncated)."""
```

### Goal Selection Strategies

Strategies for choosing which achieved goals replace the original goal when creating hindsight experience. Each trades goal diversity against relevance to the current transition.

```python { .api }
from enum import Enum

class GoalSelectionStrategy(Enum):
    """
    Enumeration of goal selection strategies for HER.

    Strategies:
        FUTURE: Sample goals achieved later in the same episode
        FINAL: Use the final achieved goal of the episode
        EPISODE: Sample goals achieved at any step of the episode
    """
    FUTURE = 0
    FINAL = 1
    EPISODE = 2

KEY_TO_GOAL_STRATEGY: Dict[str, GoalSelectionStrategy] = {
    "future": GoalSelectionStrategy.FUTURE,
    "final": GoalSelectionStrategy.FINAL,
    "episode": GoalSelectionStrategy.EPISODE,
}
```

### Environment Requirements

HER requires a goal-conditioned environment with the following structure and interfaces.

```python { .api }
# Schematic summary. The names below are illustrative, not constants exported by stable_baselines3.

# Required observation space structure for HER
# (Box stands in for a concrete Box(low, high, shape) instance)
HER_OBSERVATION_SPACE = gym.spaces.Dict({
    'observation': gym.spaces.Box,    # Environment state
    'achieved_goal': gym.spaces.Box,  # Currently achieved goal
    'desired_goal': gym.spaces.Box,   # Desired goal for this episode
})

# Required environment method, called to recompute rewards for relabeled goals:
#   compute_reward(achieved_goal, desired_goal, info) -> rewards

# Info dict key the environment should provide
REQUIRED_INFO_KEYS = [
    'is_success',  # Boolean indicating if goal was achieved (used for success-rate logging)
]

# Optional info dict keys
OPTIONAL_INFO_KEYS = [
    'TimeLimit.truncated',  # Boolean indicating timeout termination
]
```
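Before training it can help to check these requirements explicitly. The helper below is a minimal sketch, not part of stable-baselines3; `missing_her_keys` is a hypothetical name, and it accepts any mapping, for example `env.observation_space.spaces` or a single observation dict.

```python
REQUIRED_KEYS = ("observation", "achieved_goal", "desired_goal")

def missing_her_keys(mapping):
    """Return the goal-conditioned keys that are absent from a mapping."""
    return [key for key in REQUIRED_KEYS if key not in mapping]

# A well-formed observation and one that lacks the goal entries.
good_obs = {"observation": [0.0, 0.0], "achieved_goal": [0.0, 0.0], "desired_goal": [1.0, 1.0]}
bad_obs = {"observation": [0.0, 0.0]}

print(missing_her_keys(good_obs))
print(missing_her_keys(bad_obs))
```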

## Usage Examples

### Basic HER Setup with SAC

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.vec_env import DummyVecEnv

# Goal-conditioned environment (e.g., FetchReach-v1). Fetch tasks come from a
# separate robotics package, and the version suffix depends on what is installed.
ENV_ID = "FetchReach-v1"

# Wrap in vectorized environment
env = DummyVecEnv([lambda: gym.make(ENV_ID)])

# Verify environment has proper goal-conditioned structure
assert isinstance(env.observation_space, gym.spaces.Dict)
for key in ("observation", "achieved_goal", "desired_goal"):
    assert key in env.observation_space.spaces

# Configure SAC with HER
model = SAC(
    "MultiInputPolicy",  # Required for dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
)

# Train the agent
model.learn(total_timesteps=100_000)
```

### Advanced HER Configuration

```python
from stable_baselines3 import TD3
from stable_baselines3.her import GoalSelectionStrategy, HerReplayBuffer

# Custom HER buffer configuration
her_kwargs = dict(
    n_sampled_goal=8,  # More hindsight goals per transition
    goal_selection_strategy=GoalSelectionStrategy.FUTURE,
    handle_timeout_termination=True,
    optimize_memory_usage=False,
)

# Use with TD3 (also works with DDPG, SAC); `env` is the vectorized
# goal-conditioned environment from the basic example
model = TD3(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=her_kwargs,
    buffer_size=1_000_000,
    learning_starts=1000,
    batch_size=256,
    verbose=1,
)

model.learn(total_timesteps=500_000)
```

### Custom Goal-Conditioned Environment

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.vec_env import DummyVecEnv

class SimpleGoalEnv(gym.Env):
    """Simple goal-conditioned environment for HER demonstration."""

    def __init__(self):
        super().__init__()

        # Define spaces
        self.action_space = gym.spaces.Box(-1, 1, (2,), dtype=np.float32)

        # Goal-conditioned observation space
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'achieved_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'desired_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
        })

        self.goal_threshold = 0.1
        self.max_steps = 50

    def _get_obs(self):
        # Build the dict observation expected by HER
        return {
            'observation': self.position.copy(),
            'achieved_goal': self.position.copy(),
            'desired_goal': self.goal.copy(),
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        # Random initial position and goal, cast to the declared float32 dtype
        self.position = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)
        self.goal = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)

        self.step_count = 0

        return self._get_obs(), {}

    def step(self, action):
        # Move based on action and stay inside the bounds
        move = 0.1 * np.asarray(action, dtype=np.float32)
        self.position = np.clip(self.position + move, -5, 5).astype(np.float32)

        # Check if goal is achieved
        distance = float(np.linalg.norm(self.position - self.goal))
        is_success = distance < self.goal_threshold

        # Sparse reward: 0 for success, -1 otherwise.
        # Reuse compute_reward so step() and HER relabeling always agree.
        reward = float(self.compute_reward(self.position, self.goal, {}))

        self.step_count += 1
        terminated = is_success
        truncated = self.step_count >= self.max_steps

        info = {
            'is_success': is_success,
            'distance': distance,
        }

        return self._get_obs(), reward, terminated, truncated, info

    def compute_reward(self, achieved_goal, desired_goal, info):
        """Compute reward for HER (works on single goals and on batches)."""
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        # 0 where the goal is reached, -1 elsewhere (same convention as step())
        return -(distance >= self.goal_threshold).astype(np.float32)

# Use custom environment with HER
vec_env = DummyVecEnv([lambda: SimpleGoalEnv()])

model = SAC(
    "MultiInputPolicy",
    vec_env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
)

model.learn(total_timesteps=50_000)
```

### HER with Different Goal Selection Strategies

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv

# Compare the goal selection strategies accepted by HerReplayBuffer
strategies = ["future", "final", "episode"]
models = {}

for strategy in strategies:
    print(f"Training with {strategy} strategy...")

    env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])

    model = SAC(
        "MultiInputPolicy",
        env,
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(
            n_sampled_goal=4,
            goal_selection_strategy=strategy,
        ),
        verbose=0,
    )

    model.learn(total_timesteps=25_000)
    models[strategy] = model

# Evaluate performance on a separate environment instance
eval_env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])

for strategy, model in models.items():
    mean_reward, std_reward = evaluate_policy(
        model,
        eval_env,
        n_eval_episodes=20,
        deterministic=True,
    )
    print(f"{strategy}: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Monitoring HER Training

```python
import gymnasium as gym
import numpy as np
from stable_baselines3.common.callbacks import BaseCallback

class HERMonitorCallback(BaseCallback):
    """Custom callback to monitor HER training progress."""

    def __init__(self, eval_env, verbose=0):
        super().__init__(verbose)
        self.eval_env = eval_env
        self.success_rates = []

    def _on_step(self) -> bool:
        # Log HER-specific metrics every 1000 steps
        if self.n_calls % 1000 == 0:
            # Evaluate success rate
            n_eval_episodes = 10
            successes = 0

            for _ in range(n_eval_episodes):
                # eval_env is a plain Gymnasium env: reset() returns (obs, info)
                obs, _ = self.eval_env.reset()
                done = False

                while not done:
                    action, _ = self.model.predict(obs, deterministic=True)
                    # step() returns separate terminated and truncated flags
                    obs, reward, terminated, truncated, info = self.eval_env.step(action)
                    done = terminated or truncated

                    if info.get('is_success', False):
                        successes += 1
                        break

            success_rate = successes / n_eval_episodes
            self.success_rates.append(success_rate)

            # Record through the SB3 logger (shown in TensorBoard when enabled)
            self.logger.record("eval/success_rate", success_rate)
            self.logger.record("eval/mean_success_rate", np.mean(self.success_rates[-10:]))

        return True

# Use monitoring callback; `model` is one of the HER models created above
eval_env = gym.make("FetchReach-v1")
monitor_callback = HERMonitorCallback(eval_env, verbose=1)

model.learn(total_timesteps=100_000, callback=monitor_callback)
```

## Implementation Notes

### Environment Compatibility

For an environment to work with HER, it must:

1. **Observation Space**: Use `gym.spaces.Dict` with keys:
   - `'observation'`: The actual environment state
   - `'achieved_goal'`: Currently achieved goal
   - `'desired_goal'`: Target goal for the episode

2. **Info Dictionary**: Return an `'is_success'` boolean in the info dict so that the success rate can be tracked

3. **Reward Function**: Implement a `compute_reward(achieved_goal, desired_goal, info)` method. HER calls it to recompute rewards for relabeled goals, so it must accept batches of goals and follow the same reward convention as `step()`

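The third requirement is easy to get wrong when `step()` and `compute_reward` are written separately. The sketch below, in plain Python and with a hypothetical threshold and function names, shows one way to keep them consistent: the batch function simply reuses the per-step rule.

```python
import math

GOAL_THRESHOLD = 0.1  # hypothetical success radius

def step_reward(position, goal):
    """Sparse reward as an environment's step() might compute it."""
    return 0.0 if math.dist(position, goal) < GOAL_THRESHOLD else -1.0

def compute_reward(achieved_goals, desired_goals, info=None):
    """Batch version: one reward per (achieved, desired) pair, same convention."""
    return [step_reward(a, d) for a, d in zip(achieved_goals, desired_goals)]

achieved = [(0.0, 0.0), (1.0, 1.0)]
desired = [(0.05, 0.0), (3.0, 3.0)]
rewards = compute_reward(achieved, desired)
print(rewards)
```

A real environment would typically vectorize this with NumPy, but the contract is the same: equal inputs must yield the reward that `step()` would have returned.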
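### Goal Selection Strategy Trade-offs

- **Future**: The default and usually the most sample-efficient choice; relabels with goals reached later in the same episode
- **Final**: Simple but less diverse; only uses the state reached at the end of the episode
- **Episode**: More diverse but potentially less focused; any state of the episode can become the goal
- **Random**: Described in the original HER work (goals drawn from unrelated experience, so many are irrelevant), but not one of the strategy strings `KEY_TO_GOAL_STRATEGY` accepts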

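As a rough illustration of how the three built-in strategies differ, the function below picks which step of an episode supplies the hindsight goal. It is a simplified sketch rather than the library's implementation: `sample_goal_index` is a hypothetical name, and it assumes that the "future" range includes the current step (whose next achieved goal would be used).

```python
import random

def sample_goal_index(strategy, t, episode_length, rng):
    """Pick the index of the step whose achieved goal becomes the new goal."""
    if strategy == "final":
        # Always the last step of the episode
        return episode_length - 1
    if strategy == "future":
        # Any step from the current one to the end of the episode
        return rng.randint(t, episode_length - 1)
    if strategy == "episode":
        # Any step of the episode, before or after the current one
        return rng.randrange(episode_length)
    raise ValueError(f"Unknown strategy: {strategy}")

rng = random.Random(0)
for name in ("future", "final", "episode"):
    picks = [sample_goal_index(name, t=3, episode_length=10, rng=rng) for _ in range(5)]
    print(name, picks)
```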
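### Performance Considerations

- HER adds bookkeeping on top of the dict replay buffer (episode boundaries for the stored transitions), so memory use is somewhat higher than with a plain buffer
- `n_sampled_goal` controls the share of relabeled samples in each batch; higher values add more hindsight signal but leave fewer real transitions and require more reward recomputation
- The `online_sampling` and `max_episode_length` arguments of 1.x releases are not accepted in 2.x; goals are relabeled when a batch is sampled
- Requires an off-policy algorithm that uses a replay buffer (SAC, TD3, DDPG, DQN)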

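To make the effect of `n_sampled_goal` concrete, the sketch below computes how a batch would split between real and relabeled transitions. It assumes the ratio `1 - 1 / (n_sampled_goal + 1)` with the relabeled count rounded down, which is my understanding of the 2.x buffer; `her_ratio` and `split_batch` are hypothetical helper names.

```python
def her_ratio(n_sampled_goal):
    """Fraction of a batch made of relabeled (virtual) transitions."""
    return 1 - 1.0 / (n_sampled_goal + 1)

def split_batch(batch_size, n_sampled_goal):
    """Return (real, virtual) transition counts for one sampled batch."""
    n_virtual = int(her_ratio(n_sampled_goal) * batch_size)
    return batch_size - n_virtual, n_virtual

for n in (1, 4, 8):
    real, virtual = split_batch(256, n)
    print(f"n_sampled_goal={n}: {real} real, {virtual} virtual")
```

With the default `n_sampled_goal=4`, about four out of five sampled transitions carry a hindsight goal.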
## Types

```python { .api }
from typing import Union, Optional, Type, Callable, Dict, Any, List, Tuple
import numpy as np
import torch
import gymnasium as gym
from stable_baselines3.common.type_aliases import GymEnv, DictReplayBufferSamples, ReplayBufferSamples
from stable_baselines3.common.buffers import DictReplayBuffer, ReplayBuffer
from stable_baselines3.her import HerReplayBuffer, GoalSelectionStrategy
from stable_baselines3.common.vec_env import VecEnv, VecNormalize
from stable_baselines3.common.base_class import BaseAlgorithm
```
