# Hindsight Experience Replay (HER)

Implementation of Hindsight Experience Replay for goal-conditioned reinforcement learning. HER learns from failed attempts by treating them as successful attempts toward different goals, which can substantially improve sample efficiency in sparse-reward environments.

## Capabilities

### HER Replay Buffer

Specialized replay buffer that implements the Hindsight Experience Replay algorithm by automatically generating additional training samples from failed episodes.

```python { .api }
class HerReplayBuffer(ReplayBuffer):
    """
    Replay buffer with Hindsight Experience Replay.

    Args:
        buffer_size: Maximum buffer capacity
        observation_space: Observation space (must include 'observation', 'achieved_goal', 'desired_goal')
        action_space: Action space
        env_info: Additional environment information
        device: PyTorch device placement
        n_envs: Number of parallel environments
        optimize_memory_usage: Enable memory optimizations
        handle_timeout_termination: Handle timeout terminations properly
        n_sampled_goal: Number of virtual transitions per real transition
        goal_selection_strategy: Strategy for selecting goals ("future", "final", "episode", "random")
        wrapped_env: Environment wrapper for HER
        online_sampling: Whether to sample goals online during training
        max_episode_length: Maximum episode length for buffer management
    """
    def __init__(
        self,
        buffer_size: int,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        env_info: Optional[Dict[str, Any]] = None,
        device: Union[torch.device, str] = "auto",
        n_envs: int = 1,
        optimize_memory_usage: bool = False,
        handle_timeout_termination: bool = True,
        n_sampled_goal: int = 4,
        goal_selection_strategy: Union[GoalSelectionStrategy, str] = "future",
        wrapped_env: Optional[VecEnv] = None,
        online_sampling: bool = True,
        max_episode_length: Optional[int] = None,
    ): ...

    def add(
        self,
        obs: Dict[str, np.ndarray],
        next_obs: Dict[str, np.ndarray],
        actions: np.ndarray,
        rewards: np.ndarray,
        dones: np.ndarray,
        infos: List[Dict[str, Any]],
    ) -> None:
        """
        Add transition to replay buffer with HER.

        Args:
            obs: Current observations (dict with 'observation', 'achieved_goal', 'desired_goal')
            next_obs: Next observations (dict with the same keys)
            actions: Actions taken
            rewards: Rewards received
            dones: Episode termination flags
            infos: Additional information from environment
        """

    def sample(self, batch_size: int, env: Optional[VecEnv] = None) -> ReplayBufferSamples:
        """
        Sample batch of transitions with hindsight goals.

        Args:
            batch_size: Number of transitions to sample
            env: Environment for computing rewards (if None, uses wrapped_env)

        Returns:
            Batch of experience samples with original and hindsight transitions
        """

    def _sample_goals(
        self,
        episode_transitions: List[Dict[str, np.ndarray]],
        transition_idx: int,
        n_sampled_goal: int,
    ) -> np.ndarray:
        """
        Sample goals for hindsight experience replay.

        Args:
            episode_transitions: List of transitions from episode
            transition_idx: Index of current transition
            n_sampled_goal: Number of goals to sample

        Returns:
            Array of sampled goals
        """

    def _store_episode(
        self,
        episode_transitions: List[Dict[str, np.ndarray]],
        is_success: bool,
    ) -> None:
        """
        Store episode transitions and generate HER samples.

        Args:
            episode_transitions: List of transitions from completed episode
            is_success: Whether episode was successful
        """

    def truncate_last_trajectory(self) -> None:
        """Truncate last incomplete trajectory from buffer."""
```
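
To make the relabeling idea concrete, here is a minimal, library-independent sketch of what happens to a single stored transition. It is not the `HerReplayBuffer` implementation: the helper names `sparse_reward` and `relabel_transition`, the dict layout, and the 0.05 success threshold are all assumptions chosen for illustration.

```python
import math


def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Hypothetical sparse reward: 0.0 on success, -1.0 otherwise."""
    return 0.0 if math.dist(achieved_goal, desired_goal) < threshold else -1.0


def relabel_transition(transition, new_goal, reward_fn=sparse_reward):
    """Return a copy of `transition` whose desired goal is replaced by `new_goal`.

    The reward is recomputed against the new goal, which is why a failed
    transition can count as a success in hindsight.
    """
    relabeled = dict(transition)
    relabeled["desired_goal"] = tuple(new_goal)
    relabeled["reward"] = reward_fn(transition["next_achieved_goal"], new_goal)
    return relabeled


# A failed step: the agent reached (1.0, 1.0) but wanted to reach (3.0, 3.0).
failed = {
    "achieved_goal": (0.9, 1.0),
    "next_achieved_goal": (1.0, 1.0),
    "desired_goal": (3.0, 3.0),
    "reward": -1.0,
}

# Pretend the goal had been the state that was actually reached.
hindsight = relabel_transition(failed, new_goal=failed["next_achieved_goal"])
print(hindsight["reward"])  # 0.0
```

The original transition is left untouched, so the buffer can serve both the real and the relabeled version during sampling.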

### Goal Selection Strategies

Different strategies for selecting which goals to use when creating hindsight experience, each with different trade-offs for learning efficiency.

```python { .api }
class GoalSelectionStrategy:
    """
    Enumeration of goal selection strategies for HER.

    Strategies:
        FUTURE: Sample goals from future states in the same episode
        FINAL: Use the final achieved goal from the episode
        EPISODE: Sample goals from any state in the episode
        RANDOM: Sample completely random goals
    """
    FUTURE = "future"
    FINAL = "final"
    EPISODE = "episode"
    RANDOM = "random"

KEY_TO_GOAL_STRATEGY: Dict[str, GoalSelectionStrategy] = {
    "future": GoalSelectionStrategy.FUTURE,
    "final": GoalSelectionStrategy.FINAL,
    "episode": GoalSelectionStrategy.EPISODE,
    "random": GoalSelectionStrategy.RANDOM,
}
```
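
The difference between the episode-local strategies comes down to which time steps may donate their achieved goal. The sketch below illustrates that choice with plain indices. The function names are hypothetical, "future" is assumed to mean strictly after the current step, and "random" is left out because it draws from outside the episode.

```python
import random


def candidate_goal_indices(strategy, transition_idx, episode_length):
    """Return the episode steps whose achieved goal may be used for relabeling."""
    if strategy == "future":
        # Assumption: only steps strictly after the current one qualify
        return list(range(transition_idx + 1, episode_length))
    if strategy == "final":
        # Only the last step of the episode
        return [episode_length - 1]
    if strategy == "episode":
        # Any step of the episode, past or future
        return list(range(episode_length))
    raise ValueError(f"Unsupported strategy in this sketch: {strategy}")


def sample_goal_indices(strategy, transition_idx, episode_length, n_sampled_goal, rng):
    """Draw `n_sampled_goal` indices (with replacement) from the candidate steps."""
    candidates = candidate_goal_indices(strategy, transition_idx, episode_length)
    if not candidates:
        return []  # e.g. "future" on the last step of an episode
    return [rng.choice(candidates) for _ in range(n_sampled_goal)]


rng = random.Random(0)
for name in ("future", "final", "episode"):
    print(name, candidate_goal_indices(name, transition_idx=2, episode_length=5))
```

Sampling with replacement keeps the sketch short; whether duplicates are acceptable is a design choice the source does not specify.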

### Environment Requirements

HER requires specific environment structure and interfaces to function properly with goal-conditioned learning.

```python { .api }
# Required observation space structure for HER
HER_OBSERVATION_SPACE = gym.spaces.Dict({
    'observation': gym.spaces.Box,    # Environment state
    'achieved_goal': gym.spaces.Box,  # Currently achieved goal
    'desired_goal': gym.spaces.Box,   # Desired goal for this episode
})

# Required info dict keys from environment
REQUIRED_INFO_KEYS = [
    'is_success',  # Boolean indicating if goal was achieved
]

# Optional info dict keys
OPTIONAL_INFO_KEYS = [
    'TimeLimit.truncated',  # Boolean indicating timeout termination
]
```
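
A quick structural check can catch a misconfigured environment before training starts. The following sketch works on plain dicts as returned by a single environment step. The function name `missing_her_fields` is hypothetical, and the goal-length comparison is an assumed extra requirement (a relabeled goal replaces `desired_goal`, so the two should have the same shape).

```python
REQUIRED_OBS_KEYS = ("observation", "achieved_goal", "desired_goal")


def missing_her_fields(obs, info):
    """Return a list of problems; an empty list means the step looks HER-compatible."""
    if not isinstance(obs, dict):
        return ["observation is not a dict"]

    problems = []
    for key in REQUIRED_OBS_KEYS:
        if key not in obs:
            problems.append(f"missing observation key: {key}")

    # Assumed requirement: achieved and desired goals live in the same space
    if "achieved_goal" in obs and "desired_goal" in obs:
        if len(obs["achieved_goal"]) != len(obs["desired_goal"]):
            problems.append("achieved_goal and desired_goal differ in length")

    if "is_success" not in info:
        problems.append("missing info key: is_success")
    return problems


obs = {"observation": [0.0, 0.0], "achieved_goal": [0.0, 0.0], "desired_goal": [1.0, 1.0]}
print(missing_her_fields(obs, {"is_success": False}))  # []
```

Running such a check once on the output of `reset` and one `step` is usually enough, since the structure does not change during an episode.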

## Usage Examples

### Basic HER Setup with SAC

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.her import HerReplayBuffer
from stable_baselines3.common.vec_env import DummyVecEnv

# Create goal-conditioned environment (e.g., FetchReach-v1)
env = gym.make("FetchReach-v1")

# Verify environment has proper goal-conditioned structure
assert isinstance(env.observation_space, gym.spaces.Dict)
assert "observation" in env.observation_space.spaces
assert "achieved_goal" in env.observation_space.spaces
assert "desired_goal" in env.observation_space.spaces

# Wrap in vectorized environment
env = DummyVecEnv([lambda: env])

# Configure SAC with HER
model = SAC(
    "MultiInputPolicy",  # Required for dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=True,
        max_episode_length=50,
    ),
    verbose=1
)

# Train the agent
model.learn(total_timesteps=100000)
```

### Advanced HER Configuration

```python
from stable_baselines3 import TD3
from stable_baselines3.her import GoalSelectionStrategy, HerReplayBuffer

# Custom HER buffer configuration
her_kwargs = dict(
    n_sampled_goal=8,  # More hindsight goals per transition
    goal_selection_strategy=GoalSelectionStrategy.FUTURE,
    online_sampling=True,
    max_episode_length=100,
    handle_timeout_termination=True,
    optimize_memory_usage=False,
)

# Use with TD3 (also works with DDPG, SAC).
# `env` is the goal-conditioned environment created in the basic setup.
model = TD3(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=her_kwargs,
    buffer_size=1000000,
    learning_starts=1000,
    batch_size=256,
    verbose=1
)

model.learn(total_timesteps=500000)
```

### Custom Goal-Conditioned Environment

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.her import HerReplayBuffer


class SimpleGoalEnv(gym.Env):
    """Simple goal-conditioned environment for HER demonstration."""

    def __init__(self):
        super().__init__()

        # Define spaces
        self.action_space = gym.spaces.Box(-1, 1, (2,), dtype=np.float32)

        # Goal-conditioned observation space
        self.observation_space = gym.spaces.Dict({
            'observation': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'achieved_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
            'desired_goal': gym.spaces.Box(-5, 5, (2,), dtype=np.float32),
        })

        self.goal_threshold = 0.1
        self.max_steps = 50

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        # Random initial position (cast to float32 to match the declared spaces)
        self.position = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)

        # Random goal
        self.goal = self.np_random.uniform(-5, 5, (2,)).astype(np.float32)

        self.step_count = 0

        obs = {
            'observation': self.position.copy(),
            'achieved_goal': self.position.copy(),
            'desired_goal': self.goal.copy(),
        }

        return obs, {}

    def step(self, action):
        # Move based on action
        self.position += action * 0.1
        self.position = np.clip(self.position, -5, 5)

        # Check if goal is achieved
        distance = np.linalg.norm(self.position - self.goal)
        is_success = bool(distance < self.goal_threshold)

        # Sparse reward: 0 for success, -1 otherwise
        reward = 0.0 if is_success else -1.0

        self.step_count += 1
        terminated = is_success
        truncated = self.step_count >= self.max_steps

        obs = {
            'observation': self.position.copy(),
            'achieved_goal': self.position.copy(),
            'desired_goal': self.goal.copy(),
        }

        info = {
            'is_success': is_success,
            'distance': distance,
        }

        return obs, reward, terminated, truncated, info

    def compute_reward(self, achieved_goal, desired_goal, info):
        """Compute reward for HER; must agree with the reward returned by step()."""
        distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        # Same sparse reward as step(): 0 for success, -1 otherwise (works on batches)
        return np.where(distance < self.goal_threshold, 0.0, -1.0).astype(np.float32)

# Use custom environment with HER
custom_env = SimpleGoalEnv()
vec_env = DummyVecEnv([lambda: custom_env])

model = SAC(
    "MultiInputPolicy",
    vec_env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1
)

model.learn(total_timesteps=50000)
```

### HER with Different Goal Selection Strategies

```python
import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.her import HerReplayBuffer

# Compare different goal selection strategies
strategies = ["future", "final", "episode", "random"]
models = {}

for strategy in strategies:
    print(f"Training with {strategy} strategy...")

    env = DummyVecEnv([lambda: gym.make("FetchReach-v1")])

    model = SAC(
        "MultiInputPolicy",
        env,
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(
            n_sampled_goal=4,
            goal_selection_strategy=strategy,
        ),
        verbose=0
    )

    model.learn(total_timesteps=25000)
    models[strategy] = model

# Evaluate performance
for strategy, model in models.items():
    mean_reward, std_reward = evaluate_policy(
        model,
        model.get_env(),  # Evaluate each model on the environment it was trained on
        n_eval_episodes=20,
        deterministic=True
    )
    print(f"{strategy}: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Monitoring HER Training

```python
import gymnasium as gym
import numpy as np
from stable_baselines3.common.callbacks import BaseCallback

class HERMonitorCallback(BaseCallback):
    """Custom callback to monitor HER training progress."""

    def __init__(self, eval_env, verbose=0):
        super().__init__(verbose)
        self.eval_env = eval_env
        self.success_rates = []

    def _on_step(self) -> bool:
        # Log HER-specific metrics every 1000 steps
        if self.n_calls % 1000 == 0:
            # Evaluate success rate
            n_eval_episodes = 10
            successes = 0

            for _ in range(n_eval_episodes):
                # Gymnasium API: reset() returns (obs, info)
                obs, _ = self.eval_env.reset()
                done = False

                while not done:
                    action, _ = self.model.predict(obs, deterministic=True)
                    # Gymnasium API: step() returns five values
                    obs, reward, terminated, truncated, info = self.eval_env.step(action)
                    done = terminated or truncated

                    if info.get('is_success', False):
                        successes += 1
                        break

            success_rate = successes / n_eval_episodes
            self.success_rates.append(success_rate)

            # Log to tensorboard
            self.logger.record("eval/success_rate", success_rate)
            self.logger.record("eval/mean_success_rate", np.mean(self.success_rates[-10:]))

        return True

# Use monitoring callback (`model` is an HER-enabled model from the examples above)
eval_env = gym.make("FetchReach-v1")
monitor_callback = HERMonitorCallback(eval_env, verbose=1)

model.learn(total_timesteps=100000, callback=monitor_callback)
```

## Implementation Notes

### Environment Compatibility

For an environment to work with HER, it must:

1. **Observation Space**: Use `gym.spaces.Dict` with keys:
   - `'observation'`: The actual environment state
   - `'achieved_goal'`: Currently achieved goal
   - `'desired_goal'`: Target goal for the episode

2. **Info Dictionary**: Return an `'is_success'` boolean in the info dict

3. **Reward Function**: Implement a `compute_reward(achieved_goal, desired_goal, info)` method so that rewards can be recomputed for relabeled goals; it should accept batched inputs and agree with the reward returned by `step`
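
The third requirement is the easiest to get wrong, because a `compute_reward` that disagrees with `step` silently corrupts relabeled rewards. The sketch below shows one way to keep the two in sync by deriving the batched function from the per-step one. It uses plain tuples instead of arrays, and the names and the 0.1 threshold are assumptions for illustration.

```python
import math

GOAL_THRESHOLD = 0.1  # assumed success radius


def step_reward(position, goal):
    """Reward as the environment would return it from step(): 0.0 on success, else -1.0."""
    return 0.0 if math.dist(position, goal) < GOAL_THRESHOLD else -1.0


def compute_reward(achieved_goals, desired_goals, infos=None):
    """Batched reward: one value per (achieved, desired) pair, reusing step_reward."""
    return [step_reward(a, d) for a, d in zip(achieved_goals, desired_goals)]


achieved = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.05)]
desired = [(0.0, 0.05), (3.0, 3.0), (2.0, 2.0)]

rewards = compute_reward(achieved, desired)
print(rewards)  # [0.0, -1.0, 0.0]

# Consistency check: the batched result matches the per-step reward element-wise
assert rewards == [step_reward(a, d) for a, d in zip(achieved, desired)]
```

With array-based environments the same idea applies: compute the distance along the last axis and apply the identical threshold and reward values used in `step`.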

### Goal Selection Strategy Trade-offs

- **Future**: Usually the most sample-efficient option; relabels with goals that were actually reached later in the same episode
- **Final**: Simple but less diverse; only uses the final achieved goal of the episode
- **Episode**: More diverse, since any state of the episode can serve as a goal, but potentially less focused
- **Random**: Highest diversity, but may include goals that are irrelevant to the episode

### Performance Considerations

- HER increases memory usage because complete episode transitions must be stored
- `n_sampled_goal` controls the ratio of relabeled to real transitions; higher values can speed up learning but increase computation
- `online_sampling=True` is more memory efficient but slightly slower
- Only applies to off-policy algorithms, which train from a replay buffer (SAC, TD3, DDPG, DQN)
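
Since `n_sampled_goal` is the number of virtual transitions per real transition, the share of relabeled samples follows directly as `n / (n + 1)`. The short calculation below works this out for a batch of 256. The helper names are hypothetical, and rounding the relabeled count down is an assumption.

```python
def her_ratio(n_sampled_goal):
    """Fraction of sampled transitions that carry a hindsight goal."""
    return n_sampled_goal / (n_sampled_goal + 1)


def split_batch(batch_size, n_sampled_goal):
    """Split a batch into (relabeled, real) counts, rounding the relabeled part down."""
    n_relabeled = int(batch_size * her_ratio(n_sampled_goal))
    return n_relabeled, batch_size - n_relabeled


for n in (1, 4, 8):
    print(n, round(her_ratio(n), 3), split_batch(256, n))
# 1 0.5 (128, 128)
# 4 0.8 (204, 52)
# 8 0.889 (227, 29)
```

Going from 4 to 8 sampled goals raises the relabeled share only from 80% to about 89%, so the benefit of larger values flattens out quickly.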

## Types

```python { .api }
from typing import Union, Optional, Type, Callable, Dict, Any, List, Tuple
import numpy as np
import torch
import gymnasium as gym
from stable_baselines3.common.type_aliases import GymEnv, ReplayBufferSamples
from stable_baselines3.common.buffers import ReplayBuffer
from stable_baselines3.her.her_replay_buffer import HerReplayBuffer, GoalSelectionStrategy
from stable_baselines3.common.vec_env import VecEnv
from stable_baselines3.common.base_class import BaseAlgorithm
```