
# Error Handling

Exception hierarchy and error-handling strategies for API errors, network failures, authentication problems, and client-side errors, with retry and recovery patterns.

## Capabilities

### Exception Hierarchy

Exception classes in `datadog.api.exceptions` that identify specific failure scenarios. Every class derives directly from `DatadogException`, so a single `except DatadogException` clause catches all of them.

```python { .api }
class DatadogException(Exception):
    """
    Base class for all Datadog API client exceptions.

    Catch this to handle any Datadog-specific error in one clause.
    The error description is available via str(exc) and exc.args.
    """

class ApiError(DatadogException):
    """
    Datadog returned a known API error with an error body.

    Raised for 400, 401, 403, 404, 409 and 429 responses whose JSON body
    contains an "errors" key, for example:
    - Invalid API key or application key
    - Malformed API requests
    - API rate limiting
    - Resource not found
    - Permission denied

    exc.args[0] holds the decoded response body, e.g. {"errors": [...]}.
    """

class ClientError(DatadogException):
    """
    The HTTP connection to the Datadog endpoint could not be established.

    Raised for connection failures such as DNS errors, refused
    connections, or SSL/TLS errors.
    """

class HttpTimeout(DatadogException):
    """
    An API request took longer than the configured timeout, for example
    because of network latency or high load on the Datadog API.
    """

class HttpBackoff(DatadogException):
    """
    The client is backing off after too many timeouts.

    While the backoff period is active, requests are refused locally
    instead of being sent to the API.
    """

class HTTPError(DatadogException):
    """
    Datadog returned a non-2xx status code not reported as ApiError
    (typically 5xx server errors).

    The status code and reason are included in the exception message;
    they are not exposed as separate attributes.
    """

class ProxyError(DatadogException):
    """
    The HTTP connection to the configured proxy server failed
    (proxy unreachable or misconfigured).
    """

class ApiNotInitialized(DatadogException):
    """
    An API call was made before an API key was configured,
    e.g. initialize() was not called before API usage.
    """
```
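
The concrete exception class is the main signal for deciding how to react. The sketch below maps each class name listed above to a coarse category, for example to tag log lines or metrics. It is an assumption, not part of the library: `classify_datadog_error` and the category labels are hypothetical, and matching on `type(exc).__name__` (which ignores subclasses) keeps it importable without the `datadog` package.

```python
# Hypothetical helper: buckets Datadog exceptions by class name so that the
# sketch runs without importing datadog. The category labels are illustrative.
CATEGORIES = {
    "HttpTimeout": "transient",
    "HttpBackoff": "transient",
    "HTTPError": "transient",
    "ClientError": "connection",
    "ProxyError": "configuration",
    "ApiNotInitialized": "configuration",
    "ApiError": "request",
}


def classify_datadog_error(exc: BaseException) -> str:
    """Return a coarse category for an exception instance, or "unknown"."""
    return CATEGORIES.get(type(exc).__name__, "unknown")
```

In an `except DatadogException as e:` handler, `classify_datadog_error(e)` can then be attached to the log record or used as a metric tag.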

### Error Suppression Control

Control whether errors are raised or suppressed with the `mute` parameter of `initialize()`.

```python { .api }
# Global error suppression setting, configured via initialize(mute=...); defaults to True
# api._mute (bool): when True, ApiError and ClientError are logged instead of raised

# Error suppression affects:
# - API method calls (api.Event.create, api.Monitor.get, etc.)
# - Connection failures (ClientError)
# - Authentication, authorization and validation errors (ApiError)

# With mute=True, suppressed errors are still logged, and the call returns the
# error payload (a dict with an "errors" key) instead of raising
```
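
When errors are muted, a failed call does not raise, so callers that care about the outcome have to inspect the returned value. A minimal sketch, assuming a muted call returns the decoded error payload as a dict with an `errors` key (holding a list of messages or a single string), or a `(body, raw_response)` tuple when `return_raw_response=True`; `errors_from_response` is a hypothetical helper.

```python
from typing import Any, List


def errors_from_response(response: Any) -> List[str]:
    """Return the error messages carried by a (possibly muted) API response."""
    if isinstance(response, tuple):
        # Assumed (body, raw_response) shape when return_raw_response=True
        response = response[0]
    if not isinstance(response, dict):
        return []
    errors = response.get("errors") or []
    if isinstance(errors, str):
        # A single message rather than a list
        return [errors]
    return [str(error) for error in errors]
```

For example, `if errors_from_response(api.Event.create(title="t", text="x")): ...` lets muted code branch on failure without re-enabling exceptions.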

### StatsD Error Resilience

StatsD operations are fire-and-forget: transport problems are handled inside the client so that they do not interrupt the application.

```python { .api }
# StatsD error handling characteristics:
# - UDP transport failures are caught inside the client rather than raised
# - Socket errors don't interrupt application flow
# - Network issues don't block metric submission
# - Malformed metrics are dropped without errors

# StatsD errors that may occur (handled internally):
# - Socket creation failures
# - DNS resolution errors for statsd_host
# - Permission errors for Unix Domain Sockets
# - Network unreachable errors
```
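
Part of the fire-and-forget behavior comes from UDP itself: a datagram sent to a port with no listener typically leaves the sender without an error, because UDP has no delivery acknowledgement. The sketch below illustrates the pattern with the standard library only; `fire_and_forget` is hypothetical and is not how `DogStatsd` is implemented, and the `page.views:1|c` payload follows the plain StatsD `name:value|type` line format.

```python
import socket


def fire_and_forget(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> bool:
    """Send one UDP datagram and report failure instead of raising.

    Returns True if the datagram was handed to the OS, False on any socket
    error. True does not mean anything was listening on the other end.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
        return True
    except OSError:
        # socket.error is an alias of OSError; this also covers DNS failures
        # (socket.gaierror) and oversized datagrams
        return False
```

Calling `fire_and_forget(b"page.views:1|c")` usually returns `True` even when no agent is running, which is exactly why metric loss is silent by design.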

## Usage Examples

### Basic Error Handling

```python
from datadog import initialize, api
from datadog.api.exceptions import (
    ApiError,
    ApiNotInitialized,
    ClientError,
    DatadogException,
)

# Disable error suppression so exceptions reach this code (mute defaults to True)
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    mute=False,
)

try:
    # API call that might fail
    monitor = api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{*} > 80",
        name="High CPU usage",
    )
    print(f"Monitor created with ID: {monitor['id']}")

except ApiNotInitialized:
    print("ERROR: Datadog not properly initialized")

except ApiError as e:
    # Authentication, permission, rate-limit or validation errors
    print(f"API Error: {e}")

except ClientError as e:
    # Connection failures (DNS, refused connection, SSL)
    print(f"Client Error: {e}")

except DatadogException as e:
    # Remaining Datadog errors: HttpTimeout, HttpBackoff, HTTPError, ProxyError
    print(f"Datadog Error: {e}")

except Exception as e:
    print(f"Unexpected error: {e}")
```

### Specific Exception Handling

```python
import time

from datadog import api
from datadog.api.exceptions import ApiError, HttpTimeout, HTTPError


def create_monitor_with_retry(monitor_config, max_retries=3):
    """Create a monitor, retrying only on transient error types.

    Assumes initialize(..., mute=False) so that errors are raised.
    """
    for attempt in range(max_retries):
        try:
            return api.Monitor.create(**monitor_config)

        except (HttpTimeout, HTTPError) as e:
            # Timeouts and non-2xx responses outside ApiError (typically 5xx)
            # are usually transient
            if attempt < max_retries - 1:
                print(f"{type(e).__name__} on attempt {attempt + 1}, retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
                continue
            print(f"Failed after {max_retries} attempts: {e}")
            raise

        except ApiError as e:
            # 400/401/403/404/409/429: authentication, permission or request errors.
            # ApiError does not expose the status code, so rate limiting (429)
            # cannot be told apart here and is not retried.
            print(f"API Error (not retryable): {e}")
            raise


# Usage
monitor_config = {
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
    "name": "High CPU usage",
}

try:
    monitor = create_monitor_with_retry(monitor_config)
    print(f"Monitor created: {monitor['id']}")
except Exception as e:
    print(f"Failed to create monitor: {e}")
```
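
A fixed `2 ** attempt` schedule makes many clients that failed at the same moment retry in lockstep. A common refinement, often called "full jitter", draws each delay uniformly between zero and the exponential bound. This is a sketch of that technique, not something the `datadog` package provides; `backoff_delays` and its `base`/`cap` parameters are hypothetical.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one "full jitter" sleep duration per retry.

    The delay before retry n is drawn uniformly from [0, min(cap, base * 2**n)].
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(max_retries)]
```

Inside the retry loop, `time.sleep(backoff_delays(max_retries)[attempt])` would replace `time.sleep(2 ** attempt)`; the `cap` keeps late retries from waiting unboundedly.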

### Error Handling with Raw Response Access

```python
from datadog import initialize, api
from datadog.api.exceptions import ApiError, HTTPError

# Return the raw HTTP response alongside the decoded body
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    return_raw_response=True,
    mute=False,
)

try:
    result = api.Event.create(
        title="Test Event",
        text="Testing error handling",
    )

    # With return_raw_response=True, result is a tuple of:
    # - the decoded response data
    # - the raw HTTP response object
    body, raw_response = result
    print(f"Event created: {body['event']['id']}")
    print(f"Status code: {raw_response.status_code}")
    print(f"Response headers: {raw_response.headers}")

except ApiError as e:
    # Raised for 400 (bad request - check your parameters),
    # 401 (unauthorized - check your API key), 403 (forbidden - check your
    # permissions), 404 (resource not found), 409 and 429.
    # e.args[0] holds the decoded error body, e.g. {"errors": [...]}
    print(f"API error response: {e.args[0]}")

except HTTPError as e:
    # Other non-2xx responses (typically 5xx); the status code and reason
    # are part of the exception message
    print(f"HTTP error: {e}")
```
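
The status-code hints used in this example can be kept in one lookup table and reused, for instance when logging `status_code` from a raw response object. A minimal sketch; `describe_status`, `STATUS_HINTS`, and the hint wording are hypothetical.

```python
# Hypothetical troubleshooting hints keyed by HTTP status code
STATUS_HINTS = {
    400: "Bad request - check your parameters",
    401: "Unauthorized - check your API key",
    403: "Forbidden - check your permissions",
    404: "Resource not found",
    429: "Rate limited - back off before retrying",
}


def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a short troubleshooting hint."""
    if status_code in STATUS_HINTS:
        return STATUS_HINTS[status_code]
    if 500 <= status_code < 600:
        return "Server error - may be temporary"
    if 200 <= status_code < 300:
        return "OK"
    return f"Unexpected status {status_code}"
```

For example, `logger.info(describe_status(raw_response.status_code))` after a successful call, or for any status code obtained elsewhere.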

### Graceful Degradation Pattern

```python
import logging

from datadog import api, statsd
from datadog.api.exceptions import DatadogException

logger = logging.getLogger(__name__)

# Both functions assume initialize(..., mute=False); with the default mute=True,
# ApiError and ClientError are logged and returned instead of raised.


def submit_metrics_with_fallback(metrics_data):
    """Submit metrics with graceful degradation."""

    # Primary: try API submission for persistent metrics
    try:
        api.Metric.send(**metrics_data)
        logger.info("Metrics submitted via API")
        return True

    except DatadogException as e:
        logger.warning(f"API submission failed: {e}")

        # Fallback: use StatsD for real-time metrics.
        # Assumes 'points' is a list of (timestamp, value) pairs.
        try:
            statsd.gauge(
                metrics_data['metric'],
                metrics_data['points'][-1][1],  # Latest value
                tags=metrics_data.get('tags', []),
            )
            logger.info("Metrics submitted via StatsD fallback")
            return True

        except Exception as statsd_error:
            logger.error(f"StatsD fallback failed: {statsd_error}")
            return False


def create_monitor_with_fallback(monitor_config):
    """Create monitor with fallback to simplified configuration."""

    try:
        # Try creating monitor with full configuration
        return api.Monitor.create(**monitor_config)

    except DatadogException as e:
        logger.warning(f"Full monitor creation failed: {e}")

        # Fallback: create simplified monitor
        simplified_config = {
            'type': monitor_config['type'],
            'query': monitor_config['query'],
            'name': f"[Simplified] {monitor_config['name']}",
        }

        try:
            return api.Monitor.create(**simplified_config)
        except DatadogException as retry_error:
            logger.error(f"Simplified monitor creation failed: {retry_error}")
            raise
```
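
Both functions follow the same shape: try a primary action, log a handled failure, then try the next option. That shape can be factored into a small helper. A sketch under stated assumptions: `first_successful` is hypothetical, takes zero-argument callables, and only falls through on the exception types passed in `handled`.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def first_successful(
    attempts: Sequence[Callable[[], T]],
    handled: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call each zero-argument callable in order and return the first result.

    Exceptions of a type listed in `handled` are logged and the next callable
    is tried; any other exception propagates immediately. If every attempt
    fails, the last handled exception is re-raised.
    """
    last_error: Optional[BaseException] = None
    for attempt in attempts:
        try:
            return attempt()
        except handled as exc:
            logger.warning("Attempt failed, trying next fallback: %s", exc)
            last_error = exc
    if last_error is None:
        raise ValueError("at least one attempt is required")
    raise last_error
```

With the Datadog client this could read `first_successful([lambda: api.Monitor.create(**full), lambda: api.Monitor.create(**simplified)], handled=(DatadogException,))`, where `full` and `simplified` are illustrative config dicts.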

### Circuit Breaker Pattern

```python
import time
from threading import Lock

from datadog import api
from datadog.api.exceptions import DatadogException


class DatadogCircuitBreaker:
    """Circuit breaker for Datadog API calls.

    Assumes initialize(..., mute=False): with mute=True, ApiError and
    ClientError are returned instead of raised and never trip the breaker.
    """

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection."""

        with self.lock:
            if self.state == 'OPEN':
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = 'HALF_OPEN'
                else:
                    raise DatadogException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)

            with self.lock:
                # Success resets failure count
                self.failure_count = 0
                if self.state == 'HALF_OPEN':
                    self.state = 'CLOSED'

            return result

        except DatadogException:
            with self.lock:
                self.failure_count += 1
                self.last_failure_time = time.time()

                if self.failure_count >= self.failure_threshold:
                    self.state = 'OPEN'

            raise


# Usage
circuit_breaker = DatadogCircuitBreaker()


def safe_api_call(func, *args, **kwargs):
    """Make API call with circuit breaker protection."""
    try:
        return circuit_breaker.call(func, *args, **kwargs)
    except DatadogException as e:
        print(f"API call failed (circuit breaker): {e}")
        return None


# Protected API calls
event = safe_api_call(
    api.Event.create,
    title="Test Event",
    text="Circuit breaker test",
)

monitors = safe_api_call(api.Monitor.get_all)
```

### Comprehensive Error Logging

```python
import logging

from datadog import initialize, api
from datadog.api.exceptions import (
    ApiError,
    ApiNotInitialized,
    ClientError,
    HttpBackoff,
    HttpTimeout,
    HTTPError,
)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize with error suppression disabled
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    mute=False,
    return_raw_response=True,
)


def log_datadog_error(operation, exception, **context):
    """Comprehensive error logging for Datadog operations."""

    error_details = {
        'operation': operation,
        'exception_type': type(exception).__name__,
        'error_message': str(exception),
        'context': context,
    }

    if isinstance(exception, ApiError) and exception.args:
        # ApiError carries the decoded response body, e.g. {"errors": [...]};
        # HTTPError includes its status code and reason in the message instead
        error_details['response_body'] = exception.args[0]

    if isinstance(exception, (HttpTimeout, HttpBackoff)):
        error_details['retry_recommended'] = True

    logger.error(f"Datadog operation failed: {error_details}")

    # Log the full traceback for debugging
    logger.debug("Full traceback:", exc_info=exception)


def robust_datadog_operation(operation_func, operation_name, **kwargs):
    """Execute Datadog operation with comprehensive error handling."""

    try:
        result = operation_func(**kwargs)
        logger.info(f"Datadog operation succeeded: {operation_name}")
        return result

    except ApiNotInitialized as e:
        log_datadog_error(operation_name, e, **kwargs)
        raise  # Re-raise as this is a configuration issue

    except HttpTimeout as e:
        log_datadog_error(operation_name, e, **kwargs)
        # Could implement retry logic here
        raise

    except HTTPError as e:
        # Non-2xx response outside ApiError, typically 5xx
        log_datadog_error(operation_name, e, **kwargs)
        logger.warning("Server error - may be temporary")
        raise

    except ApiError as e:
        # 400/401/403/404/409/429: check API keys, permissions and parameters,
        # or back off if rate limited
        log_datadog_error(operation_name, e, **kwargs)
        logger.error("Request rejected - check API keys, permissions and parameters")
        raise

    except ClientError as e:
        log_datadog_error(operation_name, e, **kwargs)
        raise

    except Exception as e:
        log_datadog_error(operation_name, e, **kwargs)
        logger.error(f"Unexpected error in Datadog operation: {e}")
        raise


# Usage examples
try:
    monitor = robust_datadog_operation(
        api.Monitor.create,
        "create_monitor",
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{*} > 80",
        name="High CPU usage",
    )
except Exception:
    print("Monitor creation failed - check logs")

try:
    events = robust_datadog_operation(
        api.Event.query,
        "query_events",
        start=1234567890,
        end=1234567899,
    )
except Exception:
    print("Event query failed - check logs")
```

### StatsD Error Resilience Patterns

```python
import logging
import socket

from datadog import statsd

logger = logging.getLogger(__name__)


def resilient_statsd_submit(metric_name, value, **kwargs):
    """Submit a StatsD gauge without letting errors reach the caller."""

    try:
        statsd.gauge(metric_name, value, **kwargs)
        return True

    except socket.error as e:  # socket.error is an alias of OSError
        logger.warning(f"StatsD socket error: {e}")
        # StatsD errors shouldn't block the application
        return False

    except Exception as e:
        logger.warning(f"Unexpected StatsD error: {e}")
        return False


def batch_statsd_with_recovery(metrics_batch):
    """Submit a batch of StatsD metrics with per-metric error recovery."""

    success_count = 0

    for metric in metrics_batch:
        try:
            if metric['type'] == 'gauge':
                statsd.gauge(metric['name'], metric['value'], tags=metric.get('tags'))
            elif metric['type'] == 'increment':
                statsd.increment(metric['name'], metric['value'], tags=metric.get('tags'))
            elif metric['type'] == 'timing':
                statsd.timing(metric['name'], metric['value'], tags=metric.get('tags'))
            else:
                # Don't count unsupported types as successes
                logger.warning(f"Unsupported metric type {metric['type']!r} for {metric['name']}")
                continue

            success_count += 1

        except Exception as e:
            logger.warning(f"Failed to submit metric {metric['name']}: {e}")
            # Continue with remaining metrics
            continue

    logger.info(f"Submitted {success_count}/{len(metrics_batch)} metrics successfully")
    return success_count


# Usage
metrics = [
    {'type': 'gauge', 'name': 'system.cpu.usage', 'value': 75.0, 'tags': ['host:web01']},
    {'type': 'increment', 'name': 'web.requests', 'value': 1, 'tags': ['endpoint:/api']},
    {'type': 'timing', 'name': 'db.query.time', 'value': 150, 'tags': ['table:users']},
]

batch_statsd_with_recovery(metrics)
```
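
The `if`/`elif` chain above can also be written as a lookup on the client's method names, which makes the set of supported types explicit. A sketch, assuming a client exposing DogStatsd-style `gauge`/`increment`/`histogram`/`timing(name, value, tags=...)` methods; `submit_batch` and `SUPPORTED_TYPES` are hypothetical, and the client is passed in so the function can be exercised with a stub.

```python
import logging
from typing import Any, Dict, Iterable

logger = logging.getLogger(__name__)

# Metric types this sketch forwards; each name is also the client method called
SUPPORTED_TYPES = {"gauge", "increment", "histogram", "timing"}


def submit_batch(client: Any, metrics: Iterable[Dict[str, Any]]) -> int:
    """Submit metrics through a statsd-like client; return how many were sent.

    Unsupported types and per-metric exceptions are logged and skipped.
    """
    sent = 0
    for metric in metrics:
        metric_type = metric.get("type")
        if metric_type not in SUPPORTED_TYPES:
            logger.warning("Skipping %s: unsupported type %r", metric.get("name"), metric_type)
            continue
        try:
            send = getattr(client, metric_type)
            send(metric["name"], metric["value"], tags=metric.get("tags"))
            sent += 1
        except Exception as exc:
            logger.warning("Failed to submit metric %s: %s", metric.get("name"), exc)
    return sent
```

With the real client this would be called as `submit_batch(statsd, metrics)`.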

## Error Handling Best Practices

### Appropriate Error Suppression

```python
import os

from datadog import initialize

# Choose one configuration per environment.

# Production: suppress errors so Datadog failures cannot crash the application
initialize(
    api_key=os.environ['DATADOG_API_KEY'],
    app_key=os.environ['DATADOG_APP_KEY'],
    mute=True,  # Log instead of raising in production
)

# Development: raise errors for debugging
initialize(
    api_key="dev-api-key",
    app_key="dev-app-key",
    mute=False,  # Surface all errors during development
)
```
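
Rather than maintaining two `initialize()` calls, the `mute` flag can be derived from the environment. A minimal sketch, assuming a hypothetical `APP_ENV` variable where any value other than `development` or `test` is treated as production; `mute_for_environment` is likewise hypothetical.

```python
import os
from typing import Mapping, Optional


def mute_for_environment(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True (suppress errors) unless running in development or test.

    Reads the hypothetical APP_ENV variable; a missing value counts as production.
    """
    env = os.environ if env is None else env
    return env.get("APP_ENV", "production").lower() not in {"development", "test"}
```

Usage: `initialize(api_key=..., app_key=..., mute=mute_for_environment())`. Defaulting to production means a forgotten variable errs on the side of not crashing.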

### Monitoring and Alerting Resilience

```python
import logging

from datadog import api, statsd
from datadog.api.exceptions import DatadogException

logger = logging.getLogger(__name__)


# Critical application paths should not fail because of metrics
def submit_critical_metric(metric_name, value):
    try:
        statsd.gauge(metric_name, value)
    except Exception:
        # Never let metrics submission crash critical application flow.
        # Catching Exception (not a bare except) lets KeyboardInterrupt
        # and SystemExit still propagate.
        pass


# Non-critical operations can have explicit error handling
def create_dashboard_with_handling(dashboard_config):
    try:
        return api.Dashboard.create(**dashboard_config)
    except DatadogException as e:
        logger.error(f"Dashboard creation failed: {e}")
        return None  # Graceful degradation
```
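
The "never let it crash" rule can be applied uniformly with a decorator instead of repeating `try`/`except` in every helper. A sketch; `never_raise` is a hypothetical name. Unlike the silent `pass` above, it logs each failure with its traceback, which may be noisy for high-frequency metrics.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def never_raise(default: Any = None) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator: log any Exception from the wrapped call and return `default`."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception logs at ERROR level and includes the traceback
                logger.exception("%s failed; continuing", func.__name__)
                return default
        return wrapper
    return decorator
```

For example, decorating `submit_critical_metric` with `@never_raise()` gives the same guarantee while leaving a record of each dropped metric.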

### Retry Strategy Guidelines

```python
from datadog.api.exceptions import (
    ApiError, ApiNotInitialized, HttpBackoff, HttpTimeout, HTTPError,
)

# Retry on transient errors (HTTPError: non-2xx outside ApiError, typically 5xx;
# after HttpBackoff, wait out the backoff period first)
RETRYABLE_ERRORS = (HttpTimeout, HttpBackoff, HTTPError)

# Don't retry on configuration or request errors
# (ApiError also covers 429, but does not expose the status code)
NON_RETRYABLE_ERRORS = (ApiNotInitialized, ApiError)

def should_retry(error):
    return isinstance(error, RETRYABLE_ERRORS)
```
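
A tuple such as `RETRYABLE_ERRORS` can be passed straight to an `except` clause, which makes a generic retry wrapper short. A sketch under stated assumptions: `retry_call` is hypothetical, uses plain exponential backoff, and takes an injectable `sleep` so it can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on `retryable` exceptions with exponential backoff.

    Other exceptions propagate immediately; after max_attempts the last
    retryable exception is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    raise AssertionError("unreachable")
```

Usage: `retry_call(api.Monitor.get_all, RETRYABLE_ERRORS)`; anything in `NON_RETRYABLE_ERRORS` fails on the first attempt.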