# Error Handling

Comprehensive exception hierarchy and error handling strategies for managing API errors, network issues, authentication problems, and client-side errors with appropriate retry and recovery mechanisms.

## Capabilities

### Exception Hierarchy

Well-structured exception classes providing specific error information for different failure scenarios.

```python { .api }
class DatadogException(Exception):
    """
    Base exception class for all Datadog-related errors.

    Attributes:
    - message (str): Error description
    - code (int): HTTP status code (when applicable)
    """

class ApiError(DatadogException):
    """
    API-specific errors including authentication failures and invalid requests.

    Raised when:
    - Invalid API key or application key
    - Malformed API requests
    - API rate limiting
    - Resource not found
    - Permission denied
    """

class ClientError(DatadogException):
    """
    Client-side errors related to HTTP communication and network issues.

    Base class for:
    - Connection failures
    - Timeout errors
    - Proxy errors
    - SSL/TLS errors
    """

class HttpTimeout(ClientError):
    """
    Request timeout errors when API calls exceed configured timeout.

    Raised when:
    - API requests take longer than specified timeout
    - Network latency causes delays
    - Datadog API is experiencing high load
    """

class HttpBackoff(ClientError):
    """
    Backoff errors indicating temporary API unavailability.

    Raised when:
    - API returns 5xx server errors
    - Rate limiting triggers backoff
    - Temporary service disruptions
    """

class HTTPError(ClientError):
    """
    HTTP response errors for non-2xx status codes.

    Attributes:
    - status_code (int): HTTP status code
    - response (object): Raw HTTP response object

    Raised for:
    - 400 Bad Request
    - 401 Unauthorized
    - 403 Forbidden
    - 404 Not Found
    - 429 Too Many Requests
    - 5xx Server Errors
    """

class ProxyError(ClientError):
    """
    Proxy connection and configuration errors.

    Raised when:
    - Proxy server is unreachable
    - Proxy authentication fails
    - Invalid proxy configuration
    """

class ApiNotInitialized(ApiError):
    """
    Error when attempting API calls without proper initialization.

    Raised when:
    - API key not configured
    - Application key not configured
    - initialize() not called before API usage
    """
```

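Because the classes above form a tree, the order of `except` clauses matters: a subclass must be caught before its base class, or the more general handler wins. The following is a minimal sketch of that rule. The exception classes here are local stand-ins that only mirror the hierarchy documented above (so the snippet runs without the `datadog` package), and `classify` is a hypothetical helper, not part of the library.

```python
# Stand-in classes mirroring the documented hierarchy (assumption: the real
# classes live in datadog.api.exceptions and relate to each other this way).
class DatadogException(Exception):
    pass


class ApiError(DatadogException):
    pass


class ClientError(DatadogException):
    pass


class HttpTimeout(ClientError):
    pass


class HTTPError(ClientError):
    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


class ApiNotInitialized(ApiError):
    pass


def classify(exc):
    """Return a coarse category, checking subclasses before base classes."""
    try:
        raise exc
    except ApiNotInitialized:      # subclass of ApiError, so it goes first
        return "configuration"
    except HttpTimeout:            # subclass of ClientError
        return "timeout"
    except HTTPError as e:         # subclass of ClientError
        return f"http-{e.status_code}"
    except ApiError:
        return "api"
    except ClientError:
        return "client"
    except DatadogException:       # catch-all for anything Datadog-related
        return "datadog"


print(classify(HttpTimeout("slow")))  # timeout
print(classify(HTTPError(429)))       # http-429
```

If `except ClientError` were listed before `except HttpTimeout`, every timeout would be reported as a generic client error.
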
### Error Suppression Control

Configure error handling behavior through the `mute` parameter and global settings.

```python { .api }
# Global error suppression setting (configured via initialize())
# api._mute (bool): When True, suppresses ApiError and ClientError exceptions

# Error suppression affects:
# - API method calls (api.Event.create, api.Monitor.get, etc.)
# - HTTP client errors (timeouts, connection failures)
# - Authentication and authorization errors

# Errors are still logged, but not raised, when mute=True
```

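The "logged but not raised" behavior described above can be pictured as a thin wrapper around each call. This is only an illustrative sketch of the idea, not the library's internal implementation: `call_with_mute` and `failing_call` are hypothetical names, and `DatadogException` is a local stand-in so the snippet runs on its own.

```python
import logging

logger = logging.getLogger("datadog_sketch")


class DatadogException(Exception):
    """Stand-in for datadog.api.exceptions.DatadogException."""


def call_with_mute(func, *args, mute=True, **kwargs):
    """Run func; when mute is True, log Datadog errors and return None."""
    try:
        return func(*args, **kwargs)
    except DatadogException as exc:
        # The error is always logged, whether or not it is re-raised.
        logger.error("Datadog call failed: %s", exc)
        if mute:
            return None
        raise


def failing_call():
    """Hypothetical API call that always fails."""
    raise DatadogException("simulated API failure")


print(call_with_mute(failing_call, mute=True))  # None
```

With `mute=False` the same call re-raises the exception, which is the mode the usage examples in this document assume.
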
### StatsD Error Resilience

StatsD operations are designed to be fire-and-forget with built-in error resilience.

```python { .api }
# StatsD error handling characteristics:
# - UDP transport failures are silently ignored
# - Socket errors don't interrupt application flow
# - Network issues don't block metric submission
# - Malformed metrics are dropped without errors

# StatsD errors that may occur:
# - Socket creation failures
# - DNS resolution errors for statsd_host
# - Permission errors for Unix Domain Sockets
# - Network unreachable errors
```

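To make "fire-and-forget" concrete, the sketch below sends a single gauge over UDP with the standard `socket` module and swallows transport errors. It is an illustration of the pattern, not the client's actual code: `format_gauge` and `send_fire_and_forget` are hypothetical helpers, and the host and port (`127.0.0.1:8125`) are assumed defaults for a local agent.

```python
import socket


def format_gauge(name, value, tags=None):
    """Build a gauge datagram of the form <name>:<value>|g[|#tag1,tag2]."""
    payload = f"{name}:{value}|g"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_fire_and_forget(payload, host="127.0.0.1", port=8125):
    """Send one UDP datagram; report failure instead of raising."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload.encode("utf-8"), (host, port))
        return True
    except OSError:
        # OSError covers socket creation failures, DNS resolution errors
        # (socket.gaierror) and network-unreachable errors, so the caller
        # is never interrupted by a metrics problem.
        return False


datagram = format_gauge("system.cpu.usage", 75.0, ["host:web01"])
print(datagram)  # system.cpu.usage:75.0|g|#host:web01
send_fire_and_forget(datagram)
```

UDP gives no delivery guarantee, so a `True` return only means the datagram was handed to the operating system, not that an agent received it.
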
## Usage Examples

### Basic Error Handling

```python
from datadog import initialize, api
from datadog.api.exceptions import ApiError, ClientError, ApiNotInitialized

# Configure with error suppression disabled for explicit handling
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    mute=False  # Raise exceptions instead of suppressing them
)

try:
    # API call that might fail
    monitor = api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{*} > 80",
        name="High CPU usage"
    )
    print(f"Monitor created with ID: {monitor['id']}")

# ApiNotInitialized is a subclass of ApiError, so it must be caught first
except ApiNotInitialized:
    print("ERROR: Datadog not properly initialized")

except ApiError as e:
    print(f"API Error: {e}")
    # Handle authentication, permission, or API-specific errors

except ClientError as e:
    print(f"Client Error: {e}")
    # Handle network, timeout, or connection errors

except Exception as e:
    print(f"Unexpected error: {e}")
```

### Specific Exception Handling

```python
import time

from datadog import api
from datadog.api.exceptions import HttpTimeout, HTTPError, ApiError

def create_monitor_with_retry(monitor_config, max_retries=3):
    """Create monitor with retry logic for different error types."""

    for attempt in range(max_retries):
        try:
            return api.Monitor.create(**monitor_config)

        except HttpTimeout:
            if attempt < max_retries - 1:
                print(f"Timeout on attempt {attempt + 1}, retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
                continue
            else:
                print("Failed after maximum timeout retries")
                raise

        except HTTPError as e:
            if e.status_code == 429:  # Rate limiting
                if attempt < max_retries - 1:
                    print("Rate limited, waiting before retry...")
                    time.sleep(60)  # Wait 1 minute for rate limit reset
                    continue
            elif e.status_code >= 500:  # Server errors
                if attempt < max_retries - 1:
                    print(f"Server error {e.status_code}, retrying...")
                    time.sleep(5)
                    continue
            # Non-retryable status code, or retries exhausted
            print(f"HTTP Error {e.status_code}: {e}")
            raise

        except ApiError as e:
            # Don't retry authentication or permission errors
            print(f"API Error (not retryable): {e}")
            raise

# Usage
monitor_config = {
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
    "name": "High CPU usage"
}

try:
    monitor = create_monitor_with_retry(monitor_config)
    print(f"Monitor created: {monitor['id']}")
except Exception as e:
    print(f"Failed to create monitor: {e}")
```

### Error Handling with Raw Response Access

```python
from datadog import initialize, api
from datadog.api.exceptions import HTTPError

# Configure to include raw HTTP responses
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    return_raw_response=True,
    mute=False
)

try:
    result = api.Event.create(
        title="Test Event",
        text="Testing error handling"
    )

    # With return_raw_response=True, result includes:
    # - Decoded response data
    # - Raw HTTP response object
    print(f"Event created: {result[0]['event']['id']}")
    print(f"Status code: {result[1].status_code}")
    print(f"Response headers: {result[1].headers}")

except HTTPError as e:
    print(f"HTTP Status: {e.status_code}")
    print(f"Response body: {e.response.text}")
    print(f"Request headers: {e.response.request.headers}")

    # Handle specific HTTP status codes
    if e.status_code == 400:
        print("Bad request - check your parameters")
    elif e.status_code == 401:
        print("Unauthorized - check your API key")
    elif e.status_code == 403:
        print("Forbidden - check your permissions")
    elif e.status_code == 404:
        print("Resource not found")
```

### Graceful Degradation Pattern

```python
from datadog import api, statsd
from datadog.api.exceptions import DatadogException
import logging

logger = logging.getLogger(__name__)

def submit_metrics_with_fallback(metrics_data):
    """Submit metrics with graceful degradation."""

    # Primary: Try API submission for persistent metrics
    try:
        api.Metric.send(**metrics_data)
        logger.info("Metrics submitted via API")
        return True

    except DatadogException as e:
        logger.warning(f"API submission failed: {e}")

    # Fallback: Use StatsD for real-time metrics
    try:
        statsd.gauge(
            metrics_data['metric'],
            metrics_data['points'][-1][1],  # Latest value
            tags=metrics_data.get('tags', [])
        )
        logger.info("Metrics submitted via StatsD fallback")
        return True

    except Exception as e:
        logger.error(f"StatsD fallback failed: {e}")
        return False

def create_monitor_with_fallback(monitor_config):
    """Create monitor with fallback to simplified configuration."""

    try:
        # Try creating monitor with full configuration
        return api.Monitor.create(**monitor_config)

    except DatadogException as e:
        logger.warning(f"Full monitor creation failed: {e}")

        # Fallback: Create simplified monitor
        simplified_config = {
            'type': monitor_config['type'],
            'query': monitor_config['query'],
            'name': f"[Simplified] {monitor_config['name']}"
        }

        try:
            return api.Monitor.create(**simplified_config)
        except DatadogException as e:
            logger.error(f"Simplified monitor creation failed: {e}")
            raise
```

### Circuit Breaker Pattern

```python
from datadog import api
from datadog.api.exceptions import DatadogException
import time
from threading import Lock

class DatadogCircuitBreaker:
    """Circuit breaker for Datadog API calls."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection."""

        with self.lock:
            if self.state == 'OPEN':
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    # Recovery window elapsed: allow one trial call through
                    self.state = 'HALF_OPEN'
                else:
                    raise DatadogException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)

            with self.lock:
                # Success resets failure count
                self.failure_count = 0
                if self.state == 'HALF_OPEN':
                    self.state = 'CLOSED'

            return result

        except DatadogException:
            with self.lock:
                self.failure_count += 1
                self.last_failure_time = time.time()

                if self.failure_count >= self.failure_threshold:
                    self.state = 'OPEN'

            raise

# Usage
circuit_breaker = DatadogCircuitBreaker()

def safe_api_call(func, *args, **kwargs):
    """Make API call with circuit breaker protection."""
    try:
        return circuit_breaker.call(func, *args, **kwargs)
    except DatadogException as e:
        print(f"API call failed (circuit breaker): {e}")
        return None

# Protected API calls
event = safe_api_call(
    api.Event.create,
    title="Test Event",
    text="Circuit breaker test"
)

monitors = safe_api_call(api.Monitor.get_all)
```

### Comprehensive Error Logging

```python
from datadog import initialize, api
from datadog.api.exceptions import (
    ApiError,
    ApiNotInitialized,
    ClientError,
    HTTPError,
    HttpBackoff,
    HttpTimeout,
)
import logging
import traceback

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize with error suppression disabled
initialize(
    api_key="your-api-key",
    app_key="your-app-key",
    mute=False,
    return_raw_response=True
)

def log_datadog_error(operation, exception, **context):
    """Comprehensive error logging for Datadog operations."""

    error_details = {
        'operation': operation,
        'exception_type': type(exception).__name__,
        'error_message': str(exception),
        'context': context
    }

    if isinstance(exception, HTTPError):
        # Look attributes up defensively in case no response is attached
        response = getattr(exception, 'response', None)
        request = getattr(response, 'request', None)
        error_details.update({
            'status_code': exception.status_code,
            'response_body': getattr(response, 'text', 'N/A'),
            'request_url': getattr(request, 'url', 'N/A'),
            'request_method': getattr(request, 'method', 'N/A')
        })

    if isinstance(exception, (HttpTimeout, HttpBackoff)):
        error_details['retry_recommended'] = True

    logger.error(f"Datadog operation failed: {error_details}")

    # Log full traceback for debugging (only emitted at DEBUG level)
    logger.debug(f"Full traceback: {traceback.format_exc()}")

def robust_datadog_operation(operation_func, operation_name, **kwargs):
    """Execute Datadog operation with comprehensive error handling."""

    try:
        result = operation_func(**kwargs)
        logger.info(f"Datadog operation succeeded: {operation_name}")
        return result

    except ApiNotInitialized as e:
        log_datadog_error(operation_name, e, **kwargs)
        raise  # Re-raise as this is a configuration issue

    except HttpTimeout as e:
        log_datadog_error(operation_name, e, **kwargs)
        # Could implement retry logic here
        raise

    except HTTPError as e:
        log_datadog_error(operation_name, e, **kwargs)

        if e.status_code == 401:
            logger.critical("Authentication failed - check API keys")
        elif e.status_code == 403:
            logger.critical("Authorization failed - check permissions")
        elif e.status_code == 429:
            logger.warning("Rate limited - implement backoff")
        elif e.status_code >= 500:
            logger.warning("Server error - may be temporary")

        raise

    except ApiError as e:
        log_datadog_error(operation_name, e, **kwargs)
        raise

    except ClientError as e:
        log_datadog_error(operation_name, e, **kwargs)
        raise

    except Exception as e:
        log_datadog_error(operation_name, e, **kwargs)
        logger.error(f"Unexpected error in Datadog operation: {e}")
        raise

# Usage examples
try:
    monitor = robust_datadog_operation(
        api.Monitor.create,
        "create_monitor",
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{*} > 80",
        name="High CPU usage"
    )
except Exception:
    print("Monitor creation failed - check logs")

try:
    events = robust_datadog_operation(
        api.Event.query,
        "query_events",
        start=1234567890,
        end=1234567899
    )
except Exception:
    print("Event query failed - check logs")
```

### StatsD Error Resilience Patterns

```python
from datadog import statsd
import logging
import socket

logger = logging.getLogger(__name__)

def resilient_statsd_submit(metric_name, value, **kwargs):
    """Submit StatsD metric with error resilience."""

    try:
        statsd.gauge(metric_name, value, **kwargs)
        return True

    except socket.error as e:
        logger.warning(f"StatsD socket error: {e}")
        # StatsD errors shouldn't block application
        return False

    except Exception as e:
        logger.warning(f"Unexpected StatsD error: {e}")
        return False

def batch_statsd_with_recovery(metrics_batch):
    """Submit batch of StatsD metrics with individual error recovery."""

    success_count = 0

    for metric in metrics_batch:
        try:
            if metric['type'] == 'gauge':
                statsd.gauge(metric['name'], metric['value'], tags=metric.get('tags'))
            elif metric['type'] == 'increment':
                statsd.increment(metric['name'], metric['value'], tags=metric.get('tags'))
            elif metric['type'] == 'timing':
                statsd.timing(metric['name'], metric['value'], tags=metric.get('tags'))
            else:
                # Unknown types must not be counted as submitted
                raise ValueError(f"Unsupported metric type: {metric['type']}")

            success_count += 1

        except Exception as e:
            # Log and continue with the remaining metrics
            logger.warning(f"Failed to submit metric {metric.get('name')}: {e}")

    logger.info(f"Submitted {success_count}/{len(metrics_batch)} metrics successfully")
    return success_count

# Usage
metrics = [
    {'type': 'gauge', 'name': 'system.cpu.usage', 'value': 75.0, 'tags': ['host:web01']},
    {'type': 'increment', 'name': 'web.requests', 'value': 1, 'tags': ['endpoint:/api']},
    {'type': 'timing', 'name': 'db.query.time', 'value': 150, 'tags': ['table:users']}
]

batch_statsd_with_recovery(metrics)
```

## Error Handling Best Practices

### Appropriate Error Suppression

```python
import os

from datadog import initialize

# Use one of the two configurations below, depending on the environment.

# Production: Suppress errors to prevent application crashes
initialize(
    api_key=os.environ['DATADOG_API_KEY'],
    app_key=os.environ['DATADOG_APP_KEY'],
    mute=True  # Suppress exceptions in production
)

# Development: Enable errors for debugging
initialize(
    api_key="dev-api-key",
    app_key="dev-app-key",
    mute=False  # Show all errors during development
)
```

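Rather than keeping two hard-coded `initialize()` calls, the production/development split above can be driven by a single environment variable. The sketch below shows one way to derive the `mute` value. `APP_ENV` and `mute_for_environment` are hypothetical names chosen for illustration, and defaulting to production behavior is an assumption, not a library convention.

```python
import os


def mute_for_environment(env_name):
    """Suppress exceptions everywhere except development and test."""
    return env_name.lower() not in {"development", "dev", "test"}


# APP_ENV is a hypothetical variable name; an unset value is treated as
# production so that errors are suppressed by default.
mute = mute_for_environment(os.environ.get("APP_ENV", "production"))
print(f"mute={mute}")
```

The resulting value can then be passed as `mute=mute` to a single `initialize()` call.
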
### Monitoring and Alerting Resilience

```python
import logging

from datadog import api, statsd
from datadog.api.exceptions import DatadogException

logger = logging.getLogger(__name__)

# Critical monitoring should not fail application
def submit_critical_metric(metric_name, value):
    try:
        statsd.gauge(metric_name, value)
    except Exception:
        # Never let metrics submission crash critical application flow.
        # Catching Exception (not a bare except) keeps Ctrl-C and SystemExit working.
        pass

# Non-critical operations can have explicit error handling
def create_dashboard_with_handling(dashboard_config):
    try:
        return api.Dashboard.create(**dashboard_config)
    except DatadogException as e:
        logger.error(f"Dashboard creation failed: {e}")
        return None  # Graceful degradation
```

### Retry Strategy Guidelines

```python
from datadog.api.exceptions import ApiNotInitialized, HttpBackoff, HttpTimeout

# Retry on transient errors
RETRYABLE_ERRORS = (HttpTimeout, HttpBackoff)

# Don't retry on permanent errors
NON_RETRYABLE_ERRORS = (ApiNotInitialized,)

# Conditional retry on HTTP errors
def should_retry_http_error(http_error):
    return http_error.status_code in (429, 500, 502, 503, 504)
```
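
The guidelines above can be combined into one generic helper. The following is a minimal sketch, assuming exponential backoff with random jitter as the waiting strategy. `retry_call`, `is_retryable`, and `flaky` are hypothetical names, and the exception classes are local stand-ins for the library's `HttpTimeout` and `HTTPError` so the snippet runs without the `datadog` package.

```python
import random
import time


class HttpTimeout(Exception):
    """Stand-in for datadog.api.exceptions.HttpTimeout."""


class HTTPError(Exception):
    """Stand-in carrying the status_code attribute documented above."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def is_retryable(exc):
    """Transient failures: timeouts and the retryable HTTP status codes."""
    if isinstance(exc, HttpTimeout):
        return True
    return isinstance(exc, HTTPError) and exc.status_code in RETRYABLE_STATUS


def retry_call(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


attempts = []


def flaky():
    """Hypothetical call that fails twice with 503, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise HTTPError(503)
    return "ok"


# The sleep function is injected so the example runs instantly.
print(retry_call(flaky, sleep=lambda seconds: None))  # ok
```

Permanent failures such as a 404 are re-raised on the first attempt, so only transient errors consume the retry budget.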