# System health monitoring, alerts, and error tracking
Monitor system health, track errors, and receive alerts when issues occur.

## Commands
```
/monitor start                            # Start monitoring
/monitor stop                             # Stop monitoring
/monitor status                           # Check monitoring status
/monitor health                           # Run health check
/monitor health --verbose                 # Detailed health info
/monitor providers                        # Check LLM provider status
/monitor alerts                           # View recent alerts
/monitor alerts --unread                  # Unread alerts only
/monitor alert-targets                    # View alert destinations
/monitor alert-targets add email <addr>   # Add email target
/monitor alert-targets add webhook <url>  # Add webhook target
/monitor alert-targets remove <id>        # Remove target
/monitor config                           # View config
/monitor cooldown 300                     # Set alert cooldown (seconds)
/monitor threshold cpu 80                 # Set CPU alert threshold
/monitor threshold memory 90              # Set memory threshold
```

## Setup

```ts
import { createMonitoringService } from 'clodds/monitoring';

const monitor = createMonitoringService({
  // Health check interval
  intervalMs: 60000, // 1 minute

  // Alert targets
  alertTargets: [
    { type: 'email', address: 'alerts@example.com' },
    { type: 'webhook', url: 'https://hooks.example.com/alerts' },
  ],

  // Alert cooldown (prevent spam)
  alertCooldownMs: 300000, // 5 minutes

  // Thresholds
  thresholds: {
    cpu: 80, // Alert at 80% CPU
    memory: 90, // Alert at 90% memory
    errorRate: 10, // Alert at 10% error rate
  },
});
```

## Start and stop

```ts
// Start monitoring
await monitor.start();
// Check if running
const isRunning = monitor.isRunning();
// Stop monitoring
await monitor.stop();
```
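
The lifecycle calls above fit naturally into process shutdown handling. A minimal sketch, assuming only the `start()`, `stop()`, and `isRunning()` calls shown here plus standard Node.js signal handlers:

```ts
// Stop monitoring cleanly before the process exits, so an in-flight
// health-check cycle isn't cut off mid-write.
for (const signal of ['SIGINT', 'SIGTERM'] as const) {
  process.on(signal, async () => {
    if (monitor.isRunning()) {
      await monitor.stop();
    }
    process.exit(0);
  });
}
```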

## Health checks

```ts
// Run health check
const health = await monitor.runHealthCheck();
console.log(`Overall: ${health.status}`); // 'healthy' | 'degraded' | 'unhealthy'
console.log('\nSystem:');
console.log(` CPU: ${health.system.cpu}%`);
console.log(` Memory: ${health.system.memory}%`);
console.log(` Disk: ${health.system.disk}%`);
console.log('\nProviders:');
for (const [name, status] of Object.entries(health.providers)) {
  console.log(`  ${name}: ${status.status} (${status.latencyMs}ms)`);
}

console.log('\nServices:');
for (const [name, status] of Object.entries(health.services)) {
  console.log(`  ${name}: ${status.status}`);
}
```
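
A health check composes naturally with an HTTP readiness probe. A sketch reusing the `monitor` instance from the setup section and Node's built-in `http` module; the `/healthz` route and port are illustrative, not part of the library:

```ts
import http from 'node:http';

// Report 200 while healthy, 503 otherwise, so an orchestrator
// (e.g. Kubernetes) can restart or reroute the process.
http.createServer(async (req, res) => {
  if (req.url !== '/healthz') {
    res.writeHead(404).end();
    return;
  }
  const health = await monitor.runHealthCheck();
  res.writeHead(health.status === 'healthy' ? 200 : 503, {
    'content-type': 'application/json',
  });
  res.end(JSON.stringify({ status: health.status, system: health.system }));
}).listen(8080);
```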

## Provider status

```ts
// Check LLM provider status
const providers = await monitor.checkProviders();
for (const provider of providers) {
  console.log(`${provider.name}:`);
  console.log(`  Status: ${provider.status}`);
  console.log(`  Latency: ${provider.latencyMs}ms`);
  console.log(`  Last error: ${provider.lastError || 'none'}`);
  console.log(`  Error rate: ${provider.errorRate}%`);
}
```
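
The per-provider fields are enough to pick a preferred provider at request time. A sketch using only the `name`, `latencyMs`, and `errorRate` fields shown above; the 5% cutoff is an arbitrary example, not a library default:

```ts
// Prefer the lowest-latency provider whose recent error rate is tolerable.
const candidates = (await monitor.checkProviders())
  .filter((p) => p.errorRate < 5)
  .sort((a, b) => a.latencyMs - b.latencyMs);

console.log(`Preferred provider: ${candidates[0]?.name ?? 'none available'}`);
```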

## Alerts

```ts
// Get recent alerts
const alerts = await monitor.getAlerts({ limit: 10 });
for (const alert of alerts) {
  console.log(`[${alert.severity}] ${alert.title}`);
  console.log(`  ${alert.message}`);
  console.log(`  Time: ${alert.timestamp}`);
  console.log(`  Acknowledged: ${alert.acknowledged}`);
}
// Acknowledge alert
await monitor.acknowledgeAlert(alertId);
// Get unread count
const unread = await monitor.getUnreadAlertCount();
```
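
Unread alerts can be cleared in bulk by combining `getAlerts()` with `acknowledgeAlert()`. This sketch assumes each alert object carries an `id` field matching the `alertId` that `acknowledgeAlert()` expects, which the snippets above don't show explicitly:

```ts
// Acknowledge everything that hasn't been seen yet.
const recent = await monitor.getAlerts({ limit: 100 });
for (const alert of recent) {
  if (!alert.acknowledged) {
    await monitor.acknowledgeAlert(alert.id); // assumed `id` field
  }
}
```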

## Alert targets

```ts
// Add alert target
await monitor.addAlertTarget({
  type: 'email',
  address: 'team@example.com',
});
await monitor.addAlertTarget({
  type: 'webhook',
  url: 'https://hooks.slack.com/...',
});
// List targets
const targets = monitor.getAlertTargets();
// Remove target
await monitor.removeAlertTarget(targetId);
```
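
On the receiving side, a webhook target presumably gets an HTTP POST per alert. The payload shape isn't documented here, so this receiver sketch assumes a JSON body with the same fields as the alert objects above (`severity`, `title`, `message`); log the raw body until you've verified the actual contract:

```ts
import http from 'node:http';

// Minimal webhook receiver; the payload shape is an assumption.
http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    const alert = JSON.parse(body); // assumed JSON payload
    console.log(`[${alert.severity}] ${alert.title}: ${alert.message}`);
    res.writeHead(204).end();
  });
}).listen(9000);
```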

## Events

```ts
// Listen for events
monitor.on('alert', (alert) => {
  console.log(`🚨 Alert: ${alert.title}`);
});

monitor.on('healthCheck', (health) => {
  if (health.status !== 'healthy') {
    console.log(`⚠️ System ${health.status}`);
  }
});

monitor.on('providerDown', (provider) => {
  console.log(`❌ Provider down: ${provider.name}`);
});

monitor.on('providerRecovered', (provider) => {
  console.log(`✅ Provider recovered: ${provider.name}`);
});
```
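
The `alert` event also works as a hook for durable logging. A sketch that appends one JSON line per alert, using only the event shown above and Node's `fs` module:

```ts
import { appendFileSync } from 'node:fs';

// Append each alert as a JSON line for later grep or ingestion.
monitor.on('alert', (alert) => {
  appendFileSync('alerts.jsonl', JSON.stringify(alert) + '\n');
});
```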

## Manual alerts

```ts
// Send manual alert
await monitor.sendAlert({
  severity: 'warning', // 'info' | 'warning' | 'error' | 'critical'
  title: 'Custom Alert',
  message: 'Something important happened',
  metadata: { key: 'value' },
});
```
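
A common use for manual alerts is surfacing failures in code the monitor can't see, such as a scheduled job. A sketch using only `sendAlert()` as shown above; the job itself is a placeholder:

```ts
// Wrap a batch job so its failures flow through the same alert pipeline.
async function runNightlyExport() {
  /* placeholder job body */
}

try {
  await runNightlyExport();
} catch (err) {
  await monitor.sendAlert({
    severity: 'error',
    title: 'Nightly export failed',
    message: err instanceof Error ? err.message : String(err),
    metadata: { job: 'nightly-export' },
  });
  throw err; // still fail loudly after alerting
}
```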

## Alert types

| Type | Trigger |
|---|---|
| provider_down | LLM provider not responding |
| high_cpu | CPU usage above threshold |
| high_memory | Memory usage above threshold |
| high_error_rate | Error rate above threshold |
| unhandled_exception | Uncaught exception |
| unhandled_rejection | Unhandled promise rejection |
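
These type strings can drive routing in the `alert` event handler. A sketch that pages only for the serious cases; it assumes alerts expose their type under a `type` field, which the event payloads above don't show, and `pageOnCall` is a placeholder for a real paging hook:

```ts
// Placeholder paging hook -- swap in PagerDuty, Opsgenie, etc.
function pageOnCall(message: string) {
  console.log(`PAGE: ${message}`);
}

monitor.on('alert', (alert) => {
  if (alert.type === 'provider_down' || alert.severity === 'critical') {
    pageOnCall(alert.title); // assumed `type` field
  } else if (alert.severity === 'error') {
    console.error(`[${alert.type}] ${alert.message}`);
  }
});
```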

## Configuration

```ts
// Update config
monitor.configure({
  intervalMs: 30000,
  alertCooldownMs: 600000,
  thresholds: {
    cpu: 85,
    memory: 95,
    errorRate: 5,
  },
});
```
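
Runtime reconfiguration pairs well with environment-driven tuning. A sketch that feeds `configure()` from environment variables, assuming partial updates are accepted as the example above suggests; the variable names are illustrative, not read by the library itself:

```ts
// Tighten or relax thresholds per environment without code changes.
monitor.configure({
  thresholds: {
    cpu: Number(process.env.MONITOR_CPU_THRESHOLD ?? 80),
    memory: Number(process.env.MONITOR_MEMORY_THRESHOLD ?? 90),
    errorRate: Number(process.env.MONITOR_ERROR_RATE_THRESHOLD ?? 10),
  },
});
```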