# Monitoring and Alerting in System Design

## Overview

Monitoring and alerting are essential components for ensuring system reliability, performance, and availability. They help detect problems early and raise timely alerts.
## Types of Monitoring

### 1. Infrastructure Monitoring

- CPU, memory, and disk utilization
- Network throughput and latency
- Database performance metrics
- Container and orchestration metrics
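As a minimal illustration of collecting host-level metrics, the sketch below uses only the Python standard library (a real deployment would typically run an agent such as Prometheus's node_exporter instead); the function name and returned fields are illustrative assumptions:

```python
# Illustrative host-metrics snapshot using only the standard library.
# collect_host_metrics and its field names are hypothetical examples.
import os
import shutil

def collect_host_metrics(path="/"):
    total, used, _free = shutil.disk_usage(path)    # disk utilization
    load_1m, _l5, _l15 = os.getloadavg()            # CPU load averages (Unix)
    return {
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "disk_used_pct": round(used / total * 100, 1),
    }

metrics = collect_host_metrics()
```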
### 2. Application Monitoring

```python
# Application performance metrics built on the prometheus_client library
from prometheus_client import Counter, Histogram

class ApplicationMetrics:
    def __init__(self):
        self.response_time_histogram = Histogram(
            'response_time_seconds', 'Request duration in seconds')
        self.request_count = Counter('http_requests_total', 'Total HTTP requests')
        self.error_count = Counter('http_errors_total', 'Total HTTP errors')

    def record_request(self, method, endpoint, duration, status_code):
        self.response_time_histogram.observe(duration)
        self.request_count.inc()
        if status_code >= 400:
            self.error_count.inc()
```
### 3. Business Metrics

- User engagement metrics
- Revenue and conversion rates
- Feature adoption rates
- Customer satisfaction scores
## Key Metrics (Golden Signals)

### The Four Golden Signals
1. Latency: Request processing time
2. Traffic: Request rate
3. Errors: Error rate percentage
4. Saturation: Resource utilization
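In Prometheus terms, each of the four signals can be expressed as a query. The sketch below collects one illustrative PromQL expression per signal; the metric names `http_requests_total`, `http_errors_total`, and `response_time_seconds_bucket` follow the examples in this document, and `node_cpu_seconds_total` is the standard node_exporter CPU metric:

```python
# Illustrative PromQL expressions for the four golden signals.
# Metric names mirror the examples used elsewhere in this document.
GOLDEN_SIGNAL_QUERIES = {
    "latency": 'histogram_quantile(0.99, rate(response_time_seconds_bucket[5m]))',
    "traffic": 'rate(http_requests_total[5m])',
    "errors": 'rate(http_errors_total[5m]) / rate(http_requests_total[5m])',
    "saturation": 'avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}
```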
### SLA/SLO/SLI

```python
# Service Level Indicator examples computed from raw request data
class SLIMetrics:
    def __init__(self, successful_requests, failed_requests, response_times):
        self.successful = successful_requests
        self.failed = failed_requests
        self.response_times = sorted(response_times)

    def availability(self):
        # Percentage of successful requests
        total = self.successful + self.failed
        return (self.successful / total) * 100

    def latency_p99(self):
        # 99th-percentile response time (nearest-rank method)
        rank = max(int(len(self.response_times) * 0.99) - 1, 0)
        return self.response_times[rank]

    def error_rate(self):
        # Percentage of failed requests
        total = self.successful + self.failed
        return (self.failed / total) * 100
```
## Monitoring Tools and Technologies

### Metrics Collection

```yaml
# Prometheus configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'api-servers'
    scrape_interval: 5s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api1:8080', 'api2:8080']
```
### Log Aggregation

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "stack_trace": "..."
  }
}
```
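Producing records like the one above is straightforward with a custom log formatter that emits JSON. A minimal sketch using Python's standard `logging` module follows; the `service` value and field set mirror the example record and are assumptions, not a fixed schema:

```python
# Emit structured JSON logs so the aggregator can parse fields directly.
# Field names mirror the example log record above; "user-service" is assumed.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "user-service",
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["error"] = {"type": record.exc_info[0].__name__}
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.error("Database connection failed")
```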
## Alerting Strategies

### Alert Rules

```yaml
# Prometheus alerting rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(response_time_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
```
### Alert Escalation

```python
class AlertEscalation:
    def __init__(self):
        # Escalation levels, ordered by minutes since the alert fired
        self.escalation_levels = [
            {'time': 0, 'channels': ['slack']},
            {'time': 15, 'channels': ['email', 'sms']},
            {'time': 30, 'channels': ['pagerduty']},
        ]

    def escalate(self, alert, duration):
        # Notify only the highest level reached, so earlier levels are
        # not re-notified on every escalation check
        reached = [lvl for lvl in self.escalation_levels
                   if duration >= lvl['time']]
        if reached:
            self.send_notifications(alert, reached[-1]['channels'])
```
## Observability Stack

### Three Pillars of Observability
1. Metrics: Numerical measurements over time
2. Logs: Discrete events with context
3. Traces: Request flows across services
### Distributed Tracing

```python
# OpenTelemetry tracing example
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_user_request(user_id):
    with tracer.start_as_current_span("process_user_request") as span:
        span.set_attribute("user.id", user_id)
        # Nested span around the database call
        with tracer.start_as_current_span("database_query") as db_span:
            db_span.set_attribute("db.query", "SELECT * FROM users")
            user = get_user_from_db(user_id)  # application-specific lookup
        return user
```
## Dashboards and Visualization

### Key Dashboard Components

- System health overview
- Real-time metrics
- Historical trends
- Error rates and patterns
- Resource utilization
### Grafana Dashboard Example

```json
{
  "dashboard": {
    "title": "API Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      }
    ]
  }
}
```
## Incident Response

### Incident Lifecycle
1. Detection: Alert fires
2. Response: On-call engineer notified
3. Mitigation: Immediate fixes applied
4. Resolution: Root cause addressed
5. Post-mortem: Lessons learned documented
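The lifecycle above can be sketched as a small state machine that rejects out-of-order transitions; the state names and transition rules below are illustrative, not a standard:

```python
# Illustrative incident lifecycle as a state machine with allowed transitions
from enum import Enum

class IncidentState(Enum):
    DETECTED = "detection"
    RESPONDING = "response"
    MITIGATED = "mitigation"
    RESOLVED = "resolution"
    POSTMORTEM = "post-mortem"

# Each state may only advance to the next stage of the lifecycle
ALLOWED = {
    IncidentState.DETECTED: {IncidentState.RESPONDING},
    IncidentState.RESPONDING: {IncidentState.MITIGATED},
    IncidentState.MITIGATED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.POSTMORTEM},
    IncidentState.POSTMORTEM: set(),
}

def advance(current, target):
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```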
### Runbooks

```markdown
# High Error Rate Runbook

## Symptoms
- Error rate > 10% for 5+ minutes
- Multiple customer complaints
- Database timeout errors in logs

## Investigation Steps
1. Check database connectivity
2. Verify API server health
3. Review recent deployments
4. Check external dependencies

## Mitigation
- Scale up API servers
- Restart failed services
- Roll back the most recent deployment if needed
```
## Best Practices

### Monitoring Strategy
- Monitor business metrics, not just technical
- Set up proactive alerts, not just reactive
- Use percentiles, not just averages
- Implement graceful degradation
### Alert Fatigue Prevention

- Tune alert thresholds carefully
- Group related alerts
- Implement alert suppression
- Review and clean up alerts regularly
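Alert suppression can be sketched as a de-duplication window: repeat firings of the same alert within the window are dropped instead of re-notifying. The class name and default window below are illustrative assumptions:

```python
# Sketch of alert suppression: identical alerts within a time window
# are dropped rather than re-notified. Names here are illustrative.
import time

class AlertSuppressor:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}  # alert key -> timestamp of last notification

    def should_notify(self, alert_key, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.window:
            return False  # suppress duplicate within the window
        self.last_sent[alert_key] = now
        return True
```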
### Documentation

- Document all metrics and their meanings
- Maintain updated runbooks
- Create troubleshooting guides
- Record incident post-mortems
## Next Steps

This content will be expanded to cover:

- Advanced monitoring patterns
- Chaos engineering practices
- Performance testing integration
- Cost optimization monitoring
- Multi-cloud monitoring strategies