Monitoring and Alerting in System Design

Overview

Monitoring and alerting are essential components for ensuring system reliability, performance, and availability. They help detect problems early and raise timely alerts.

Types of Monitoring

1. Infrastructure Monitoring

- CPU, Memory, Disk utilization
- Network throughput and latency
- Database performance metrics
- Container and orchestration metrics
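
In practice these numbers usually come from an agent such as node_exporter or cAdvisor, but a few basics can be read with Python's standard library alone. A minimal sketch (the function name is illustrative):

```python
import os
import shutil

def collect_host_metrics(path="/"):
    """Gather a few basic host-level metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    metrics = {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": disk.total,
        "disk_used_percent": round(disk.used / disk.total * 100, 2),
    }
    # 1-minute load average is only available on Unix-like systems
    if hasattr(os, "getloadavg"):
        metrics["load_avg_1m"] = os.getloadavg()[0]
    return metrics
```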

2. Application Monitoring

# Application performance metrics (using the prometheus_client library)
from prometheus_client import Counter, Histogram

class ApplicationMetrics:
    def __init__(self):
        self.response_time_histogram = Histogram(
            'response_time_seconds', 'Request duration in seconds',
            ['method', 'endpoint'])
        self.request_count = Counter(
            'http_requests_total', 'Total HTTP requests',
            ['method', 'endpoint', 'status'])
        self.error_count = Counter(
            'http_errors_total', 'Total HTTP error responses',
            ['method', 'endpoint'])

    def record_request(self, method, endpoint, duration, status_code):
        self.response_time_histogram.labels(method, endpoint).observe(duration)
        self.request_count.labels(method, endpoint, str(status_code)).inc()
        if status_code >= 400:
            self.error_count.labels(method, endpoint).inc()

3. Business Metrics

- User engagement metrics
- Revenue and conversion rates
- Feature adoption rates
- Customer satisfaction scores
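
As a sketch of how a conversion-rate metric might be derived from raw events (the "visit" and "purchase" event names here are hypothetical):

```python
def conversion_rate(events):
    """Fraction of visiting users who also purchased.

    events: iterable of (user_id, event_name) pairs.
    """
    visitors = {user for user, event in events if event == "visit"}
    buyers = {user for user, event in events if event == "purchase"}
    return len(buyers & visitors) / len(visitors) if visitors else 0.0
```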

Key Metrics (Golden Signals)

The Four Golden Signals

1. Latency: Request processing time
2. Traffic: Request rate
3. Errors: Error rate percentage
4. Saturation: Resource utilization
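
All four signals can be derived from a window of request records. A minimal sketch, assuming each record carries a duration and an HTTP status, that 5xx responses count as errors, and that a rough capacity figure is known:

```python
def golden_signals(requests, window_seconds, capacity_rps):
    """Compute the Four Golden Signals over one observation window.

    requests: list of dicts like {"duration": 0.12, "status": 200}.
    capacity_rps: assumed known capacity, used to estimate saturation.
    """
    total = len(requests)
    durations = sorted(r["duration"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    traffic_rps = total / window_seconds
    return {
        "latency_p50": durations[total // 2] if durations else 0.0,
        "traffic_rps": traffic_rps,
        "error_rate": errors / total if total else 0.0,
        "saturation": traffic_rps / capacity_rps,
    }
```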

SLA/SLO/SLI

# Service Level Indicators example
class SLIMetrics:
    def __init__(self, total_requests, failed_requests, response_times):
        self.total_requests = total_requests
        self.failed_requests = failed_requests
        self.response_times = response_times

    def availability(self):
        # Percentage of successful requests
        successful = self.total_requests - self.failed_requests
        return (successful / self.total_requests) * 100

    def latency_p99(self):
        # 99th percentile response time (nearest-rank)
        ordered = sorted(self.response_times)
        index = max(0, int(len(ordered) * 0.99) - 1)
        return ordered[index]

    def error_rate(self):
        # Percentage of failed requests
        return (self.failed_requests / self.total_requests) * 100
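
SLIs like these feed directly into SLO tracking; a useful derived quantity is the remaining error budget. A sketch (the 0.999 target in the test is just an example):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the availability error budget still unspent.

    slo_target: e.g. 0.999 means at most 0.1% of requests may fail.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)
```

When the budget approaches zero, many teams freeze risky deployments until reliability recovers.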

Monitoring Tools and Technologies

Metrics Collection

# Prometheus configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'api-servers'
    static_configs:
      - targets: ['api1:8080', 'api2:8080']
    scrape_interval: 5s
    metrics_path: '/metrics'

Log Aggregation

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "stack_trace": "..."
  }
}
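
A formatter along these lines, built on Python's standard logging module, produces entries in that shape (a sketch: the hard-coded service name would normally come from configuration, and the trace_id from request context):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "user-service",  # would normally come from config
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "stack_trace": self.formatException(record.exc_info),
            }
        return json.dumps(entry)
```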

Alerting Strategies

Alert Rules

# Prometheus alerting rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(response_time_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
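
The `for:` clause is what keeps a brief spike from paging anyone: the condition must hold continuously for the whole duration before the alert fires. Its behavior can be sketched as follows (an illustrative class, not Prometheus internals):

```python
class PendingAlert:
    """Mimic a `for:` clause: fire only after the condition has held
    continuously for `hold_seconds`."""

    def __init__(self, hold_seconds):
        self.hold = hold_seconds
        self.since = None  # when the condition first became true

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.since = None  # any dip resets the pending timer
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.hold
```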

Alert Escalation

class AlertEscalation:
    def __init__(self):
        # Escalation levels by minutes elapsed since the alert fired
        self.escalation_levels = [
            {'time': 0, 'channels': ['slack']},
            {'time': 15, 'channels': ['email', 'sms']},
            {'time': 30, 'channels': ['pagerduty']}
        ]

    def escalate(self, alert, duration_minutes):
        # Notify every level whose time threshold has been passed
        for level in self.escalation_levels:
            if duration_minutes >= level['time']:
                self.send_notifications(alert, level['channels'])

    def send_notifications(self, alert, channels):
        # Integration point for Slack, email, SMS, PagerDuty, etc.
        ...

Observability Stack

Three Pillars of Observability

1. Metrics: Numerical measurements over time
2. Logs: Discrete events with context
3. Traces: Request flows across services

Distributed Tracing

# OpenTelemetry tracing example
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_user_request(user_id):
    with tracer.start_as_current_span("process_user_request") as span:
        span.set_attribute("user.id", user_id)

        # Call database
        with tracer.start_as_current_span("database_query") as db_span:
            user = get_user_from_db(user_id)
            db_span.set_attribute("db.query", "SELECT * FROM users")

        return user

Dashboards and Visualization

Key Dashboard Components

- System health overview
- Real-time metrics
- Historical trends
- Error rates and patterns
- Resource utilization

Grafana Dashboard Example

{
  "dashboard": {
    "title": "API Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      }
    ]
  }
}

Incident Response

Incident Lifecycle

1. Detection: Alert fires
2. Response: On-call engineer notified
3. Mitigation: Immediate fixes applied
4. Resolution: Root cause addressed
5. Post-mortem: Lessons learned documented

Runbooks

# High Error Rate Runbook

## Symptoms
- Error rate > 10% for 5+ minutes
- Multiple customer complaints
- Database timeout errors in logs

## Investigation Steps
1. Check database connectivity
2. Verify API server health
3. Review recent deployments
4. Check external dependencies

## Mitigation
- Scale up API servers
- Restart failed services
- Rollback recent deployment if needed

Best Practices

Monitoring Strategy

- Monitor business metrics, not just technical ones
- Set up proactive alerts, not just reactive
- Use percentiles, not just averages
- Implement graceful degradation
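
The "percentiles, not just averages" rule is easy to demonstrate: a small tail of slow requests barely moves the mean but dominates the high percentiles. A nearest-rank sketch:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value covering p% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(0, rank - 1)]

# 95 fast requests plus a 5% slow tail
latencies = [0.1] * 95 + [5.0] * 5
average = sum(latencies) / len(latencies)  # ~0.345s: looks acceptable
p99 = percentile(latencies, 99)            # 5.0s: exposes the slow tail
```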

Alert Fatigue Prevention

- Tune alert thresholds carefully
- Group related alerts
- Implement alert suppression
- Regular alert review and cleanup
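
Suppression can be as simple as a per-alert cooldown window: repeat notifications for the same alert are dropped until the window expires. A minimal sketch (class name and default window are illustrative):

```python
import time

class AlertSuppressor:
    """Drop repeat notifications for the same alert inside a cooldown window."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self.last_sent = {}

    def should_notify(self, alert_name, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_name)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window
        self.last_sent[alert_name] = now
        return True
```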

Documentation

- Document all metrics and their meanings
- Maintain updated runbooks
- Create troubleshooting guides
- Record incident post-mortems

Next Steps

This content will be expanded to cover:

- Advanced monitoring patterns
- Chaos engineering practices
- Performance testing integration
- Cost optimization monitoring
- Multi-cloud monitoring strategies