Amazon CloudWatch - Comprehensive Guide
📊 CloudWatch Overview
What is CloudWatch?
- Monitoring service: Collects and tracks metrics, logs, and events
- Real-time monitoring: Near real-time data processing
- Alerting: Automated responses to metric thresholds
- Dashboards: Visual monitoring interfaces
- Integration: Deep integration with AWS services
📈 CloudWatch Metrics
Default Metrics
EC2 Metrics:
- CPUUtilization
- NetworkIn/NetworkOut
- DiskReadOps/DiskWriteOps (instance store volumes; EBS volumes report in the AWS/EBS namespace)
- StatusCheckFailed
RDS Metrics:
- DatabaseConnections
- CPUUtilization
- FreeableMemory
- ReadLatency/WriteLatency
S3 Metrics:
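- BucketSizeBytes (daily storage metric)
- NumberOfObjects (daily storage metric)
- AllRequests (request metric; must be enabled per bucket and is billed like a custom metric)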
Custom Metrics
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Put custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp/Performance',
    MetricData=[
        {
            'MetricName': 'ResponseTime',
            'Value': 0.5,
            'Unit': 'Seconds',
            'Dimensions': [
                {
                    'Name': 'Environment',
                    'Value': 'Production'
                }
            ]
        }
    ]
)
```
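The same call can also publish high-resolution custom metrics. The sketch below assumes an already-created boto3 CloudWatch client is passed in; the helper name `put_response_time` and its parameters are hypothetical. Setting `StorageResolution` to 1 stores the datapoint at 1-second granularity instead of the standard 60 seconds.

```python
from datetime import datetime, timezone


def put_response_time(cloudwatch, seconds, environment="Production", high_resolution=False):
    """Publish one ResponseTime datapoint through a boto3 CloudWatch client (hypothetical helper)."""
    datum = {
        "MetricName": "ResponseTime",
        "Value": seconds,
        "Unit": "Seconds",
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [{"Name": "Environment", "Value": environment}],
        # 1 = high resolution (1-second granularity), 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }
    cloudwatch.put_metric_data(Namespace="MyApp/Performance", MetricData=[datum])
    return datum
```

Usage: `put_response_time(boto3.client("cloudwatch"), 0.5, high_resolution=True)`. Passing the client in, rather than creating it inside the helper, keeps the function easy to test with a stub client.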
🚨 CloudWatch Alarms
Alarm Types
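Metric Alarms:
Purpose: Watch a single metric or a metric math expression
Actions: SNS notifications, Auto Scaling policies, EC2 actions (stop, terminate, reboot, recover)
States: OK, ALARM, INSUFFICIENT_DATA
Composite Alarms:
Purpose: Combine the states of multiple alarms into one rule to reduce alert noise
Logic: AND, OR, NOT rule expressions
Actions: SNS notifications (no EC2 or Auto Scaling actions)
Use case: Complex monitoring scenarios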
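A composite alarm is defined by a rule expression over alarms that already exist. The sketch below assumes a boto3 CloudWatch client is passed in and uses the `ALARM("name")` rule syntax accepted by `PutCompositeAlarm`. The helper names and the child alarm `High-Latency` are hypothetical.

```python
def alarm_rule(alarm_names, operator="AND"):
    """Join child alarms into a rule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(cloudwatch, name, alarm_names, topic_arn, operator="AND"):
    """Create or update a composite alarm that notifies an SNS topic."""
    rule = alarm_rule(alarm_names, operator)
    cloudwatch.put_composite_alarm(
        AlarmName=name,
        AlarmRule=rule,
        AlarmActions=[topic_arn],  # composite alarms notify; they do not scale or stop instances
    )
    return rule
```

For example, `create_composite_alarm(cw, "Web-Degraded", ["High-CPU-Usage", "High-Latency"], topic_arn)` fires only when both child alarms are in ALARM. Each child alarm can keep its own actions disabled, so on-call engineers receive one page instead of several.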
Alarm Configuration
```json
{
  "AlarmName": "High-CPU-Usage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "AlarmActions": [
    "arn:aws:sns:region:account:topic-name"
  ]
}
```
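The same alarm can be created with boto3's `put_metric_alarm`. This sketch assumes a CloudWatch client is passed in. It adds an `InstanceId` dimension so that the alarm watches one instance. The helper name is hypothetical, and the instance ID and topic ARN are placeholders you supply.

```python
def create_high_cpu_alarm(cloudwatch, instance_id, topic_arn):
    """Create the High-CPU-Usage alarm through a boto3 CloudWatch client (hypothetical helper)."""
    params = {
        "AlarmName": "High-CPU-Usage",
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],  # scope to one instance
        "Statistic": "Average",
        "Period": 300,           # 5-minute periods
        "EvaluationPeriods": 2,  # both periods must breach, i.e. about 10 minutes above 80%
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }
    cloudwatch.put_metric_alarm(**params)
    return params
```

Calling `put_metric_alarm` again with the same `AlarmName` updates the existing alarm. This makes the helper safe to rerun from deployment scripts.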
📝 CloudWatch Logs
Log Groups & Streams
Log Groups:
Purpose: Container for log streams
Retention: 1 day to 10 years, or never expire (the default)
Encryption: Encrypted at rest by default; optional AWS KMS customer managed key
Log Streams:
Purpose: Sequence of log events
Source: Single source (instance, function)
Ordering: Chronological order
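With boto3, you can create a group and a stream, set retention, and send events. The sketch below assumes a CloudWatch Logs client is passed in. It also assumes that the group and stream do not exist yet, because creating an existing one raises an error. `ship_logs` is a hypothetical helper. The sort step matters because the events in one `PutLogEvents` batch must be in chronological order.

```python
def ship_logs(logs, group, stream, messages, retention_days=30):
    """Create a log group and stream, set retention, and send messages.

    `logs` is a boto3 CloudWatch Logs client; `messages` is a list of
    (epoch_millis, text) tuples. `retention_days` must be an allowed value
    such as 1, 7, 30, 365, or 3653.
    """
    logs.create_log_group(logGroupName=group)
    logs.put_retention_policy(logGroupName=group, retentionInDays=retention_days)
    logs.create_log_stream(logGroupName=group, logStreamName=stream)
    # Events in one PutLogEvents batch must be sorted by timestamp
    events = [{"timestamp": ts, "message": text} for ts, text in sorted(messages)]
    logs.put_log_events(logGroupName=group, logStreamName=stream, logEvents=events)
    return events
```

Usage: with `now = int(time.time() * 1000)`, call `ship_logs(boto3.client("logs"), "/myapp/web", "instance-1", [(now, "started")])`.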
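Logs Insights
```
# Count ERROR messages in 5-minute bins
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)

# 100 most recent ERROR entries (assumes JSON logs with level and message fields)
fields @timestamp, level, message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100
```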
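Logs Insights queries run asynchronously: `start_query` returns a query ID, and `get_query_results` is polled until the status is terminal. A minimal sketch, assuming a boto3 CloudWatch Logs client is passed in; `run_insights_query` is a hypothetical helper, and `start` and `end` are epoch seconds.

```python
import time


def run_insights_query(logs, group, query, start, end, poll_seconds=1.0, sleep=time.sleep):
    """Run a Logs Insights query and return each result row as a dict."""
    query_id = logs.start_query(
        logGroupName=group, startTime=start, endTime=end, queryString=query
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells
            return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout", "Unknown"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        sleep(poll_seconds)  # still Scheduled or Running
```

Because Logs Insights is billed per GB scanned, a narrow time range (`start` to `end`) also keeps queries cheap.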
📊 CloudWatch Dashboards
Dashboard Components
Widgets:
- Line charts: Time-series data
- Number widgets: Single metric value
- Text widgets: Markdown documentation
- Log widgets: Logs Insights query results
Features:
- Custom time ranges
- Auto-refresh
- Sharing capabilities
- Cross-Region and cross-account metrics on one dashboard
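A dashboard is stored as a JSON body with a list of widgets on a 24-column grid. The sketch below builds one widget of each kind listed above. It assumes a boto3 CloudWatch client for publishing. The helper names, runbook text, and log group are hypothetical. Each widget carries its own `region`, which is how one dashboard mixes Regions.

```python
import json


def build_dashboard_body(instance_id, log_group, region="us-east-1"):
    """Dashboard with a line chart, a number widget, a text widget, and a log widget."""
    cpu = [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]]
    error_query = (
        f"SOURCE '{log_group}' | fields @timestamp, @message "
        "| filter @message like /ERROR/ | sort @timestamp desc | limit 20"
    )
    widgets = [
        # Line chart: time series of average CPU
        {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
         "properties": {"metrics": cpu, "view": "timeSeries", "stat": "Average",
                        "period": 300, "region": region, "title": "CPU utilization"}},
        # Number widget: latest value of the same metric
        {"type": "metric", "x": 12, "y": 0, "width": 6, "height": 6,
         "properties": {"metrics": cpu, "view": "singleValue", "stat": "Average",
                        "period": 300, "region": region, "title": "CPU now"}},
        # Text widget: Markdown notes for on-call engineers
        {"type": "text", "x": 18, "y": 0, "width": 6, "height": 6,
         "properties": {"markdown": "## Runbook\nEscalate if CPU stays above 80%."}},
        # Log widget: a Logs Insights query rendered on the dashboard
        {"type": "log", "x": 0, "y": 6, "width": 24, "height": 6,
         "properties": {"query": error_query, "region": region, "title": "Recent errors"}},
    ]
    return {"widgets": widgets}


def publish_dashboard(cloudwatch, name, body):
    """`cloudwatch` is a boto3 CloudWatch client; the API expects the body as a JSON string."""
    return cloudwatch.put_dashboard(DashboardName=name, DashboardBody=json.dumps(body))
```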
⚡ CloudWatch Events (now Amazon EventBridge)
Event Sources
AWS Services:
- EC2 state changes
- Auto Scaling group events
- CodePipeline state changes
- Scheduled events (cron and rate expressions)
Custom Applications:
- Custom event patterns
- Third-party integrations
- API Gateway integration
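Scheduled rules use a `cron(...)` or `rate(...)` expression, such as `rate(5 minutes)`, instead of an event pattern. EventBridge cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year) and are evaluated in UTC. One of the two day fields must be `?`. A sketch assuming a boto3 EventBridge client is passed in; the helper and rule names are hypothetical.

```python
def daily_cron(hour, minute=0):
    """EventBridge cron fields: minutes hours day-of-month month day-of-week year (UTC)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # day-of-month is '*', so day-of-week must be '?'
    return f"cron({minute} {hour} * * ? *)"


def create_schedule_rule(events, name, hour, minute=0):
    """Create a daily scheduled rule; returns the rule ARN."""
    response = events.put_rule(
        Name=name,
        ScheduleExpression=daily_cron(hour, minute),
        State="ENABLED",
    )
    return response["RuleArn"]
```

A target (for example a Lambda function) is then attached with `put_targets`, the same way as for pattern-based rules.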
Event Rules
Illustrative rule definition (event pattern plus a Lambda target; not the literal shape of an API request):
```json
{
  "Rules": [
    {
      "Name": "EC2InstanceStateChange",
      "EventPattern": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {
          "state": ["running", "stopped"]
        }
      },
      "Targets": [
        {
          "Id": "1",
          "Arn": "arn:aws:lambda:region:account:function:ProcessEC2Event"
        }
      ]
    }
  ]
}
```
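Through the API, this rule takes two calls: `put_rule` with the pattern serialized as a JSON string, then `put_targets`. A sketch assuming a boto3 EventBridge client is passed in; the helper name is hypothetical. The Lambda function must also grant `events.amazonaws.com` permission to invoke it, which is not shown.

```python
import json

# Same pattern as the rule definition above
EC2_STATE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["running", "stopped"]},
}


def create_ec2_state_rule(events, function_arn, name="EC2InstanceStateChange"):
    """Create the rule and attach a Lambda target; returns the rule ARN."""
    rule_arn = events.put_rule(
        Name=name,
        EventPattern=json.dumps(EC2_STATE_PATTERN),  # the API takes the pattern as a string
        State="ENABLED",
    )["RuleArn"]
    events.put_targets(Rule=name, Targets=[{"Id": "1", "Arn": function_arn}])
    return rule_arn
```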
🔍 CloudWatch Container Insights
ECS Monitoring
Container Insights:
Metrics collected:
- CPU and memory utilization
- Network and disk I/O
- Task and service metrics
Log collection:
- Container logs
- Application logs
- Performance metrics
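On ECS, Container Insights is a cluster-level setting. A minimal sketch assuming a boto3 ECS client is passed in; the helper name and the cluster name are hypothetical.

```python
def enable_container_insights(ecs, cluster):
    """Turn on Container Insights for an existing ECS cluster (name or ARN)."""
    response = ecs.update_cluster_settings(
        cluster=cluster,
        settings=[{"name": "containerInsights", "value": "enabled"}],
    )
    # Return the cluster settings as ECS reports them after the update
    return response["cluster"]["settings"]
```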
EKS Monitoring
Kubernetes Metrics:
- Cluster utilization
- Node performance
- Pod metrics
- Namespace resource usage
Log Sources:
- Application logs
- Kubernetes API server logs (via EKS control plane logging)
- Controller manager logs (via EKS control plane logging)
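Control plane log types are turned on per cluster, and EKS writes them to the `/aws/eks/<cluster-name>/cluster` log group. A sketch assuming a boto3 EKS client is passed in; the helper name is hypothetical. The valid types are `api`, `audit`, `authenticator`, `controllerManager`, and `scheduler`.

```python
def enable_control_plane_logs(eks, cluster_name, log_types=("api", "controllerManager")):
    """Send selected EKS control plane logs to CloudWatch Logs; returns the update ID."""
    response = eks.update_cluster_config(
        name=cluster_name,
        logging={"clusterLogging": [{"types": list(log_types), "enabled": True}]},
    )
    # The change is asynchronous; poll describe_update with this ID for its status
    return response["update"]["id"]
```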
💰 CloudWatch Pricing
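Cost Components (example us-east-1 prices; rates vary by Region)
Metrics:
Default AWS service metrics (basic monitoring): Free
Custom metrics (including detailed monitoring): First 10 free under the free tier, then $0.30/metric/month for the first 10,000 metrics, with lower rates at higher volumes
High-resolution custom metrics: $0.30/metric/month (same per-metric rate as standard resolution)
Logs (Standard log class):
Ingestion: $0.50/GB
Storage: $0.03/GB/month
Insights queries: $0.005/GB scanned
Alarms:
Standard resolution: $0.10/alarm metric/month (first 10 free under the free tier)
High-resolution: $0.30/alarm metric/month
Composite: $0.50/alarm/month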
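The rates combine into a rough monthly estimate by simple multiplication. The sketch below uses the example prices above and ignores the free tier and volume discounts. The helper and rate names are hypothetical. For example, 50 custom metrics, 100 GB ingested, 500 GB stored, 200 GB scanned, 20 standard alarms, and 5 high-resolution alarms give $15.00 + $66.00 + $3.50 = $84.50 per month at these rates.

```python
# Example us-east-1 prices from the list above, in USD; verify against current pricing
RATES = {
    "custom_metric": 0.30,      # per metric per month (first 10,000 tier)
    "log_ingest_gb": 0.50,      # per GB ingested (Standard log class)
    "log_storage_gb": 0.03,     # per GB stored per month
    "insights_scan_gb": 0.005,  # per GB scanned by Logs Insights
    "alarm_standard": 0.10,     # per standard-resolution alarm metric per month
    "alarm_high_res": 0.30,     # per high-resolution alarm metric per month
}


def monthly_estimate(custom_metrics=0, ingest_gb=0.0, stored_gb=0.0, scanned_gb=0.0,
                     standard_alarms=0, high_res_alarms=0, rates=RATES):
    """Rough monthly CloudWatch cost; ignores the free tier and volume discounts."""
    costs = {
        "metrics": custom_metrics * rates["custom_metric"],
        "logs": (ingest_gb * rates["log_ingest_gb"]
                 + stored_gb * rates["log_storage_gb"]
                 + scanned_gb * rates["insights_scan_gb"]),
        "alarms": (standard_alarms * rates["alarm_standard"]
                   + high_res_alarms * rates["alarm_high_res"]),
    }
    costs["total"] = round(sum(costs.values()), 2)
    return costs
```

Log ingestion usually dominates. In the example, it accounts for $50 of the $84.50, so filtering noisy logs before shipping tends to save more than trimming metrics.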
🛠️ Monitoring Best Practices
Metric Strategy
Golden Signals:
Latency: Response time metrics
Traffic: Request rate metrics
Errors: Error rate metrics
Saturation: Resource utilization
Business Metrics:
- Revenue-impacting KPIs
- User experience metrics
- Application-specific metrics
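To make the four golden signals concrete, the sketch below reduces one time window of requests to one value per signal. These values could then be published as custom metrics. The input format, the choice of p99 for latency, 5xx responses for errors, and CPU percent for saturation are all illustrative assumptions. The helper name is hypothetical.

```python
def golden_signals(requests, window_seconds, cpu_percent):
    """Summarize one time window into the four golden signals.

    `requests` is a list of (latency_seconds, status_code) tuples (hypothetical format);
    `cpu_percent` stands in for saturation.
    """
    count = len(requests)
    latencies = sorted(latency for latency, _ in requests)
    # Nearest-rank p99: rank = ceil(0.99 * count), computed with integers
    p99 = latencies[(99 * count + 99) // 100 - 1] if count else None
    server_errors = sum(1 for _, status in requests if status >= 500)
    return {
        "Latency": p99,                                     # p99 response time, seconds
        "Traffic": count / window_seconds,                  # requests per second
        "Errors": server_errors / count if count else 0.0,  # share of 5xx responses
        "Saturation": cpu_percent,                          # resource utilization, percent
    }
```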
Alert Design
Effective Alerting:
- Clear severity levels
- Actionable notifications
- Avoid alert fatigue
- Include context information
Escalation:
- Primary on-call
- Backup contacts
- Manager escalation
- Automated remediation
🚀 Integration Patterns
Lambda Integration
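```python
import json

import boto3

# Create the client once per container so warm invocations reuse it
auto_scaling = boto3.client('autoscaling')


def lambda_handler(event, context):
    # CloudWatch delivers the alarm through SNS; the alarm details are a JSON string
    message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = message['AlarmName']

    if message['NewStateValue'] == 'ALARM':
        # Remediation: raise the group's desired capacity ('my-asg' is a placeholder)
        auto_scaling.set_desired_capacity(
            AutoScalingGroupName='my-asg',
            DesiredCapacity=5
        )
        return {'alarm': alarm_name, 'action': 'scaled_out'}

    return {'alarm': alarm_name, 'action': 'none'}
```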
Auto Scaling Integration
CloudWatch → Auto Scaling:
Trigger: CPU utilization > 70%
Action: Scale out (add instances)
Trigger: CPU utilization < 30%
Action: Scale in (remove instances)
Cooldown: Waiting period after a scaling activity that prevents rapid, repeated scaling
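The 70%/30% rules above map to two simple scaling policies, each triggered by its own CPU alarm. A sketch assuming boto3 Auto Scaling and CloudWatch clients are passed in; the helper, policy, and alarm names are hypothetical. Target tracking policies are an alternative that create and manage these alarms for you.

```python
def wire_cpu_scaling(autoscaling, cloudwatch, asg_name, cooldown=300):
    """Create scale-out/scale-in simple scaling policies and the alarms that trigger them."""
    plans = [
        # (suffix, capacity change, CPU threshold, comparison)
        ("scale-out", 1, 70.0, "GreaterThanThreshold"),
        ("scale-in", -1, 30.0, "LessThanThreshold"),
    ]
    alarm_names = []
    for suffix, change, threshold, comparison in plans:
        policy_arn = autoscaling.put_scaling_policy(
            AutoScalingGroupName=asg_name,
            PolicyName=f"{asg_name}-{suffix}",
            PolicyType="SimpleScaling",
            AdjustmentType="ChangeInCapacity",
            ScalingAdjustment=change,
            Cooldown=cooldown,  # seconds before another simple-scaling activity may start
        )["PolicyARN"]
        alarm_name = f"{asg_name}-cpu-{suffix}"
        cloudwatch.put_metric_alarm(
            AlarmName=alarm_name,
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=2,
            Threshold=threshold,
            ComparisonOperator=comparison,
            AlarmActions=[policy_arn],  # the alarm invokes the scaling policy
        )
        alarm_names.append(alarm_name)
    return alarm_names
```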
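📝 Exam Tips for AWS SAA
Key Concepts
- Default vs custom metrics: What's included vs extra cost (EC2 memory and disk-space utilization are not default metrics; the CloudWatch agent publishes them as custom metrics)
- Alarm states: OK, ALARM, INSUFFICIENT_DATA
- Log retention: Cost implications of long retention
- High-resolution metrics: Custom metrics stored at 1-second granularity; high-resolution alarms (10- or 30-second periods) cost more than standard alarms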
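The three alarm states can be illustrated with a deliberately simplified model of a `GreaterThanThreshold` alarm. It assumes `DatapointsToAlarm` equals `EvaluationPeriods`. It also treats any missing period in the window as INSUFFICIENT_DATA; real alarms make this configurable through `TreatMissingData`. The function is a hypothetical teaching aid, not CloudWatch's evaluation algorithm.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified alarm state for GreaterThanThreshold.

    `datapoints` holds one aggregated value per period, oldest first; None means no data.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or any(value is None for value in window):
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in window):
        return "ALARM"
    return "OK"
```

With the High-CPU-Usage settings (threshold 80, two evaluation periods), `alarm_state([50, 85, 90], 80, 2)` is `"ALARM"`. `alarm_state([90, 85, 70], 80, 2)` is `"OK"` because the latest period recovered.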
Common Scenarios
- Auto Scaling: CloudWatch metrics trigger scaling
- Cost optimization: Right-size based on metrics
- Troubleshooting: Use logs and metrics together
- Compliance: Log retention for audit requirements
📖 Summary
CloudWatch provides a comprehensive monitoring solution for AWS:
- Metrics collection from AWS services and custom applications
- Real-time alerting with automated response capabilities
- Log aggregation and analysis with Logs Insights
- Visualization with customizable dashboards
- Integration with other AWS services for automation
Understanding CloudWatch's capabilities and pricing model is essential for designing cost-effective monitoring solutions, a recurring topic in the AWS SAA exam.