Domain 1: Design Resilient Architectures (30%)
🎯 Overview
Design architectures that withstand failures and recover automatically, ensuring high availability and fault tolerance.
📚 Key Concepts
1. High Availability (HA)
- Definition: The system operates continuously with minimal downtime
- Target: 99.9% - 99.99% uptime
- Implementation:
  - Multiple Availability Zones
  - Load balancing
  - Auto scaling
  - Health checks
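To make the uptime targets above concrete, the sketch below converts an availability figure into allowed downtime per year and estimates the combined availability of redundant copies. The function names are illustrative, and the redundancy formula assumes failures are independent, which real Availability Zones only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability given as a fraction (0.999 = 99.9%)."""
    return (1 - availability) * HOURS_PER_YEAR

def redundant_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures (an assumption)."""
    return 1 - (1 - single) ** copies

if __name__ == "__main__":
    for target in (0.999, 0.9999):
        print(f"{target:.2%} -> {downtime_hours_per_year(target):.2f} h/year")
    print(f"two 99% copies -> {redundant_availability(0.99, 2):.4%}")
```

A 99.9% target leaves roughly 8.76 hours of downtime per year, while 99.99% leaves under an hour, which is why the higher target usually requires redundancy across AZs.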
2. Fault Tolerance
- Definition: The system keeps operating when some components fail
- Strategies:
  - Redundancy
  - Graceful degradation
  - Circuit breaker patterns
  - Bulkhead isolation
3. Disaster Recovery (DR)
- RTO (Recovery Time Objective): Maximum acceptable time to restore service after a failure
- RPO (Recovery Point Objective): Maximum acceptable data loss, measured as time since the last recovery point
DR Strategies (from least to most expensive; figures are rough guides):
- Backup & Restore (RPO: hours, RTO: up to 24 hours)
- Pilot Light (RPO: minutes, RTO: hours)
- Warm Standby (RPO: seconds, RTO: minutes)
- Multi-Site Active/Active (RPO: near real-time, RTO: near zero with automatic failover)
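Because the strategies above are ordered by cost, picking one amounts to finding the first entry that meets both objectives. The sketch below shows that selection. The numeric thresholds are illustrative stand-ins for the rough "hours / minutes / seconds" figures in the list, not AWS-published values, and `DrStrategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DrStrategy:
    name: str
    rpo_seconds: float  # rough worst-case data loss
    rto_seconds: float  # rough worst-case recovery time

HOUR = 3600
# Ordered from least to most expensive. All numbers are assumptions chosen
# only to mirror the orders of magnitude in the list above.
STRATEGIES = [
    DrStrategy("Backup & Restore", rpo_seconds=12 * HOUR, rto_seconds=24 * HOUR),
    DrStrategy("Pilot Light", rpo_seconds=15 * 60, rto_seconds=4 * HOUR),
    DrStrategy("Warm Standby", rpo_seconds=30, rto_seconds=15 * 60),
    DrStrategy("Multi-Site Active/Active", rpo_seconds=0, rto_seconds=0),
]

def cheapest_strategy(required_rpo: float, required_rto: float) -> DrStrategy:
    """Return the first (cheapest) strategy that meets both objectives."""
    for strategy in STRATEGIES:
        if strategy.rpo_seconds <= required_rpo and strategy.rto_seconds <= required_rto:
            return strategy
    # Active/Active is modeled with zero RPO/RTO, so only negative inputs reach here.
    raise ValueError("objectives must be non-negative")
```

For example, a workload that tolerates one hour of data loss and eight hours of downtime lands on Pilot Light under these assumed thresholds.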
🏗️ Multi-Tier Architecture Solutions
3-Tier Architecture
Internet Gateway
↓
Application Load Balancer (Public Subnet)
↓
Web Servers (Private Subnet)
↓
Database (Private Subnet/Isolated)
Components:
- Presentation Tier: Web servers, load balancers
- Application Tier: Application servers, business logic
- Data Tier: Databases, caching layers
Best Practices:
- Separate subnets for each tier
- Restrictive security groups
- Private subnets for sensitive components
- Multiple AZs for redundancy
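The "separate subnets per tier, across multiple AZs" practice can be planned before any resource is created. The sketch below uses Python's standard `ipaddress` module to carve one non-overlapping subnet per tier and AZ out of a VPC range. The CIDR `10.0.0.0/16`, the tier names, and `plan_subnets` are assumptions for illustration.

```python
import ipaddress

def plan_subnets(vpc_cidr: str, tiers: list[str], azs: list[str], new_prefix: int = 24):
    """Assign one non-overlapping subnet per (tier, AZ) pair from the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of consecutive blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(subnets)
    return plan

if __name__ == "__main__":
    plan = plan_subnets("10.0.0.0/16", ["web", "app", "data"], ["us-east-1a", "us-east-1b"])
    for (tier, az), cidr in plan.items():
        print(f"{tier:<5} {az}  {cidr}")
```

Keeping each tier in its own subnet per AZ is what lets route tables and security groups differ between tiers.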
🔧 Decoupling Mechanisms
1. Load Balancers
- Application Load Balancer (ALB):
  - Layer 7 (HTTP/HTTPS)
  - Content-based routing
  - WebSocket support
- Network Load Balancer (NLB):
  - Layer 4 (TCP/UDP)
  - Ultra-low latency
  - Static IP addresses
- Classic Load Balancer (CLB):
  - Legacy option
  - Basic load balancing
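Content-based routing on an ALB means listener rules are evaluated in priority order and the first match decides the target group. The sketch below mimics that idea with simple path prefixes. Real ALB rules also support wildcards, host headers, and other conditions. The rule list and target group names are hypothetical.

```python
def route(path: str, rules: list[tuple[str, str]], default: str) -> str:
    """Return the target group of the first rule whose prefix matches the path."""
    for prefix, target_group in rules:
        if path.startswith(prefix):
            return target_group
    return default  # plays the role of the listener's default action

# Hypothetical rules, listed in priority order.
RULES = [
    ("/api/", "api-servers"),
    ("/static/", "static-servers"),
]
```

With these rules, `route("/api/users", RULES, "web-servers")` selects `api-servers`, and any unmatched path falls through to the default group.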
2. Message Queues
- Amazon SQS:
  - Standard queues (at-least-once delivery)
  - FIFO queues (exactly-once processing)
  - Dead letter queues
- Amazon SNS:
  - Pub/Sub messaging
  - Multiple delivery protocols
  - Fan-out patterns
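A dead letter queue catches messages that keep failing, so one bad message cannot block or endlessly recycle through the main queue. The in-memory sketch below simulates that behavior. It is not the SQS API, and `process_with_dlq` and `max_receive_count` are illustrative names modeled loosely on the SQS redrive setting.

```python
from collections import deque

def process_with_dlq(messages, handler, max_receive_count=3):
    """Deliver each message until the handler succeeds or the receive limit is hit."""
    queue = deque((msg, 0) for msg in messages)  # (body, receive count so far)
    processed, dead_letters = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            processed.append(body)
        except Exception:
            if receives >= max_receive_count:
                dead_letters.append(body)       # give up: move to the DLQ
            else:
                queue.append((body, receives))  # becomes visible again for a retry
    return processed, dead_letters
```

Because standard queues deliver at least once, a real handler should also be idempotent, since the same message may arrive more than once even when it succeeds.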
3. Event-Driven Architecture
- Amazon EventBridge:
  - Event bus service
  - Custom applications integration
  - SaaS applications integration
4. API Gateway
- Features:
  - Rate limiting
  - Authentication
  - Request/response transformation
  - Caching
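API Gateway's rate limiting is based on the token bucket algorithm: tokens refill at a steady rate, and each request spends one. The sketch below is a minimal in-process version of that idea, not API Gateway's implementation. The `clock` parameter is an assumption added so the behavior can be tested without waiting.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a gateway would answer with HTTP 429 here
```

The burst size absorbs short spikes, while the refill rate bounds sustained traffic, which protects the backends behind the gateway.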
🌍 Multi-AZ & Multi-Region Architectures
Multi-AZ Deployment
Region (us-east-1)
├── AZ-1a: Web Server, DB Primary
├── AZ-1b: Web Server, DB Standby
└── AZ-1c: Web Server
Multi-Region Deployment
Primary Region (us-east-1)
├── Full application stack
├── Database primary
└── Active traffic
Secondary Region (us-west-2)
├── Disaster recovery stack
├── Database replica
└── Standby mode
💾 Storage Resilience
Amazon S3
- Durability: 99.999999999% (11 nines)
- Availability: 99.99% (S3 Standard, by design)
- Cross-Region Replication (CRR)
- Versioning & MFA Delete
- Storage Classes:
  - Standard (frequent access)
  - Standard-IA (infrequent access)
  - Glacier (archival)
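The 11 nines durability figure is easier to grasp as an expected loss rate. The sketch below does that arithmetic with `Decimal` to avoid floating-point noise at this scale. It treats durability as an annual per-object probability, which is how the figure is usually read. The function name is illustrative.

```python
from decimal import Decimal

DURABILITY = Decimal("0.99999999999")  # 11 nines, annual, per object

def expected_objects_lost_per_year(object_count: int) -> Decimal:
    """Average number of objects lost per year at the designed durability."""
    return object_count * (1 - DURABILITY)
```

For 10,000,000 stored objects this gives 0.0001 objects per year, in other words about one lost object every 10,000 years on average.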
Amazon EBS
- Snapshots: Point-in-time backups
- Multi-Attach: Attach one io1/io2 volume to multiple instances in the same AZ
- Encryption: At rest and in transit
- Volume Types: gp3, io2, etc.
Amazon EFS
- Regional service (data stored across multiple AZs)
- Backup: Automatic backups
- Replication: Cross-region
🗄️ Database Resilience
Amazon RDS
- Multi-AZ Deployments:
  - Synchronous replication
  - Automatic failover
  - Backups taken from the standby
- Read Replicas:
  - Asynchronous replication
  - Read scaling
  - Cross-region replicas
Amazon DynamoDB
- Global Tables: Multi-region replication
- Point-in-Time Recovery: Restore to any point in the last 35 days
- On-Demand Backup: Full backups
- Auto Scaling: Based on demand
🔄 Auto Scaling Strategies
EC2 Auto Scaling
Auto Scaling Group:
  MinSize: 2
  DesiredCapacity: 4
  MaxSize: 10
Scaling Policies:
  - Target CPU > 70%: Scale Out
  - Target CPU < 30%: Scale In
Health Checks:
  - EC2 Status Check
  - ELB Health Check
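The configuration above combines two ideas: a threshold policy decides the direction, and the group's size limits bound the result. The sketch below expresses that decision as a pure function using the same values (MinSize 2, MaxSize 10, 70% and 30% CPU). The step of one instance per decision is an assumption, since the configuration does not state a step size.

```python
def next_desired_capacity(current: int, avg_cpu: float, *, min_size: int = 2,
                          max_size: int = 10, scale_out_above: float = 70.0,
                          scale_in_below: float = 30.0, step: int = 1) -> int:
    """Apply the threshold policy, then clamp to the group's size limits."""
    if avg_cpu > scale_out_above:
        desired = current + step   # scale out
    elif avg_cpu < scale_in_below:
        desired = current - step   # scale in
    else:
        desired = current          # inside the band: no change
    return max(min_size, min(max_size, desired))
```

The gap between the two thresholds keeps the group from oscillating when CPU hovers near a single cut-off.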
Application Auto Scaling
- DynamoDB: Read/Write capacity
- ECS: Service tasks
- Lambda: Provisioned concurrency
📊 Monitoring & Health Checks
Amazon CloudWatch
- Metrics: CPU, disk I/O, network (memory and disk usage require the CloudWatch agent)
- Alarms: Threshold-based notifications
- Dashboards: Visual monitoring
- Logs: Centralized logging
AWS Systems Manager
- Session Manager: Secure shell access
- Parameter Store: Configuration management
- Patch Manager: OS patching
Health Checks
- Route 53 Health Checks:
  - Endpoint monitoring
  - Calculated health checks
  - DNS failover
- ELB Health Checks:
  - HTTP/HTTPS checks
  - TCP health checks
  - Custom health check paths
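Health checks usually change a target's state only after several consecutive results agree, so a single dropped probe does not trigger failover. The sketch below models that with healthy and unhealthy thresholds, similar in spirit to the threshold counts on ELB target groups. The class name and default values are illustrative assumptions.

```python
class HealthTracker:
    """Flip state only after N consecutive results disagree with the current state."""

    def __init__(self, healthy_threshold: int = 3, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

Requiring more successes to recover than failures to eject, as in these defaults, biases the system toward keeping a flapping target out of rotation.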
🛡️ Resilience Patterns
1. Circuit Breaker
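# Simplified sketch: fail fast with a fallback while the failure rate is too high
def guarded_call(call_service, cached_response, failure_rate, threshold):
    if failure_rate > threshold:
        return cached_response  # circuit open: skip the failing service
    return call_service()       # circuit closed: call the service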
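The check above captures the core idea, but a practical circuit breaker also needs to track failures itself and to probe the service again after a cool-down. The sketch below adds those states (closed, open, half-open). It counts consecutive failures rather than a failure rate, which is a simplification. All names and defaults are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, do not touch the service
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open the failing service receives no traffic at all, which gives it room to recover and keeps callers from piling up on timeouts.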
2. Retry with Exponential Backoff
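# Retry with exponential backoff and jitter
import random
import time

def call_with_retry(call_service, max_retries=5):
    for attempt in range(max_retries):
        try:
            return call_service()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter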
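Unbounded exponential delays grow quickly, so retry logic commonly caps the delay. The sketch below shows a capped variant that uses "full jitter", where the whole delay is drawn at random below the exponential ceiling. This is a different jitter scheme from the additive jitter above, shown here as an alternative. The `base`, `cap`, and `rng` parameters are assumptions for illustration.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng=random.random) -> float:
    """Delay before retry `attempt` (0-based): random value below min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling
```

Spreading retries randomly across the window keeps many clients from retrying at the same instant after a shared outage.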
3. Bulkhead Isolation
- Separate resources for different functions
- Prevent cascade failures
- Independent scaling
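Inside a single process, one way to apply bulkhead isolation is to cap how many concurrent calls each dependency may hold, so a slow dependency cannot consume every worker. The sketch below uses a semaphore for that and rejects calls once the compartment is full. The `Bulkhead` class is a hypothetical helper, not a library API.

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing when all slots are taken.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance means a stall in one of them leaves capacity for the others, which is the cascade prevention the list describes.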
🎯 Common Scenarios & Solutions
Scenario 1: Web Application HA
Requirement: 99.9% availability
Solution:
- Multi-AZ ALB
- Auto Scaling Group (min 2 instances)
- RDS Multi-AZ
- S3 static content
- CloudFront CDN
Scenario 2: API Service Resilience
Requirement: Handle traffic spikes
Solution:
- API Gateway with caching
- Lambda functions
- DynamoDB auto scaling
- SQS queue for async processing
Scenario 3: Data Processing Pipeline
Requirement: Process large datasets reliably
Solution:
- S3 event notifications
- SQS/SNS for decoupling
- Lambda/ECS for processing
- DLQ for failed messages
- CloudWatch monitoring
📝 Best Practices Checklist
Design Principles
- [ ] Design for failure
- [ ] Implement redundancy
- [ ] Use managed services
- [ ] Decouple components
- [ ] Implement graceful degradation
Architecture Review
- [ ] Single points of failure identified
- [ ] Multiple AZs utilized
- [ ] Auto scaling configured
- [ ] Health checks implemented
- [ ] Monitoring and alerting set up
Testing
- [ ] Chaos engineering
- [ ] Disaster recovery drills
- [ ] Load testing
- [ ] Failover testing
🔍 Practice Questions
- Multi-AZ RDS: When should you use Multi-AZ versus Read Replicas?
- Auto Scaling: How do you configure predictive scaling?
- Load Balancer: When should you choose ALB, NLB, or Gateway Load Balancer (GWLB)?
- Storage: What are the use cases for S3 Cross-Region Replication?
- Disaster Recovery: How do the DR strategies compare?