Domain 1: Design Resilient Architectures (30%)

🎯 Overview

Design architectures that withstand failures and recover automatically, ensuring high availability and fault tolerance.

📚 Key Concepts

1. High Availability (HA)

  • Definition: The system runs continuously with minimal downtime
  • Target: 99.9% - 99.99% uptime
  • Implementation:
      • Multiple Availability Zones
      • Load balancing
      • Auto scaling
      • Health checks
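
To make the uptime targets above concrete, the following sketch converts an availability percentage into a yearly downtime budget. The function name `allowed_downtime_hours` is made up for illustration, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Under these assumptions, 99.9% leaves roughly 8.76 hours of downtime per year, while 99.99% leaves under an hour.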

2. Fault Tolerance

  • Definition: The system keeps operating when some components fail
  • Strategies:
      • Redundancy
      • Graceful degradation
      • Circuit breaker patterns
      • Bulkhead isolation
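
Why redundancy helps can be shown with a little arithmetic. The sketch below assumes component failures are independent, which real deployments only approximate; both function names are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies when any single copy can serve traffic."""
    return 1 - (1 - component_availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Two redundant 99% components versus two 99% components chained in series.
print(f"redundant pair: {parallel_availability(0.99, 2):.4f}")
print(f"serial chain:   {serial_availability(0.99, 0.99):.4f}")
```

Under the independence assumption, a redundant pair of 99% components reaches about 99.99%, while chaining the same two components in series drops to about 98.01%.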

3. Disaster Recovery (DR)

  • RTO (Recovery Time Objective): Maximum acceptable time to restore service after a failure
  • RPO (Recovery Point Objective): Maximum acceptable data loss, measured as time since the last recoverable point

DR Strategies (from least to most expensive):

  1. Backup & Restore (RPO: hours, RTO: up to 24 hours)
  2. Pilot Light (RPO: minutes, RTO: hours)
  3. Warm Standby (RPO: seconds, RTO: minutes)
  4. Multi-Site Active/Active (RPO: near zero, RTO: near zero, traffic shifts automatically)
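
Choosing a strategy usually means picking the cheapest one that still meets the required RPO and RTO. A minimal sketch of that selection follows; the numeric limits in `DR_STRATEGIES` are illustrative assumptions chosen to match the rough orders of magnitude in the list, not AWS guarantees, and `choose_dr_strategy` is a hypothetical helper.

```python
# (name, worst-case RPO in seconds, worst-case RTO in seconds), cheapest first.
# The numbers are illustrative assumptions, not AWS-published figures.
DR_STRATEGIES = [
    ("Backup & Restore", 24 * 3600, 24 * 3600),
    ("Pilot Light", 15 * 60, 4 * 3600),
    ("Warm Standby", 60, 15 * 60),
    ("Multi-Site Active/Active", 0, 0),
]


def choose_dr_strategy(rpo_target_s: int, rto_target_s: int):
    """Return the cheapest strategy whose assumed RPO and RTO meet the targets."""
    for name, rpo_s, rto_s in DR_STRATEGIES:
        if rpo_s <= rpo_target_s and rto_s <= rto_target_s:
            return name
    return None  # only reached for negative (invalid) targets


# Example: tolerate 1 hour of data loss and 8 hours of downtime.
print(choose_dr_strategy(rpo_target_s=3600, rto_target_s=8 * 3600))
```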

🏗️ Multi-Tier Architecture Solutions

3-Tier Architecture

Internet Gateway
    ↓
Application Load Balancer (Public Subnet)
    ↓
Web Servers (Private Subnet)
    ↓
Database (Private Subnet/Isolated)

Components:

  • Presentation Tier: Web servers, load balancers
  • Application Tier: Application servers, business logic
  • Data Tier: Databases, caching layers

Best Practices:

  • Separate subnets for each tier
  • Restrictive security groups (allow only the traffic each tier needs)
  • Private subnets for sensitive components
  • Multiple AZs for redundancy
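
The subnet-per-tier, multi-AZ layout can be planned with Python's standard `ipaddress` module. This is only a sketch: the VPC CIDR `10.0.0.0/16`, the tier names, and the `/24` subnet size are assumptions for illustration, and `plan_subnets` is a hypothetical helper.

```python
import ipaddress


def plan_subnets(vpc_cidr, tiers, azs, new_prefix=24):
    """Assign one subnet per (tier, AZ) pair, carved sequentially from the VPC range."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)  # iterator over /24 blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(subnets)
    return plan


plan = plan_subnets("10.0.0.0/16", ["web", "app", "db"], ["us-east-1a", "us-east-1b"])
for (tier, az), subnet in plan.items():
    print(f"{tier:<4} {az}  {subnet}")
```

Each tier gets its own subnet in every AZ, so a security group or route table can be scoped to a single tier without overlapping the others.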

🔧 Decoupling Mechanisms

1. Load Balancers

  • Application Load Balancer (ALB):
      • Layer 7 (HTTP/HTTPS)
      • Content-based routing
      • WebSocket support
  • Network Load Balancer (NLB):
      • Layer 4 (TCP/UDP)
      • Ultra-low latency
      • Static IP addresses
  • Classic Load Balancer (CLB):
      • Legacy option
      • Basic load balancing

2. Message Queues

  • Amazon SQS:
      • Standard queues (at-least-once delivery)
      • FIFO queues (exactly-once processing)
      • Dead letter queues
  • Amazon SNS:
      • Pub/Sub messaging
      • Multiple delivery protocols
      • Fan-out patterns
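
At-least-once delivery means a consumer may see the same message twice, and dead letter queues catch messages that keep failing. The sketch below models both ideas in memory; it does not call the SQS API, and `drain`, `max_receive_count`, and the message tuples are assumptions for illustration.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process (id, body) messages idempotently; park repeated failures in a DLQ."""
    pending = deque((msg_id, body, 0) for msg_id, body in messages)
    processed_ids, dead_letters = set(), []
    while pending:
        msg_id, body, receives = pending.popleft()
        if msg_id in processed_ids:
            continue  # duplicate delivery of an already-handled message
        try:
            handler(body)
            processed_ids.add(msg_id)
        except Exception:
            if receives + 1 >= max_receive_count:
                dead_letters.append((msg_id, body))  # give up: move to the DLQ
            else:
                pending.append((msg_id, body, receives + 1))  # redeliver later
    return processed_ids, dead_letters


handled = []


def handler(body):
    if body == "bad":
        raise ValueError("cannot process message")
    handled.append(body)


done, dlq = drain([("1", "a"), ("1", "a"), ("2", "bad"), ("3", "c")], handler)
print(sorted(done), dlq)
```

The duplicate of message `1` is skipped because its ID was already recorded, and message `2` lands in the dead letter list after three failed attempts.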

3. Event-Driven Architecture

  • Amazon EventBridge:
      • Event bus service
      • Custom applications integration
      • SaaS applications integration

4. API Gateway

  • Features:
      • Rate limiting
      • Authentication
      • Request/response transformation
      • Caching
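
Rate limiting is commonly implemented with a token bucket, which is also the model API Gateway throttling is documented to follow (a steady rate plus a burst allowance). The class below is a generic sketch of that technique, not API Gateway's implementation; the `TokenBucket` name and the explicit `now` argument (used to keep the example deterministic) are assumptions.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would typically respond with HTTP 429


bucket = TokenBucket(rate=1, burst=2)
results = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.0)]
print(results)
```

In the example, the first two requests use the burst allowance, the third is rejected, and one more request is admitted after a second of refill.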

🌍 Multi-AZ & Multi-Region Architectures

Multi-AZ Deployment

Region (us-east-1)
├── AZ-1a: Web Server, DB Primary
├── AZ-1b: Web Server, DB Standby
└── AZ-1c: Web Server

Multi-Region Deployment

Primary Region (us-east-1)
├── Full application stack
├── Database primary
└── Active traffic

Secondary Region (us-west-2)
├── Disaster recovery stack
├── Database replica
└── Standby mode

💾 Storage Resilience

Amazon S3

  • Durability: 99.999999999% (11 9's)
  • Availability: 99.99%
  • Cross-Region Replication (CRR)
  • Versioning & MFA Delete
  • Storage Classes:
      • Standard (frequent access)
      • IA (infrequent access)
      • Glacier (archival)

Amazon EBS

  • Snapshots: Point-in-time backups
  • Multi-Attach: Shared storage (Provisioned IOPS io1/io2 volumes only, instances must be in the same AZ)
  • Encryption: At rest and in transit
  • Volume Types: gp3, io2, etc.

Amazon EFS

  • Regional service (multiple AZ)
  • Backup: Automatic backups
  • Replication: Cross-region

🗄️ Database Resilience

Amazon RDS

  • Multi-AZ Deployments:
      • Synchronous replication
      • Automatic failover
      • Backups taken from the standby
  • Read Replicas:
      • Asynchronous replication
      • Read scaling
      • Cross-region replicas

Amazon DynamoDB

  • Global Tables: Multi-region replication
  • Point-in-Time Recovery: 35 days
  • On-Demand Backup: Full backups
  • Auto Scaling: Based on demand

🔄 Auto Scaling Strategies

EC2 Auto Scaling

Auto Scaling Group:
  MinSize: 2
  DesiredCapacity: 4
  MaxSize: 10

Scaling Policies:
  - Target CPU > 70%: Scale Out
  - Target CPU < 30%: Scale In

Health Checks:
  - EC2 Status Check
  - ELB Health Check
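
A rough way to reason about how a group sizes itself is a proportional model: scale the current capacity by the ratio of the observed metric to the target, then clamp to the group limits. This is a simplified sketch, not the exact algorithm EC2 Auto Scaling uses; `desired_capacity` is a hypothetical helper, and the 70% target reuses the threshold from the policy above.

```python
import math


def desired_capacity(current, metric, target, min_size, max_size):
    """Scale capacity proportionally to metric/target, clamped to [min_size, max_size]."""
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))


# Group limits taken from the example above: MinSize 2, MaxSize 10.
print(desired_capacity(current=4, metric=90, target=70, min_size=2, max_size=10))
print(desired_capacity(current=4, metric=20, target=70, min_size=2, max_size=10))
print(desired_capacity(current=8, metric=100, target=70, min_size=2, max_size=10))
```

Rounding up favors availability over cost: the group adds an instance rather than running slightly above the target.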

Application Auto Scaling

  • DynamoDB: Read/Write capacity
  • ECS: Service tasks
  • Lambda: Provisioned concurrency

📊 Monitoring & Health Checks

Amazon CloudWatch

  • Metrics: CPU, network, and disk I/O by default; memory and disk space usage on EC2 require the CloudWatch agent
  • Alarms: Threshold-based notifications
  • Dashboards: Visual monitoring
  • Logs: Centralized logging

AWS Systems Manager

  • Session Manager: Secure shell access
  • Parameter Store: Configuration management
  • Patch Manager: OS patching

Health Checks

  • Route 53 Health Checks:
      • Endpoint monitoring
      • Calculated health checks
      • DNS failover
  • ELB Health Checks:
      • HTTP/HTTPS checks
      • TCP health checks
      • Custom health check paths
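
Health checks normally change a target's state only after several consecutive results, so a single slow response does not trigger failover. The sketch below models that threshold logic; `HealthTracker` and its default thresholds are assumptions for illustration, not values taken from ELB or Route 53.

```python
class HealthTracker:
    """Flip state only after enough consecutive results that contradict it."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive results disagreeing with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state: reset the streak
        else:
            self.streak += 1
            needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self.streak >= needed:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy


tracker = HealthTracker()
states = [tracker.record(r) for r in (False, False, True, True, True)]
print(states)
```

Two consecutive failures mark the target unhealthy, and it takes three consecutive passes to bring it back into service.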

🛡️ Resilience Patterns

1. Circuit Breaker

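# Minimal sketch: all four arguments are supplied by the caller
def guarded_call(call_service, failure_rate, threshold, cached_response):
    """Serve a cached fallback while the dependency is failing too often."""
    if failure_rate > threshold:
        return cached_response  # circuit open: skip the failing dependency
    return call_service()  # circuit closed: call the service normally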
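
The check above captures the core idea; a fuller circuit breaker also tracks when it opened so it can probe the dependency again after a cool-down (the half-open state). The following is one possible sketch, with `CircuitBreaker`, its thresholds, and the explicit `now` clock argument all being assumptions for illustration.

```python
class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open trial after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback, now: float):
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            return fallback()  # open: fail fast without touching the service
        # Closed, or half-open after the timeout: let the call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def failing():
    raise ConnectionError("service down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
outcomes = [
    breaker.call(failing, lambda: "cached", now=0.0),
    breaker.call(failing, lambda: "cached", now=1.0),        # circuit opens here
    breaker.call(lambda: "live", lambda: "cached", now=2.0),   # still open
    breaker.call(lambda: "live", lambda: "cached", now=12.0),  # half-open trial
]
print(outcomes)
```

Passing the time in explicitly keeps the example deterministic; a real implementation would more likely read a monotonic clock internally.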

2. Retry with Exponential Backoff

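import random
import time

def call_with_retry(call_service, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return call_service()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Exponential backoff (2 ** attempt, not 2 ^ attempt) plus random jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))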

3. Bulkhead Isolation

  • Separate resources for different functions
  • Prevent cascade failures
  • Independent scaling
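
One common way to realize a bulkhead inside a single process is to give each dependency its own bounded pool of slots, so that a slow dependency can exhaust only its own pool. The sketch below uses a semaphore for this; the `Bulkhead` class and the choice to reject rather than queue excess calls are assumptions for illustration.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args):
        # Reject immediately instead of queueing when every slot is busy.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return func(*args)
        finally:
            self._slots.release()


payments = Bulkhead(max_concurrent=1)
search = Bulkhead(max_concurrent=1)


def nested_payment_call():
    """Try a second payment call while the only payment slot is already taken."""
    try:
        return payments.run(lambda: "accepted")
    except RuntimeError:
        return "rejected"


print(payments.run(nested_payment_call))               # payment pool is exhausted
print(payments.run(lambda: search.run(lambda: "ok")))  # search pool is unaffected
```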

🎯 Common Scenarios & Solutions

Scenario 1: Web Application HA

Requirement: 99.9% availability

Solution:

  • Multi-AZ ALB
  • Auto Scaling Group (min 2 instances)
  • RDS Multi-AZ
  • S3 static content
  • CloudFront CDN

Scenario 2: API Service Resilience

Requirement: Handle traffic spikes

Solution:

  • API Gateway with caching
  • Lambda functions
  • DynamoDB auto scaling
  • SQS queue for async processing

Scenario 3: Data Processing Pipeline

Requirement: Process large datasets reliably

Solution:

  • S3 event notifications
  • SQS/SNS for decoupling
  • Lambda/ECS for processing
  • DLQ for failed messages
  • CloudWatch monitoring

📝 Best Practices Checklist

Design Principles

  • [ ] Design for failure
  • [ ] Implement redundancy
  • [ ] Use managed services
  • [ ] Decouple components
  • [ ] Implement graceful degradation

Architecture Review

  • [ ] Single points of failure identified
  • [ ] Multiple AZs utilized
  • [ ] Auto scaling configured
  • [ ] Health checks implemented
  • [ ] Monitoring and alerting set up

Testing

  • [ ] Chaos engineering
  • [ ] Disaster recovery drills
  • [ ] Load testing
  • [ ] Failover testing

🔍 Practice Questions

  1. Multi-AZ RDS: When should you use Multi-AZ versus Read Replicas?
  2. Auto Scaling: How do you configure predictive scaling?
  3. Load Balancer: When should you choose ALB, NLB, or GWLB (Gateway Load Balancer)?
  4. Storage: What are the use cases for S3 Cross-Region Replication?
  5. Disaster Recovery: How do the DR strategies compare?

📖 Further Reading