Reliability & Availability - Building Fault-Tolerant Systems
Overview
Reliability is the ability of a system to perform correctly for a specific time duration. Availability is the percentage of time a system is operational. These two concepts are fundamental to building production systems that handle failures gracefully.
Core Definitions
Reliability
The probability that a system performs correctly for a specific duration under specific conditions.
Reliability = MTBF / (MTBF + MTTR)
Where:
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Repair
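The formula can be sanity-checked with a few lines of Python (the `reliability` helper is illustrative, not from any library):

```python
def reliability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state reliability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server that fails every 1000 hours and takes 1 hour to repair:
print(round(reliability(1000, 1), 4))  # 0.999, i.e. "three nines"
```

Note that improving MTTR (repairing faster) raises this figure just as effectively as improving MTBF (failing less often).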
Availability
Percentage of time the system is operational and accessible.
Availability = Uptime / (Uptime + Downtime)
Example:
- 99.9% = 8.76 hours downtime/year
- 99.99% = 52.56 minutes downtime/year
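These downtime budgets follow directly from the formula; a small illustrative helper makes the arithmetic explicit:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Downtime in minutes per year for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - availability_pct / 100) * minutes_per_year

print(downtime_per_year(99.9))   # ~525.6 minutes (~8.76 hours)
print(downtime_per_year(99.99))  # ~52.56 minutes
```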
The CAP Theorem
CAP Theorem: in a distributed system, you can guarantee at most 2 of the 3:
🔄 Consistency (C)
All nodes see same data simultaneously.
Strong Consistency Example:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node A │ │ Node B │ │ Node C │
│ balance:100 │ ── │ balance:100 │ ── │ balance:100 │
└─────────────┘ └─────────────┘ └─────────────┘
Write to Node A: balance = 50
↓
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node A │ │ Node B │ │ Node C │
│ balance: 50 │ ── │ balance: 50 │ ── │ balance: 50 │
└─────────────┘ └─────────────┘ └─────────────┘
All nodes updated before responding to client
🚀 Availability (A)
System remains operational despite failures.
High Availability Setup:
┌─────────────┐ ┌─────────────┐
│ Primary │ │ Backup │
│ Server │───▶│ Server │
│ (Active) │ │ (Standby) │
└─────────────┘ └─────────────┘
If Primary fails:
┌─────────────┐ ┌─────────────┐
│ Primary │ │ Backup │
│ Server │ │ Server │
│ (DOWN) │ │ (Active) │
└─────────────┘ └─────────────┘
🌐 Partition Tolerance (P)
System continues despite network partitions.
Network Partition Scenario:
┌─────────────┐ │ ┌─────────────┐
│ Node A │ ─ ─ ─ ─ │ ─ ─ ─ ─ │ Node B │
│ │ Network│Partition │ │
└─────────────┘ │ └─────────────┘
Nodes cannot communicate but continue serving requests
CAP Theorem Trade-offs
CP Systems (Consistency + Partition Tolerance)
Examples: MongoDB, Redis Cluster, HBase
Trade-off: Sacrifice availability during partitions
Use case: Banking systems, financial transactions
Scenario:
- Network partition occurs
- System becomes unavailable rather than serving inconsistent data
- Better to be down than wrong
AP Systems (Availability + Partition Tolerance)
Examples: Cassandra, DynamoDB, CouchDB
Trade-off: Sacrifice consistency during partitions
Use case: Social media, content delivery
Scenario:
- Network partition occurs
- Each partition continues serving requests
- Data may be inconsistent until partition heals
CA Systems (Consistency + Availability)
Examples: Traditional RDBMS (MySQL, PostgreSQL)
Reality: Not truly distributed, single point of failure
Use case: Single-datacenter applications
Note: In distributed systems, network partitions will happen,
so true CA systems don't exist at scale
ACID Properties
⚛️ Atomicity
All operations in a transaction succeed, or all of them fail.
-- Bank Transfer Example
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- Sender
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- Receiver
COMMIT;
-- If any operation fails, entire transaction rolls back
-- No partial state: either both accounts updated or neither
🔒 Consistency
Database remains in valid state after transaction.
-- Constraint: account balance >= 0
-- This transaction should fail (the account only holds 50):
UPDATE accounts SET balance = balance - 1000 WHERE balance = 50;
-- It violates the consistency rule, so the transaction is rejected
🔐 Isolation
Concurrent transactions don't interfere with each other.
-- Transaction A and B run simultaneously
-- Transaction A: Transfer $100 from Account 1 to Account 2
-- Transaction B: Check balance of Account 1
-- Isolation ensures Transaction B sees either:
-- 1. Balance before transfer, OR
-- 2. Balance after transfer
-- Never sees intermediate state
💾 Durability
Committed transactions persist even if system crashes.
-- After COMMIT completes:
COMMIT;
-- Data guaranteed to survive:
-- - Power failures
-- - System crashes
-- - Hardware failures
Consistency Models
Strong Consistency
All reads receive most recent write.
Timeline:
Write X=1 ────────────────▶ All nodes updated
│ │
└─ Read X returns 1 ────────┘
Pros: Simple to reason about
Cons: Higher latency, lower availability
Eventual Consistency
System will become consistent over time.
Timeline:
Write X=1 ─────┐
▼
Node A: X=1 │ Node B: X=0 Node C: X=0
│ │ │
└─────────▼──────────────▼
Eventually...
Node A: X=1 Node B: X=1 Node C: X=1
Pros: Higher availability, better performance
Cons: Temporary inconsistency possible
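A toy sketch of how replicas converge over time (class and method names are invented for illustration): each replica keeps a `(value, version)` pair and adopts a peer's value when the peer's version is newer.

```python
class Replica:
    def __init__(self):
        self.value, self.version = None, 0

    def write(self, value):
        self.version += 1
        self.value = value

    def sync_from(self, peer):
        # Anti-entropy: adopt the peer's state if it is newer
        if peer.version > self.version:
            self.value, self.version = peer.value, peer.version

a, b, c = Replica(), Replica(), Replica()
a.write("X=1")   # at first, only node A has the write
b.sync_from(a)   # repeated sync rounds propagate it...
c.sync_from(b)
print(a.value, b.value, c.value)  # all replicas eventually agree
```

Between the write and the sync rounds, a read from node C would return the old value, which is exactly the temporary inconsistency named above.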
Weak Consistency
No guarantee of when all nodes will become consistent.
Use cases:
- Gaming leaderboards
- Social media likes/views
- Real-time analytics
Example: YouTube view count may vary across regions
Fault Tolerance Patterns
1. Redundancy
Active-Passive (Hot Standby)
┌─────────────┐ ┌─────────────┐
│ Primary │───▶│ Backup │
│ Database │ │ Database │
│ (Active) │ │ (Standby) │
└─────────────┘ └─────────────┘
│
▼
All Traffic
Failover time: 30-120 seconds
Active-Active (Load Sharing)
┌─────────────┐ ┌─────────────┐
│ Database │◀──▶│ Database │
│ Node 1 │ │ Node 2 │
│ (Active) │ │ (Active) │
└─────────────┘ └─────────────┘
│ │
▼ ▼
50% Traffic 50% Traffic
Failover time: Near-instant
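Why is active-active failover near-instant? No standby needs to be promoted; the router simply skips unhealthy nodes. A minimal illustrative sketch (the `Router` class is invented for this example):

```python
import itertools

class Router:
    def __init__(self, nodes):
        self.nodes = nodes
        self.healthy = {n: True for n in nodes}
        self._cycle = itertools.cycle(nodes)

    def route(self):
        # Round-robin over nodes, skipping any marked unhealthy
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes")

r = Router(["node1", "node2"])
r.healthy["node1"] = False  # node1 fails its health check
print(r.route())            # traffic immediately goes to node2
```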
2. Circuit Breaker Pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout            # seconds to stay OPEN before probing
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"             # CLOSED, OPEN, HALF_OPEN

    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # let one probe request through
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
3. Bulkhead Pattern
Isolate critical resources to prevent cascading failures.
Thread Pool Isolation:
┌─────────────────────────────────────────────────────┐
│ Application Server │
├─────────────┬─────────────┬─────────────┬──────────┤
│ User │ Payment │ Search │ Admin │
│ Services │ Services │ Services │ Services │
│ (Pool: 20) │ (Pool: 10) │ (Pool: 15) │(Pool: 5) │
└─────────────┴─────────────┴─────────────┴──────────┘
If the Payment service hangs, it only exhausts its own pool;
the other services continue functioning.
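A minimal Python sketch of the bulkhead idea using one thread pool per service (pool sizes mirror the diagram above; the `submit` helper is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated pool per service: a slow "payment" call
# cannot consume threads reserved for "search" or "user".
pools = {
    "user":    ThreadPoolExecutor(max_workers=20),
    "payment": ThreadPoolExecutor(max_workers=10),
    "search":  ThreadPoolExecutor(max_workers=15),
    "admin":   ThreadPoolExecutor(max_workers=5),
}

def submit(service, func, *args):
    """Run func on the pool reserved for its service."""
    return pools[service].submit(func, *args)

future = submit("search", lambda q: f"results for {q}", "laptops")
print(future.result())
```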
4. Retry with Exponential Backoff
import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)

# Usage
result = retry_with_backoff(
    lambda: api_call(),
    max_retries=3,
    base_delay=1
)
High Availability Architecture
Multi-Region Setup
┌─────────────────────────────────────────────────────┐
│ Global Setup │
├─────────────────┬───────────────────┬───────────────┤
│ US-East │ US-West │ EU-West │
│ │ │ │
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │Load Balancer│ │ │Load Balancer│ ││Load Balancer││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │App Servers │ │ │App Servers │ ││App Servers ││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │ Database │ │ │ Database │ ││ Database ││
│ │ (Primary) │ │ │ (Replica) │ ││ (Replica) ││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
└─────────────────┴───────────────────┴───────────────┘
Database High Availability
Master-Slave Replication
-- Master Database (Read/Write)
INSERT INTO users (name, email) VALUES ('John', 'john@example.com');
-- Automatically replicated to slaves
-- Slave Databases (Read-only)
SELECT * FROM users WHERE name = 'John';
Benefits:
- Read scaling
- Backup/disaster recovery
- Geographic distribution
Drawbacks:
- Write bottleneck at master
- Replication lag possible
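Read scaling is usually implemented by routing in the application layer. A hedged sketch, with plain strings standing in for real connection objects (not any real driver API):

```python
import random

class RoutingPool:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def connection_for(self, sql: str):
        # Naive rule: anything that is not a SELECT goes to the master
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas)
        return self.master

pool = RoutingPool("master-db", ["replica-1", "replica-2"])
print(pool.connection_for("INSERT INTO users ..."))  # master-db
print(pool.connection_for("SELECT * FROM users"))    # one of the replicas
```

Because of replication lag, a SELECT routed to a replica may briefly miss a write just made on the master; read-your-own-writes routing is a common refinement.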
Master-Master Replication
-- Both databases accept writes
-- Conflict resolution needed
-- Database A
UPDATE users SET email = 'john.doe@example.com' WHERE id = 1;
-- Database B (simultaneously)
UPDATE users SET email = 'j.doe@example.com' WHERE id = 1;
-- Conflict! Need resolution strategy:
-- 1. Last-write-wins
-- 2. Application-level resolution
-- 3. Manual intervention
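Last-write-wins, the first strategy above, can be sketched in a few lines (the timestamps are invented Unix epochs for illustration):

```python
def last_write_wins(write_a, write_b):
    """Each write is a (value, timestamp) pair; keep the newer one."""
    return write_a if write_a[1] >= write_b[1] else write_b

a = ("john.doe@example.com", 1700000001)  # written on Database A
b = ("j.doe@example.com",    1700000005)  # written on Database B, later
print(last_write_wins(a, b)[0])  # j.doe@example.com wins
```

The trade-off: the write on Database A is silently discarded, which is why banking-style systems prefer application-level resolution instead.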
Disaster Recovery
Recovery Metrics
RTO (Recovery Time Objective)
Maximum tolerable time to restore service after disaster.
Examples:
- Critical systems: RTO = 15 minutes
- Important systems: RTO = 4 hours
- Non-critical systems: RTO = 24 hours
RPO (Recovery Point Objective)
Maximum tolerable data loss, measured as a window of time.
Examples:
- Financial systems: RPO = 0 (no data loss)
- E-commerce: RPO = 15 minutes
- Analytics: RPO = 24 hours
Backup Strategies
Full Backup
# Complete backup of entire system
mysqldump --all-databases > full_backup_2024_01_01.sql
Pros: Complete restore capability
Cons: Large size, long backup time
Frequency: Weekly/Monthly
Incremental Backup
# Only changes since last backup
rsync --backup --backup-dir=backup_$(date +%Y%m%d) /data/ /backup/
Pros: Fast, smaller size
Cons: Complex restore (need all increments)
Frequency: Daily/Hourly
Differential Backup
# Changes since last full backup
tar -czf diff_backup_$(date +%Y%m%d).tar.gz --newer-mtime="2024-01-01" /data/
Pros: Faster restore than incremental
Cons: Larger than incremental
Frequency: Daily
Real-World Examples
Netflix - Chaos Engineering
Approach: Deliberately introduce failures to test resilience
Tools:
- Chaos Monkey: Randomly kills instances
- Chaos Kong: Kills entire AWS regions
- Latency Monkey: Introduces artificial delays
Results:
- 99.99% availability
- Graceful degradation during failures
- Proactive identification of weaknesses
Amazon - Cell-based Architecture
Problem: Single database serving millions of customers
Solution: Partition customers into isolated "cells"
Architecture:
Cell 1: Customers 1-100K │ Cell 2: Customers 100K-200K
┌─────────────┐ │ ┌─────────────┐
│App Servers │ │ │App Servers │
├─────────────┤ │ ├─────────────┤
│ Database │ │ │ Database │
└─────────────┘ │ └─────────────┘
Benefits:
- Failure isolation (blast radius)
- Independent scaling
- Easier troubleshooting
Google - Multi-Region Spanner
Challenge: Global consistency with high availability
Solution: TrueTime API + Paxos consensus
Architecture:
- Data replicated across multiple regions
- Synchronous replication with consensus
- External consistency guarantees
Trade-offs:
- Higher latency for writes
- Complex implementation
- Higher infrastructure costs
Monitoring & Alerting
Key Metrics
# Application Metrics
- Response time percentiles (p50, p95, p99)
- Error rates (4xx, 5xx)
- Throughput (requests/second)
- Availability percentage
# Infrastructure Metrics
- CPU/Memory utilization
- Disk space/IOPS
- Network bandwidth
- Database connections
# Business Metrics
- Revenue impact
- User experience score
- Customer support tickets
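The latency percentiles listed above can be computed from raw samples with a simple nearest-rank approximation (illustrative, not a production estimator):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 500, 15]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Note how the tail percentiles expose the two slow requests that the p50 completely hides, which is why alerting on averages alone is misleading.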
Alerting Best Practices
# Alert Fatigue Prevention
class BreachCounter:
    """Alert only after N consecutive breaches to reduce noise."""
    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required = required_breaches
        self.streak = 0

    def should_alert(self, metric_value):
        if metric_value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # a healthy reading resets the counter
        return self.streak >= self.required
# Alert Escalation
alert_levels = {
'warning': {'threshold': 80, 'notify': ['team-slack']},
'critical': {'threshold': 95, 'notify': ['team-slack', 'pagerduty']},
'emergency': {'threshold': 99, 'notify': ['team-slack', 'pagerduty', 'management']}
}
Best Practices
1. Design for Failure
- ✅ Assume components will fail
- ✅ Plan for graceful degradation
- ✅ Implement circuit breakers
- ✅ Use timeouts and retries
2. Measure Everything
- ✅ Application performance metrics
- ✅ Infrastructure health metrics
- ✅ Business impact metrics
- ✅ User experience metrics
3. Test Failure Scenarios
- ✅ Chaos engineering practices
- ✅ Disaster recovery drills
- ✅ Load testing under failure
- ✅ Security penetration testing
4. Automate Recovery
- ✅ Auto-healing systems
- ✅ Automated failover
- ✅ Self-scaling infrastructure
- ✅ Automated deployment rollbacks
Next Steps
- 📚 Study Performance & Latency - Metrics and optimization
- 🎯 Learn Consistency Patterns - Strong vs eventual consistency
- 🏗️ Practice Failure Scenarios - Chaos engineering
- 💻 Implement Circuit Breakers - Fault tolerance patterns
- 📖 Explore Monitoring Tools - Prometheus, Grafana, alerting
"Reliability is not about preventing failures - it's about building systems that continue working when failures inevitably occur."