Reliability & Availability - Building Fault-Tolerant Systems
Overview
Reliability is the ability of a system to perform correctly for a specific time duration. Availability is the percentage of time a system is operational. These two concepts are fundamental to building production systems that handle failures gracefully.
Core Definitions
Reliability
The probability that a system performs correctly for a specific duration under specific conditions.
Reliability = MTBF / (MTBF + MTTR)
Where:
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Repair
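The formula can be sanity-checked with a few lines of Python (the `reliability` helper is illustrative, not from any library):

```python
def reliability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state reliability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server that fails every 1000 hours and takes 1 hour to repair:
print(round(reliability(1000, 1), 4))  # 0.999, i.e. "three nines"
```

Note that improving MTTR (repairing faster) raises this figure just as effectively as improving MTBF (failing less often).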
Availability
Percentage of time the system is operational and accessible.
Availability = Uptime / (Uptime + Downtime)
Example:
- 99.9% = 8.76 hours downtime/year
- 99.99% = 52.56 minutes downtime/year
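These downtime budgets follow directly from the formula; a small illustrative helper makes the arithmetic explicit:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Downtime in minutes per year for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - availability_pct / 100) * minutes_per_year

print(downtime_per_year(99.9))   # ~525.6 minutes (~8.76 hours)
print(downtime_per_year(99.99))  # ~52.56 minutes
```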
The CAP Theorem
CAP Theorem: in a distributed system, you can guarantee at most 2 of the 3:
🔄 Consistency (C)
All nodes see same data simultaneously.
Strong Consistency Example:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node A │ │ Node B │ │ Node C │
│ balance:100 │ ── │ balance:100 │ ── │ balance:100 │
└─────────────┘ └─────────────┘ └─────────────┘
Write to Node A: balance = 50
↓
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node A │ │ Node B │ │ Node C │
│ balance: 50 │ ── │ balance: 50 │ ── │ balance: 50 │
└─────────────┘ └─────────────┘ └─────────────┘
All nodes updated before responding to client
🚀 Availability (A)
System remains operational despite failures.
High Availability Setup:
┌─────────────┐ ┌─────────────┐
│ Primary │ │ Backup │
│ Server │───▶│ Server │
│ (Active) │ │ (Standby) │
└─────────────┘ └─────────────┘
If Primary fails:
┌─────────────┐ ┌─────────────┐
│ Primary │ │ Backup │
│ Server │ │ Server │
│ (DOWN) │ │ (Active) │
└─────────────┘ └─────────────┘
🌐 Partition Tolerance (P)
System continues despite network partitions.
Network Partition Scenario:
┌─────────────┐ │ ┌─────────────┐
│ Node A │ ─ ─ ─ ─ │ ─ ─ ─ ─ │ Node B │
│ │ Network│Partition │ │
└─────────────┘ │ └─────────────┘
Nodes cannot communicate but continue serving requests
CAP Theorem Trade-offs
CP Systems (Consistency + Partition Tolerance)
Examples: MongoDB, Redis Cluster, HBase
Trade-off: Sacrifice availability during partitions
Use case: Banking systems, financial transactions
Scenario:
- Network partition occurs
- System becomes unavailable rather than serving inconsistent data
- Better to be down than wrong
AP Systems (Availability + Partition Tolerance)
Examples: Cassandra, DynamoDB, CouchDB
Trade-off: Sacrifice consistency during partitions
Use case: Social media, content delivery
Scenario:
- Network partition occurs
- Each partition continues serving requests
- Data may be inconsistent until partition heals
CA Systems (Consistency + Availability)
Examples: Traditional RDBMS (MySQL, PostgreSQL)
Reality: Not truly distributed, single point of failure
Use case: Single-datacenter applications
Note: In distributed systems, network partitions will happen,
so true CA systems don't exist at scale
ACID Properties
⚛️ Atomicity
All operations in a transaction succeed, or all of them fail.
-- Bank Transfer Example
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- Sender
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- Receiver
COMMIT;
-- If any operation fails, entire transaction rolls back
-- No partial state: either both accounts updated or neither
🔒 Consistency
Database remains in valid state after transaction.
-- Constraint: account balance >= 0
-- This transaction should fail (the account only holds 50):
UPDATE accounts SET balance = balance - 1000 WHERE balance = 50;
-- It violates the consistency rule, so the transaction is rejected
🔐 Isolation
Concurrent transactions don't interfere with each other.
-- Transaction A and B run simultaneously
-- Transaction A: Transfer $100 from Account 1 to Account 2
-- Transaction B: Check balance of Account 1
-- Isolation ensures Transaction B sees either:
-- 1. Balance before transfer, OR
-- 2. Balance after transfer
-- Never sees intermediate state
💾 Durability
Committed transactions persist even if system crashes.
-- After COMMIT completes:
COMMIT;
-- Data guaranteed to survive:
-- - Power failures
-- - System crashes
-- - Hardware failures
Consistency Models
Strong Consistency
All reads receive most recent write.
Timeline:
Write X=1 ────────────────▶ All nodes updated
│ │
└─ Read X returns 1 ────────┘
Pros: Simple to reason about
Cons: Higher latency, lower availability
Eventual Consistency
System will become consistent over time.
Timeline:
Write X=1 ─────┐
▼
Node A: X=1 │ Node B: X=0 Node C: X=0
│ │ │
└─────────▼──────────────▼
Eventually...
Node A: X=1 Node B: X=1 Node C: X=1
Pros: Higher availability, better performance
Cons: Temporary inconsistency possible
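A toy sketch of how replicas converge over time (class and method names are invented for illustration): each replica keeps a `(value, version)` pair and adopts a peer's value when the peer's version is newer.

```python
class Replica:
    def __init__(self):
        self.value, self.version = None, 0

    def write(self, value):
        self.version += 1
        self.value = value

    def sync_from(self, peer):
        # Anti-entropy: adopt the peer's state if it is newer
        if peer.version > self.version:
            self.value, self.version = peer.value, peer.version

a, b, c = Replica(), Replica(), Replica()
a.write("X=1")   # at first, only node A has the write
b.sync_from(a)   # repeated sync rounds propagate it...
c.sync_from(b)
print(a.value, b.value, c.value)  # all replicas eventually agree
```

Between the write and the sync rounds, a read from node C would return the old value, which is exactly the temporary inconsistency named above.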
Weak Consistency
No guarantee of when all nodes will become consistent.
Use cases:
- Gaming leaderboards
- Social media likes/views
- Real-time analytics
Example: YouTube view count may vary across regions
Fault Tolerance Patterns
1. Redundancy
Active-Passive (Hot Standby)
┌─────────────┐ ┌─────────────┐
│ Primary │───▶│ Backup │
│ Database │ │ Database │
│ (Active) │ │ (Standby) │
└─────────────┘ └─────────────┘
│
▼
All Traffic
Failover time: 30-120 seconds
Active-Active (Load Sharing)
┌─────────────┐ ┌─────────────┐
│ Database │◀──▶│ Database │
│ Node 1 │ │ Node 2 │
│ (Active) │ │ (Active) │
└─────────────┘ └─────────────┘
│ │
▼ ▼
50% Traffic 50% Traffic
Failover time: Near-instant
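Why is active-active failover near-instant? No standby needs to be promoted; the router simply skips unhealthy nodes. A minimal illustrative sketch (the `Router` class is invented for this example):

```python
import itertools

class Router:
    def __init__(self, nodes):
        self.nodes = nodes
        self.healthy = {n: True for n in nodes}
        self._cycle = itertools.cycle(nodes)

    def route(self):
        # Round-robin over nodes, skipping any marked unhealthy
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes")

r = Router(["node1", "node2"])
r.healthy["node1"] = False  # node1 fails its health check
print(r.route())            # traffic immediately goes to node2
```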
2. Circuit Breaker Pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout            # seconds to stay OPEN before probing
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "CLOSED"             # CLOSED, OPEN, HALF_OPEN

    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # let one probe request through
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
3. Bulkhead Pattern
Isolate critical resources to prevent cascading failures.
Thread Pool Isolation:
┌─────────────────────────────────────────────────────┐
│ Application Server │
├─────────────┬─────────────┬─────────────┬──────────┤
│ User │ Payment │ Search │ Admin │
│ Services │ Services │ Services │ Services │
│ (Pool: 20) │ (Pool: 10) │ (Pool: 15) │(Pool: 5) │
└─────────────┴─────────────┴─────────────┴──────────┘
If the Payment service hangs, it only exhausts its own pool;
the other services continue functioning.
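A minimal Python sketch of the bulkhead idea using one thread pool per service (pool sizes mirror the diagram above; the `submit` helper is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated pool per service: a slow "payment" call
# cannot consume threads reserved for "search" or "user".
pools = {
    "user":    ThreadPoolExecutor(max_workers=20),
    "payment": ThreadPoolExecutor(max_workers=10),
    "search":  ThreadPoolExecutor(max_workers=15),
    "admin":   ThreadPoolExecutor(max_workers=5),
}

def submit(service, func, *args):
    """Run func on the pool reserved for its service."""
    return pools[service].submit(func, *args)

future = submit("search", lambda q: f"results for {q}", "laptops")
print(future.result())
```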
4. Retry with Exponential Backoff
import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)

# Usage
result = retry_with_backoff(
    lambda: api_call(),
    max_retries=3,
    base_delay=1
)
High Availability Architecture
Multi-Region Setup
┌─────────────────────────────────────────────────────┐
│ Global Setup │
├─────────────────┬───────────────────┬───────────────┤
│ US-East │ US-West │ EU-West │
│ │ │ │
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │Load Balancer│ │ │Load Balancer│ ││Load Balancer││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │App Servers │ │ │App Servers │ ││App Servers ││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
│ ┌─────────────┐ │ ┌─────────────┐ │┌─────────────┐│
│ │ Database │ │ │ Database │ ││ Database ││
│ │ (Primary) │ │ │ (Replica) │ ││ (Replica) ││
│ └─────────────┘ │ └─────────────┘ │└─────────────┘│
└─────────────────┴───────────────────┴───────────────┘
Database High Availability
Master-Slave Replication
-- Master Database (Read/Write)
INSERT INTO users (name, email) VALUES ('John', 'john@example.com');
-- Automatically replicated to slaves
-- Slave Databases (Read-only)
SELECT * FROM users WHERE name = 'John';
Benefits:
- Read scaling
- Backup/disaster recovery
- Geographic distribution
Drawbacks:
- Write bottleneck at master
- Replication lag possible
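Read scaling is usually implemented by routing in the application layer. A hedged sketch, with plain strings standing in for real connection objects (not any real driver API):

```python
import random

class RoutingPool:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def connection_for(self, sql: str):
        # Naive rule: anything that is not a SELECT goes to the master
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas)
        return self.master

pool = RoutingPool("master-db", ["replica-1", "replica-2"])
print(pool.connection_for("INSERT INTO users ..."))  # master-db
print(pool.connection_for("SELECT * FROM users"))    # one of the replicas
```

Because of replication lag, a SELECT routed to a replica may briefly miss a write just made on the master; read-your-own-writes routing is a common refinement.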
Master-Master Replication
-- Both databases accept writes
-- Conflict resolution needed
-- Database A
UPDATE users SET email = 'john.doe@example.com' WHERE id = 1;
-- Database B (simultaneously)
UPDATE users SET email = 'j.doe@example.com' WHERE id = 1;
-- Conflict! Need resolution strategy:
-- 1. Last-write-wins
-- 2. Application-level resolution
-- 3. Manual intervention
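Last-write-wins, the first strategy above, can be sketched in a few lines (the timestamps are invented Unix epochs for illustration):

```python
def last_write_wins(write_a, write_b):
    """Each write is a (value, timestamp) pair; keep the newer one."""
    return write_a if write_a[1] >= write_b[1] else write_b

a = ("john.doe@example.com", 1700000001)  # written on Database A
b = ("j.doe@example.com",    1700000005)  # written on Database B, later
print(last_write_wins(a, b)[0])  # j.doe@example.com wins
```

The trade-off: the write on Database A is silently discarded, which is why banking-style systems prefer application-level resolution instead.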
Disaster Recovery
Recovery Metrics
RTO (Recovery Time Objective)
Maximum tolerable time to restore service after disaster.
Examples:
- Critical systems: RTO = 15 minutes
- Important systems: RTO = 4 hours
- Non-critical systems: RTO = 24 hours
RPO (Recovery Point Objective)
Maximum tolerable data loss, measured as a window of time.
Examples:
- Financial systems: RPO = 0 (no data loss)
- E-commerce: RPO = 15 minutes
- Analytics: RPO = 24 hours
Backup Strategies
Full Backup
# Complete backup of entire system
mysqldump --all-databases > full_backup_2024_01_01.sql
Pros: Complete restore capability
Cons: Large size, long backup time
Frequency: Weekly/Monthly
Incremental Backup
# Only changes since last backup
rsync --backup --backup-dir=backup_$(date +%Y%m%d) /data/ /backup/
Pros: Fast, smaller size
Cons: Complex restore (need all increments)
Frequency: Daily/Hourly
Differential Backup
# Changes since last full backup
tar -czf diff_backup_$(date +%Y%m%d).tar.gz --newer-mtime="2024-01-01" /data/
Pros: Faster restore than incremental
Cons: Larger than incremental
Frequency: Daily
Real-World Examples
Netflix - Chaos Engineering
Approach: Deliberately introduce failures to test resilience
Tools:
- Chaos Monkey: Randomly kills instances
- Chaos Kong: Kills entire AWS regions
- Latency Monkey: Introduces artificial delays
Results:
- 99.99% availability
- Graceful degradation during failures
- Proactive identification of weaknesses
Amazon - Cell-based Architecture
Problem: Single database serving millions of customers
Solution: Partition customers into isolated "cells"
Architecture:
Cell 1: Customers 1-100K │ Cell 2: Customers 100K-200K
┌─────────────┐ │ ┌─────────────┐
│App Servers │ │ │App Servers │
├─────────────┤ │ ├─────────────┤
│ Database │ │ │ Database │
└─────────────┘ │ └─────────────┘
Benefits:
- Failure isolation (blast radius)
- Independent scaling
- Easier troubleshooting
Google - Multi-Region Spanner
Challenge: Global consistency with high availability
Solution: TrueTime API + Paxos consensus
Architecture:
- Data replicated across multiple regions
- Synchronous replication with consensus
- External consistency guarantees
Trade-offs:
- Higher latency for writes
- Complex implementation
- Higher infrastructure costs
Monitoring & Alerting
Key Metrics
# Application Metrics
- Response time percentiles (p50, p95, p99)
- Error rates (4xx, 5xx)
- Throughput (requests/second)
- Availability percentage
# Infrastructure Metrics
- CPU/Memory utilization
- Disk space/IOPS
- Network bandwidth
- Database connections
# Business Metrics
- Revenue impact
- User experience score
- Customer support tickets
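The latency percentiles listed above can be computed from raw samples with a simple nearest-rank approximation (illustrative, not a production estimator):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 500, 15]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Note how the tail percentiles expose the two slow requests that the p50 completely hides, which is why alerting on averages alone is misleading.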
Alerting Best Practices
# Alert Fatigue Prevention
class BreachCounter:
    """Alert only after N consecutive breaches to reduce noise."""
    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required = required_breaches
        self.streak = 0

    def should_alert(self, metric_value):
        if metric_value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # a healthy reading resets the counter
        return self.streak >= self.required
# Alert Escalation
alert_levels = {
'warning': {'threshold': 80, 'notify': ['team-slack']},
'critical': {'threshold': 95, 'notify': ['team-slack', 'pagerduty']},
'emergency': {'threshold': 99, 'notify': ['team-slack', 'pagerduty', 'management']}
}
Best Practices
1. Design for Failure
- ✅ Assume components will fail
- ✅ Plan for graceful degradation
- ✅ Implement circuit breakers
- ✅ Use timeouts and retries
2. Measure Everything
- ✅ Application performance metrics
- ✅ Infrastructure health metrics
- ✅ Business impact metrics
- ✅ User experience metrics
3. Test Failure Scenarios
- ✅ Chaos engineering practices
- ✅ Disaster recovery drills
- ✅ Load testing under failure
- ✅ Security penetration testing
4. Automate Recovery
- ✅ Auto-healing systems
- ✅ Automated failover
- ✅ Self-scaling infrastructure
- ✅ Automated deployment rollbacks
Next Steps
- 📚 Study Performance & Latency - Metrics and optimization
- 🎯 Learn Consistency Patterns - Strong vs eventual consistency
- 🏗️ Practice Failure Scenarios - Chaos engineering
- 💻 Implement Circuit Breakers - Fault tolerance patterns
- 📖 Explore Monitoring Tools - Prometheus, Grafana, alerting
"Reliability is not about preventing failures - it's about building systems that continue working when failures inevitably occur."