System Design - Building Scalable Distributed Systems

Overview

System Design is the skill of designing and architecting large software systems that can handle millions of users and petabytes of data. It is a core skill for Staff Engineers, Engineering Managers, and other senior roles at tech companies.

Why is System Design important?

On the job:

  • Architecture decisions: Design systems that can scale
  • Technology choices: Choose the right tools and frameworks
  • Performance optimization: Optimize latency and throughput
  • Cost management: Balance performance vs cost
  • Team collaboration: Communicate technical decisions

In interviews:

  • Senior roles: Staff Engineer, Principal Engineer, Engineering Manager
  • Big Tech: Google, Meta, Amazon, Apple, Netflix, Uber
  • Unicorn startups: Airbnb, Stripe, Databricks, Snowflake
  • Problem-solving: Demonstrate architectural thinking

Core Principles

🎯 Scalability

Horizontal Scaling (Scale Out)    vs    Vertical Scaling (Scale Up)
┌─────┐ ┌─────┐ ┌─────┐                    ┌─────────────┐
│     │ │     │ │     │                    │             │
│ App │ │ App │ │ App │          →         │   Bigger    │
│     │ │     │ │     │                    │   Server    │
└─────┘ └─────┘ └─────┘                    │             │
Add more machines                           └─────────────┘
                                           Upgrade hardware
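
As a minimal sketch of what scaling out means in practice, the snippet below spreads requests across several identical app instances in rotation. The instance names are hypothetical, and round-robin is only one of several possible distribution strategies; adding capacity amounts to adding entries to the list.

```python
from itertools import cycle


def make_round_robin(instances):
    """Return a function that hands out instances in rotation."""
    pool = cycle(instances)
    return lambda: next(pool)


# Hypothetical instance names; scaling out means adding entries to this list.
pick = make_round_robin(["app-1", "app-2", "app-3"])

# Six incoming requests are spread evenly over the three instances.
assignments = [pick() for _ in range(6)]
print(assignments)
```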

⚖️ Reliability

  • Fault tolerance: The system keeps running when components fail
  • Redundancy: Multiple copies of critical components
  • Graceful degradation: Reduce functionality instead of complete failure
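
Graceful degradation can be illustrated with a fallback path. The sketch below assumes a hypothetical personalized recommendation service that is currently failing, and a simpler static list used in its place; all function names are made up for illustration.

```python
def fetch_recommendations(user_id, primary, fallback):
    """Try the primary source; on failure, degrade to a simpler fallback."""
    try:
        return {"items": primary(user_id), "degraded": False}
    except Exception:
        # The primary component failed: serve reduced functionality
        # instead of returning an error to the user.
        return {"items": fallback(user_id), "degraded": True}


def personalized(user_id):
    # Hypothetical downstream service that is currently unavailable.
    raise ConnectionError("recommendation service unavailable")


def popular_items(user_id):
    # Hypothetical static fallback that does not depend on the failing service.
    return ["top-1", "top-2", "top-3"]


result = fetch_recommendations(42, personalized, popular_items)
print(result)
```

The `degraded` flag lets callers and monitoring see that a reduced response was served.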

🚀 Performance

  • Latency: Time to process single request
  • Throughput: Number of requests per second
  • Availability: Percentage of uptime (99.9% = 8.76 hours downtime/year)
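
The downtime figure for 99.9% follows from simple arithmetic. A small sketch, assuming a 365-day year, that converts an availability percentage into allowed downtime:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year the system may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten.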

📊 Consistency

  • Strong consistency: All nodes see same data simultaneously
  • Eventual consistency: System becomes consistent over time
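  • CAP Theorem: Consistency, Availability, Partition tolerance. A system cannot guarantee all three at once; because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability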
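
To make eventual consistency concrete, here is a toy sketch with one primary and one replica, where writes are replicated asynchronously through a queue. The `Primary` and `Replica` classes are hypothetical and ignore networking, failures, and conflict resolution.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    """Toy primary that replicates writes asynchronously via a queue."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.pending = deque()

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # not yet applied on the replica

    def replicate(self):
        # Drain the queue; afterwards the replica has caught up.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica.data[key] = value


replica = Replica()
primary = Primary(replica)
primary.write("name", "alice")

stale = replica.data.get("name")  # read before replication: value is missing
primary.replicate()
fresh = replica.data.get("name")  # read after replication: value is visible
print(stale, fresh)
```

A strongly consistent design would instead wait for the replica to apply the write before acknowledging it, trading latency for up-to-date reads.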

Learning Path

🏗️ Fundamentals (Foundation)

  1. Scalability Principles - Horizontal vs Vertical scaling
  2. Reliability & Availability - CAP theorem, ACID properties
  3. Performance & Latency - Metrics, optimization techniques
  4. Consistency Patterns - Strong vs Eventual consistency

🏛️ Architecture Patterns (Design approaches)

  1. Microservices Architecture - Service decomposition, communication
  2. Event-Driven Architecture - Event sourcing, CQRS patterns
  3. Serverless Architecture - FaaS, event-driven computing
  4. Distributed Systems - Consensus algorithms, replication

🧩 System Components (Building blocks)

  1. Load Balancers - L4/L7, algorithms, health checks
  2. Caching Strategies - Redis, Memcached, CDN patterns
  3. Message Queues - RabbitMQ, Kafka, SQS patterns
  4. Databases & Storage - SQL vs NoSQL, sharding, replication

📝 Design Process (Methodology)

  1. Requirements Gathering - Functional vs Non-functional
  2. Capacity Estimation - Traffic, storage, bandwidth calculations
  3. API Design - REST, GraphQL, gRPC trade-offs
  4. Monitoring & Alerting - Metrics, logging, distributed tracing

🏗️ Case Studies (Real-world systems)

  1. URL Shortener - Bitly, TinyURL (Simple distributed system)
  2. Social Media Feed - Twitter, Facebook timeline
  3. Chat System - WhatsApp, Slack real-time messaging
  4. Search Engine - Google search indexing and ranking
  5. Video Streaming - YouTube, Netflix content delivery
  6. E-commerce Platform - Amazon, Shopify transaction processing

🚀 Advanced Topics (Expert level)

  1. Security Design - Authentication, authorization, data protection
  2. Global Scale - Multi-region, geo-distributed systems
  3. Cost Optimization - Resource planning, efficiency techniques
  4. Interview Preparation - Common questions, systematic approach

Design Interview Process

📋 Step 1: Requirements Clarification (5 mins)

❓ Questions to ask:
- Who are the users? (B2C, B2B, internal)
- What features do we need to support?
- How many users? (DAU, MAU)
- What's the scale? (reads vs writes)
- What devices? (mobile, web, desktop)

📊 Step 2: Capacity Estimation (10 mins)

📈 Calculate:
- Daily Active Users (DAU): 100M users
- Reads per day: 100M × 10 = 1B reads
- Writes per day: 100M × 1 = 100M writes
- Read QPS: 1B / 86400 = ~12K QPS
- Write QPS: 100M / 86400 = ~1.2K QPS
- Storage: 100M posts × 1KB = 100GB/day
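
The estimates above can be reproduced with a few lines of arithmetic. This sketch uses the same inputs (100M DAU, 10 reads and 1 write per user per day, 1 KB per post) and assumes decimal units (1 KB = 1,000 bytes); the yearly storage figure is an extrapolation added for illustration.

```python
SECONDS_PER_DAY = 86_400

dau = 100_000_000        # daily active users
reads_per_user = 10      # reads per user per day
writes_per_user = 1      # writes (posts) per user per day
post_size_bytes = 1_000  # 1 KB per post, decimal units

reads_per_day = dau * reads_per_user
writes_per_day = dau * writes_per_user

# Average QPS; peak traffic is usually higher and should be sized separately.
read_qps = reads_per_day / SECONDS_PER_DAY
write_qps = writes_per_day / SECONDS_PER_DAY

storage_gb_per_day = writes_per_day * post_size_bytes / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1000

print(f"Read QPS:  ~{read_qps:,.0f}")
print(f"Write QPS: ~{write_qps:,.0f}")
print(f"Storage:   {storage_gb_per_day:.0f} GB/day, {storage_tb_per_year:.1f} TB/year")
```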

🏗️ Step 3: High-Level Design (15 mins)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Client    │───▶│Load Balancer│───▶│ Web Servers │
└─────────────┘    └─────────────┘    └─────────────┘
                                              │
                                              ▼
                   ┌─────────────┐    ┌─────────────┐
                   │   Cache     │◀───│ Application │
                   └─────────────┘    │   Servers   │
                                      └─────────────┘
                                              │
                                              ▼
                                      ┌─────────────┐
                                      │  Database   │
                                      └─────────────┘
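
One common way to read the cache and database boxes in this diagram is the cache-aside pattern: the application checks the cache first and falls back to the database on a miss. The sketch below is an assumption about how the components interact, with plain dicts standing in for a real cache and database; `CacheAsideStore` is a hypothetical name.

```python
class CacheAsideStore:
    """Read path: check the cache first, fall back to the database on a miss."""

    def __init__(self, database):
        self.database = database  # dict standing in for the database
        self.cache = {}           # dict standing in for Redis/Memcached
        self.db_reads = 0         # counts how often the database is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no database access
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value


store = CacheAsideStore({"user:1": "alice"})
first = store.get("user:1")   # miss: reads the database
second = store.get("user:1")  # hit: served from the cache
print(first, second, store.db_reads)
```

A production version would also need expiry and invalidation on writes, which this sketch leaves out.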

🔧 Step 4: Detailed Design (15 mins)

  • Deep dive into specific components
  • Database schema design
  • API endpoint design
  • Algorithm choices

🎯 Step 5: Scale & Optimize (10 mins)

  • Identify bottlenecks
  • Scaling strategies
  • Monitoring and alerting
  • Trade-offs discussion

Tools & Technologies

Databases:

  • SQL: PostgreSQL, MySQL, Amazon RDS
  • NoSQL: MongoDB, DynamoDB, Cassandra
  • Caching: Redis, Memcached, Amazon ElastiCache
  • Search: Elasticsearch, Amazon CloudSearch

Message Queues:

  • Traditional: RabbitMQ, Amazon SQS
  • Streaming: Apache Kafka, Amazon Kinesis
  • Pub/Sub: Google Pub/Sub, Amazon SNS

Monitoring:

  • Metrics: Prometheus + Grafana, DataDog
  • Logging: ELK Stack, Splunk
  • Tracing: Jaeger, AWS X-Ray

Cloud Platforms:

  • AWS: EC2, RDS, DynamoDB, S3, CloudFront
  • GCP: Compute Engine, Cloud SQL, BigQuery
  • Azure: Virtual Machines, Cosmos DB, CDN

Real-World Examples

Scale Examples:

  • Google: 8.5 billion searches/day
  • Facebook: 2.9 billion monthly active users
  • YouTube: over 1 billion hours watched/day
  • Amazon: 13 million orders/day
  • Netflix: 15 billion hours streamed/month

Architecture Evolution:

Monolith → SOA → Microservices → Serverless
   ↓        ↓         ↓             ↓
Single    Service   Independent   Function
App      Oriented   Services      Based

Success Metrics

Technical Metrics:

  • Latency: p95 < 200ms, p99 < 500ms
  • Throughput: 10K+ requests/second
  • Availability: 99.9%+ uptime
  • Error Rate: < 0.1% of requests
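
Latency targets such as p95 and p99 are percentiles over observed request times. A minimal sketch using the nearest-rank method on hypothetical sample data (monitoring systems typically estimate percentiles from histograms instead):

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
meets_targets = p95 < 200 and p99 < 500
print(p95, p99, meets_targets)
```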

Business Metrics:

  • Cost: Infrastructure cost per user
  • Time to Market: Feature development speed
  • Developer Productivity: Deployment frequency
  • User Experience: Page load times, success rates

Next Steps

  1. 📚 Start with Fundamentals/Scalability Principles
  2. 🎯 Practice capacity estimation calculations
  3. 🏗️ Study real system architectures (Netflix, Uber tech blogs)
  4. 💻 Build mini versions of the case study systems
  5. 🗣️ Practice explaining designs verbally

"Good system design is not about knowing every technology. It's about understanding trade-offs and making informed decisions."