Scalability Principles - Building Systems that Grow
Overview
Scalability is a system's ability to handle increased load by adding resources. It is a fundamental principle of system design, and it becomes especially important when building applications for millions of users.
Definition of Scalability
Scalability = Ability to handle increased load gracefully
Load có thể là:
- Users: Increased concurrent users
- Data: Growing database size
- Requests: More API calls per second
- Transactions: Higher transaction volume
- Storage: Increased storage requirements
Types of Scaling
🔺 Vertical Scaling (Scale Up)
Add power to the existing machine by upgrading its hardware.
Before Scaling:            After Scaling:
┌─────────────┐            ┌─────────────┐
│   4 cores   │            │  16 cores   │
│   8 GB RAM  │     →      │  64 GB RAM  │
│ 100 GB SSD  │            │  1 TB SSD   │
└─────────────┘            └─────────────┘
 Single Server              Bigger Server
Advantages:
- ✅ Simple: No code changes required
- ✅ No complexity: Single machine to manage
- ✅ Data consistency: No distributed system issues
- ✅ ACID compliance: Database ACID properties maintained
Disadvantages:
- ❌ Hardware limits: CPU and RAM have physical ceilings
- ❌ Single point of failure: If the machine fails, the entire system goes down
- ❌ Expensive: High-end hardware costs exponentially more
- ❌ Downtime: Upgrades require system shutdown
When to use:
- Legacy applications that cannot be distributed
- Small to medium applications
- When development team lacks distributed systems expertise
- Strict consistency requirements
↔️ Horizontal Scaling (Scale Out)
Add more machines to the pool of resources.
Before Scaling:        After Scaling:
┌─────────────┐        ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│             │        │     │ │     │ │     │ │     │
│   Single    │   →    │ Web │ │ Web │ │ Web │ │ Web │
│   Server    │        │ Srv │ │ Srv │ │ Srv │ │ Srv │
│             │        │  1  │ │  2  │ │  3  │ │  4  │
└─────────────┘        └─────┘ └─────┘ └─────┘ └─────┘
  One Machine               Multiple Machines
Advantages:
- ✅ No limits: Theoretically unlimited scaling
- ✅ Fault tolerant: One machine fails, others continue
- ✅ Cost effective: Commodity hardware is cheaper
- ✅ Incremental: Add machines as needed
Disadvantages:
- ❌ Complexity: Distributed system challenges
- ❌ Data consistency: CAP theorem limitations
- ❌ Code changes: Applications must be designed for distribution
- ❌ Network overhead: Inter-machine communication
When to use:
- Large-scale applications (millions of users)
- When high availability is critical
- Cost optimization is important
- Team has distributed systems expertise
Scalability Patterns
1. Load Distribution Patterns
Load Balancing
                ┌─────────────┐
Requests ─────▶ │Load Balancer│
                └──────┬──────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Server 1   │ │  Server 2   │ │  Server 3   │
└─────────────┘ └─────────────┘ └─────────────┘
Algorithms:
- Round Robin: Requests distributed evenly in turn
- Least Connections: Send to the server with the fewest active connections
- Weighted: Different servers handle different load percentages
- IP Hash: Route based on client IP
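The first two algorithms can be sketched in a few lines of Python. This is a minimal illustration, not a production balancer, and the server names are placeholders:

```python
import itertools

class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, one per request."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Sends each request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request completes so the count stays accurate
        self.active[server] -= 1
```

With servers ["web1", "web2", "web3"], round robin yields web1, web2, web3, web1, … while least-connections always targets the currently least-loaded server.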
Horizontal Partitioning (Sharding)
Users Database:
Shard 1 (Users A-H):   Shard 2 (Users I-P):   Shard 3 (Users Q-Z):
┌─────────────┐        ┌─────────────┐        ┌─────────────┐
│   Alice     │        │   John      │        │   Robert    │
│   Bob       │        │   Kate      │        │   Sarah     │
│   Charlie   │        │   Lisa      │        │   Tom       │
│   David     │        │   Mike      │        │   Victor    │
└─────────────┘        └─────────────┘        └─────────────┘
Sharding Strategies:
- Range-based: Partition by value ranges
- Hash-based: Use a hash function for distribution
- Directory-based: A lookup service maps keys to shards
- Geographic: Partition by location
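As a sketch, range-based and hash-based routing look like the following. The shard names and the A-H/I-P/Q-Z ranges follow the diagram above; the choice of hash function is arbitrary:

```python
import hashlib

SHARDS = ["shard_1", "shard_2", "shard_3"]

def range_shard(username):
    """Range-based: route by the first letter (A-H, I-P, Q-Z)."""
    first = username[0].upper()
    if first <= "H":
        return "shard_1"
    if first <= "P":
        return "shard_2"
    return "shard_3"

def hash_shard(user_id):
    """Hash-based: a stable hash spreads keys evenly across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Range-based sharding keeps related keys together but can create hot shards; hash-based sharding balances load but makes range queries span every shard.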
2. Caching Patterns
Cache Hierarchy
Client Request Flow:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Browser   │────▶│     CDN     │────▶│ App Server  │────▶│  Database   │
│    Cache    │     │    Cache    │     │    Cache    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
    Fastest              Fast               Medium             Slowest
Cache Types:
- Browser Cache: Static assets (CSS, JS, images)
- CDN Cache: Geographic distribution
- Application Cache: In-memory caching (Redis, Memcached)
- Database Cache: Query result caching
Cache Strategies:
# Cache-Aside (Lazy Loading)
def get_user(user_id):
    # Try cache first
    user = cache.get(f"user:{user_id}")
    if user is None:
        # Cache miss - get from database
        user = database.get_user(user_id)
        # Store in cache for future reads
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user

# Write-Through
def update_user(user_id, user_data):
    # Update database
    database.update_user(user_id, user_data)
    # Update cache immediately
    cache.set(f"user:{user_id}", user_data, ttl=3600)

# Write-Behind (Write-Back)
def update_user(user_id, user_data):
    # Update cache immediately
    cache.set(f"user:{user_id}", user_data, ttl=3600)
    # Database update happens asynchronously
    async_queue.add_task('update_db', user_id, user_data)
3. Database Scaling Patterns
Read Replicas
                     ┌─────────────┐
    Write Requests ─▶│   Master    │
                     │  Database   │
                     └──────┬──────┘
                            │ Replication
            ┌───────────────┼───────────────┐
            │               │               │
            ▼               ▼               ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
Read │  Replica 1  │ │  Replica 2  │ │  Replica 3  │
───▶ │  Database   │ │  Database   │ │  Database   │
     └─────────────┘ └─────────────┘ └─────────────┘
Benefits:
- Scale read operations
- Reduce load on the master database
- Geographic distribution possible
- Backup/disaster recovery
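A common application-side pattern is a small router that sends writes to the master and rotates reads across replicas. A minimal sketch, assuming the connection objects come from your database driver:

```python
import itertools

class ReplicatedDatabase:
    """Routes writes to the master, spreads reads across replicas."""
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, query):
        # Simplistic classification: anything starting with SELECT is a
        # read; real routers also consider replication lag and transactions.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.master
```

Because replication is asynchronous, a read issued right after a write may see stale data; read-your-own-writes schemes usually pin such reads to the master.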
Database Partitioning
# Vertical Partitioning (by columns)
Users Table Split:
┌─────────────┐     ┌─────────────┐
│   user_id   │     │   user_id   │
│  username   │     │    email    │
│ first_name  │     │    phone    │
│  last_name  │     │   address   │
└─────────────┘     └─────────────┘
Basic Info Table     Contact Table

# Horizontal Partitioning (by rows)
Users Table Split:
┌─────────────┐     ┌─────────────┐
│  Users 1-1M │     │ Users 1M-2M │
│   user_id   │     │   user_id   │
│  username   │     │  username   │
│    email    │     │    email    │
└─────────────┘     └─────────────┘
  Partition 1         Partition 2
Scalability Metrics
Performance Metrics
Latency vs Throughput
Latency = Time to process single request
Throughput = Number of requests processed per time unit
Example:
- Latency: 100ms per request
- Throughput: 1000 requests/second
Driving a system toward its maximum throughput usually increases latency!
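The two metrics are linked by Little's Law: concurrency = throughput × latency. A quick sanity check in Python, with numbers mirroring the example above:

```python
def throughput(concurrency, latency_seconds):
    """Little's Law rearranged: requests/second a system sustains with a
    given number of in-flight requests and a given per-request latency."""
    return concurrency / latency_seconds

# 100 requests in flight at 100 ms each sustains ~1000 requests/second;
# halving latency to 50 ms doubles the achievable throughput.
rps = throughput(100, 0.100)
```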
Response Time Percentiles
Response Time Distribution:
- p50 (median): 100ms (50% of requests < 100ms)
- p95: 200ms (95% of requests < 200ms)
- p99: 500ms (99% of requests < 500ms)
- p99.9: 1000ms (99.9% of requests < 1000ms)
Focus on p95, p99 rather than average!
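Percentiles are easy to compute from raw samples. A nearest-rank sketch; the sample latencies are made up to match the shape of the distribution above:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample >= pct% of the data."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [60, 70, 80, 90, 100, 120, 150, 200, 400, 500]
p50 = percentile(latencies_ms, 50)  # median of the samples
p95 = percentile(latencies_ms, 95)  # tail latency
```

Averages hide tails: one slow dependency can leave the mean flat while p99 explodes, which is why the tail percentiles matter.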
Availability Metrics
Uptime Calculation:
99.9% availability = 8.76 hours downtime/year
99.99% availability = 52.56 minutes downtime/year
99.999% availability = 5.26 minutes downtime/year
Nines Table:
90% = 36.5 days/year downtime
99% = 3.65 days/year downtime
99.9% = 8.76 hours/year downtime
99.99% = 52.56 minutes/year downtime
99.999% = 5.26 minutes/year downtime
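The nines table follows directly from the definition: allowed downtime is the unavailable fraction of a year. A small calculator reproduces the figures above:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Minutes per year a service may be down at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# 99.9%  -> ~525.6 minutes (8.76 hours) per year
# 99.99% -> ~52.56 minutes per year
```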
Real-World Scaling Examples
Twitter - Tweet Timeline
Problem: Generate timeline for 300M users
Solution: Multiple scaling approaches
1. Pull Model (Early days):
User requests → Generate timeline from following list
Issues: Slow for users who follow many accounts
2. Push Model (Current):
Tweet created → Push to all followers' pre-computed timelines
Issues: Fan-out explodes for celebrities with millions of followers
3. Hybrid Model:
Normal users: Push model
Celebrities: Pull model on-demand
Optimization based on follower count
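A toy version of the hybrid fan-out; the 10,000-follower threshold and the in-memory data layout are illustrative assumptions, not Twitter's actual values:

```python
CELEBRITY_THRESHOLD = 10_000  # assumed cutoff; real systems tune this

followers = {}  # author -> set of follower ids
following = {}  # user -> set of accounts they follow
timelines = {}  # user -> list of tweet ids pushed at write time

def publish_tweet(author, tweet_id):
    """Push model for normal accounts; celebrity tweets are pulled on read."""
    fans = followers.get(author, set())
    if len(fans) < CELEBRITY_THRESHOLD:
        for fan in fans:
            timelines.setdefault(fan, []).append(tweet_id)

def read_timeline(user, celebrity_tweets):
    """Merge the pre-computed timeline with celebrity tweets fetched on demand."""
    merged = list(timelines.get(user, []))
    for account in following.get(user, set()):
        if len(followers.get(account, set())) >= CELEBRITY_THRESHOLD:
            merged.extend(celebrity_tweets.get(account, []))
    return merged
```

The hybrid caps write amplification (a celebrity tweet is stored once instead of being copied to millions of timelines) at the cost of extra merge work on each read.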
Netflix - Video Streaming
Scaling Challenges:
- 200M+ subscribers globally
- 15 billion hours watched/month
- 4K video requires 25 Mbps bandwidth
Solutions:
1. CDN Network:
- Content cached in 1000+ locations globally
- 95% of traffic served from edge locations
2. Microservices:
- 700+ microservices
- Independent scaling per service
3. Auto-scaling:
- Scale based on real-time demand
- Predictive scaling for popular content releases
Instagram - Image Sharing
Scale: 1 billion users, 95 million photos/day
Database Sharding Strategy:
- Shard by user_id using consistent hashing
- 4000 database shards
- Each shard handles ~250K users
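Consistent hashing is what keeps resharding cheap: keys map onto a hash ring, and adding a shard only remaps the keys between it and its predecessor. A compact sketch, with illustrative shard names and virtual-node count:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps user ids to shards on a hash ring with virtual nodes."""
    def __init__(self, shards, vnodes=100):
        self._ring = []
        for shard in shards:
            for i in range(vnodes):
                # Virtual nodes smooth out the key distribution
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, user_id):
        # First ring position clockwise from the key's hash
        idx = bisect.bisect(self._hashes, self._hash(str(user_id)))
        return self._ring[idx % len(self._ring)][1]
```

With plain modulo hashing, adding one shard remaps almost every key; on the ring, only roughly 1/N of the keys move.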
Photo Storage:
- Original photos: Amazon S3
- Processed thumbnails: CDN
- Metadata: Cassandra database
Scaling Antipatterns
❌ Premature Optimization
Wrong: "Let's use microservices from day 1!"
Right: Start with a monolith, extract services when needed
❌ Over-Engineering
Wrong: Build for 1M users when you have 100 users
Right: Scale incrementally based on actual demand
❌ Ignoring Bottlenecks
Wrong: Add more web servers when database is bottleneck
Right: Identify and address actual bottlenecks first
❌ No Monitoring
Wrong: Scale without understanding current performance
Right: Monitor metrics before making scaling decisions
Best Practices
1. Design for Scaling
- ✅ Stateless applications
- ✅ Database connection pooling
- ✅ Asynchronous processing
- ✅ Circuit breaker pattern
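Of these, the circuit breaker is the least obvious. The sketch below shows the core state machine; the thresholds and the use of an exception for rejection are illustrative choices:

```python
import time

class CircuitBreaker:
    """Fails fast after repeated errors instead of hammering a sick dependency."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Combined with graceful degradation, the caller catches the fast failure and serves a cached or default response instead of waiting on timeouts.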
2. Monitor Everything
- ✅ Application metrics
- ✅ Infrastructure metrics
- ✅ Business metrics
- ✅ User experience metrics
3. Scale Incrementally
- ✅ Start simple, scale when needed
- ✅ Measure before optimizing
- ✅ Test scaling strategies
- ✅ Plan for peak traffic
4. Prepare for Failure
- ✅ Design for fault tolerance
- ✅ Graceful degradation
- ✅ Circuit breakers
- ✅ Chaos engineering
Next Steps
- 📚 Study Reliability & Availability - CAP theorem, ACID properties
- 🎯 Learn Performance Optimization - Latency, throughput metrics
- 🏗️ Practice Capacity Planning - Estimation calculations
- 💻 Implement Caching - Redis, CDN strategies
- 📖 Explore Database Scaling - Sharding, read replicas
"Scalability is not about building for millions of users on day one. It's about building systems that can evolve and grow gracefully as demand increases."