Case Study: URL Shortener (like Bit.ly)
Overview
A URL shortener is a service that converts long URLs into short, easy-to-share ones. It is a classic case study in system design interviews.
Requirements
Functional Requirements
1. Shorten long URLs to short URLs
2. Redirect short URLs to original URLs
3. Custom aliases (optional)
4. URL expiration
5. Analytics and click tracking
Non-Functional Requirements
- Scale: 100M URLs created per month
- Read:Write ratio = 100:1
- Availability: 99.9%
- Latency: < 100ms for redirects
- URL length: 6-7 characters
Capacity Estimation
Storage Requirements
class URLShortenerCapacity:
    def __init__(self):
        self.urls_per_month = 100_000_000
        self.years_retention = 5
        self.avg_url_size = 500  # bytes

    def calculate_storage(self):
        total_urls = self.urls_per_month * 12 * self.years_retention
        storage_needed = total_urls * self.avg_url_size
        return {
            'total_urls': total_urls,                  # 6 billion URLs
            'storage_tb': storage_needed / (1024**4)   # ~2.7 TB
        }
Traffic Estimation
Write QPS: 100M / (30 * 24 * 3600) = ~40 URLs/second
Read QPS: 40 * 100 = 4000 redirects/second
Peak traffic: 2x average = 8000 QPS
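A quick sanity check (a sketch, not from the original estimates) confirms why 6-7 Base62 characters are enough for this volume:

```python
# Verify that a 6-7 character Base62 code covers the expected URL volume.
ALPHABET_SIZE = 62                    # a-z, A-Z, 0-9
urls_needed = 100_000_000 * 12 * 5    # 100M/month for 5 years = 6 billion

keyspace_6 = ALPHABET_SIZE ** 6       # 56,800,235,584 (~56.8 billion codes)
keyspace_7 = ALPHABET_SIZE ** 7       # ~3.5 trillion codes

print(keyspace_6 > urls_needed)  # True: 6 chars already cover 6B URLs
print(keyspace_7 > urls_needed)  # True, with large headroom
```

Six characters give nearly 10x headroom over the 5-year estimate; the seventh character buys decades more.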
Database Design
Schema Design
CREATE TABLE urls (
id BIGINT PRIMARY KEY,
short_url VARCHAR(7) UNIQUE NOT NULL,
long_url TEXT NOT NULL,
user_id BIGINT,
created_at TIMESTAMP,
expires_at TIMESTAMP,
click_count INT DEFAULT 0
);
CREATE INDEX idx_short_url ON urls(short_url);
CREATE INDEX idx_user_id ON urls(user_id);
Sharding Strategy
# Hash-based (modulo) sharding. Note: this is NOT true consistent
# hashing -- changing num_shards remaps most existing keys.
import hashlib

class URLSharding:
    def __init__(self, num_shards=10):
        self.num_shards = num_shards

    def get_shard(self, short_url):
        hash_value = hashlib.md5(short_url.encode()).hexdigest()
        return int(hash_value, 16) % self.num_shards
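For contrast, a true consistent-hashing ring only remaps the keys adjacent to a shard that is added or removed. A minimal sketch (class and parameter names are illustrative, not from the original):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring; each shard owns `replicas` virtual nodes."""

    def __init__(self, shards, replicas=100):
        self.ring = []  # sorted list of (hash, shard)
        for shard in shards:
            for i in range(replicas):
                self.ring.append((self._hash(f"{shard}:{i}"), shard))
        self.ring.sort()

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, short_url):
        h = self._hash(short_url)
        # Walk clockwise to the first virtual node at or after h
        idx = bisect.bisect_left(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a shard now only steals keys from its clockwise neighbors, instead of reshuffling `(num_shards - 1) / num_shards` of all keys as plain modulo does.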
URL Encoding Algorithm
Base62 Encoding
class Base62Encoder:
    def __init__(self):
        self.alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        self.base = len(self.alphabet)  # 62

    def encode(self, number):
        if number == 0:
            return self.alphabet[0]
        result = []
        while number > 0:
            result.append(self.alphabet[number % self.base])
            number //= self.base
        return ''.join(reversed(result))

    def decode(self, encoded):
        number = 0
        for char in encoded:
            number = number * self.base + self.alphabet.index(char)
        return number
Counter-based vs Hash-based
# Counter-based approach: collision-free, but the counter itself must
# be made distributed (e.g. ID ranges handed out per server)
class CounterBasedGenerator:
    def __init__(self):
        self.counter = 0
        self.encoder = Base62Encoder()

    def generate_short_url(self):
        self.counter += 1
        return self.encoder.encode(self.counter)

# Hash-based approach: stateless, but truncating the hash to 8 hex
# digits makes collisions possible; they must be detected on insert
import hashlib

class HashBasedGenerator:
    def __init__(self):
        self.encoder = Base62Encoder()

    def generate_short_url(self, long_url):
        hash_value = hashlib.md5(long_url.encode()).hexdigest()
        number = int(hash_value[:8], 16)
        return self.encoder.encode(number)
System Architecture
High-Level Design
Client → Load Balancer → API Gateway → Application Servers → Database
                                                           → Cache (Redis)
                                                           → Analytics Service
API Design
POST /api/v1/shorten
{
"long_url": "https://example.com/very/long/url",
"custom_alias": "my-link",
"expires_at": "2024-12-31T23:59:59Z"
}
Response:
{
"short_url": "https://bit.ly/abc123",
"long_url": "https://example.com/very/long/url",
"created_at": "2024-01-15T10:30:00Z"
}
GET /abc123 → 302 Redirect to original URL
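The redirect endpoint can be sketched framework-agnostically (a sketch; the in-memory dict stands in for the cache/database lookup):

```python
def handle_redirect(short_code, url_store):
    """Return (status, headers) for GET /<short_code>.

    url_store maps short codes to long URLs (stands in for cache/DB).
    A 302 keeps clients re-requesting our server, so every click can be
    counted; a 301 would let browsers cache the redirect and skip us.
    """
    long_url = url_store.get(short_code)
    if long_url is None:
        return 404, {}
    return 302, {"Location": long_url}

status, headers = handle_redirect(
    "abc123", {"abc123": "https://example.com/very/long/url"}
)
# status == 302, headers["Location"] holds the original URL
```

The 301-vs-302 choice is a real trade-off: 301 shifts load off the origin, 302 preserves analytics.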
Caching Strategy
Multi-Level Caching
class URLCache:
    def __init__(self):
        self.local_cache = {}   # Application-level (in-process) cache
        self.redis = Redis()    # Distributed cache

    def get_url(self, short_url):
        # 1. Check the local cache first
        if short_url in self.local_cache:
            return self.local_cache[short_url]
        # 2. Check the Redis cache
        long_url = self.redis.get(f"url:{short_url}")
        if long_url:
            self.local_cache[short_url] = long_url
            return long_url
        # 3. Fall back to the database (helper not shown)
        return self.get_from_database(short_url)
Analytics and Monitoring
Click Tracking
from datetime import datetime

class ClickTracker:
    def track_click(self, short_url, request_info):
        # Enqueue asynchronously so tracking never blocks the redirect
        self.queue.put({
            'short_url': short_url,
            'timestamp': datetime.utcnow(),
            'ip': request_info.remote_addr,
            'user_agent': request_info.user_agent,
            'referer': request_info.referer
        })
Real-time Analytics
# Stream processing for real-time analytics
class AnalyticsProcessor:
    def process_click_stream(self, click_event):
        # Update per-URL click counters
        self.increment_click_count(click_event['short_url'])
        # Geographic analysis
        location = self.geo_lookup(click_event['ip'])
        self.track_geographic_stats(click_event['short_url'], location)
        # Device analysis
        device_info = self.parse_user_agent(click_event['user_agent'])
        self.track_device_stats(click_event['short_url'], device_info)
Scalability Considerations
Read Replicas
import random

class DatabaseRouter:
    def __init__(self):
        self.master = MasterDB()
        self.replicas = [ReplicaDB1(), ReplicaDB2(), ReplicaDB3()]

    def write(self, query):
        # All writes go to the primary
        return self.master.execute(query)

    def read(self, query):
        # Spread reads across replicas (fits the 100:1 read:write ratio)
        replica = random.choice(self.replicas)
        return replica.execute(query)
CDN Integration
- Cache popular short URLs at CDN edge
- Reduce latency for global users
- Handle 302 redirects at edge locations
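To make redirects edge-cacheable, the origin can attach explicit cache headers. A sketch (the TTL value and helper name are illustrative assumptions):

```python
def redirect_headers(long_url, edge_ttl=300):
    """Headers for a 302 response that a CDN may cache briefly at the edge.

    A short s-maxage lets edge locations absorb traffic for hot links,
    while max-age=0 keeps browsers from caching the redirect, so origin
    click counts stay approximately fresh.
    """
    return {
        "Location": long_url,
        "Cache-Control": f"public, s-maxage={edge_ttl}, max-age=0",
    }

h = redirect_headers("https://example.com/very/long/url")
```

Longer edge TTLs trade analytics accuracy for origin load; expired or deleted links also take up to `edge_ttl` seconds to stop redirecting.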
Security Considerations
Rate Limiting
class RateLimiter:
    def is_allowed(self, user_ip, action):
        # check_limit (a sliding-window counter, implementation not shown)
        # returns True while the caller stays under max_requests per
        # window seconds
        if action == 'create':
            return self.check_limit(user_ip, max_requests=10, window=3600)
        elif action == 'redirect':
            return self.check_limit(user_ip, max_requests=1000, window=3600)
        return False
Spam Prevention
class SpamDetector:
    def is_spam(self, long_url, user_info):
        # Check against known malicious domains
        if self.is_malicious_domain(long_url):
            return True
        # Check user behavior patterns (burst creation, bot signatures, ...)
        if self.detect_suspicious_pattern(user_info):
            return True
        return False
Performance Optimization
Database Optimization
-- Partitioning by creation date (in PostgreSQL this requires the parent
-- table to be declared with PARTITION BY RANGE (created_at))
CREATE TABLE urls_2024_01 PARTITION OF urls
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Aggregated view for analytics queries (run against read replicas)
CREATE VIEW url_analytics AS
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS urls_created,
    SUM(click_count) AS total_clicks
FROM urls
GROUP BY DATE(created_at);
Disaster Recovery
Backup Strategy
- Daily full database backups
- Real-time replication to secondary region
- Point-in-time recovery capability
- Regular backup restoration testing
Next Steps
This content will be expanded with:
- Advanced caching patterns
- Global distribution strategies
- Custom domain support
- API versioning strategies
- Microservices decomposition