Case Study: URL Shortener (like Bit.ly)

Overview

A URL shortener is a service that converts long URLs into short, easy-to-share URLs. It is a classic case study in system design interviews.

Requirements

Functional Requirements

1. Shorten long URLs to short URLs
2. Redirect short URLs to original URLs
3. Custom aliases (optional)
4. URL expiration
5. Analytics and click tracking

Non-Functional Requirements

- Scale: 100M URLs created per month
- Read:Write ratio = 100:1
- Availability: 99.9%
- Latency: < 100ms for redirects
- URL length: 6-7 characters
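
A quick sanity check on the last requirement, assuming Base62 short codes (the encoding covered later in this document), shows why 6-7 characters are enough for years of the stated volume:

```python
# Keyspace of Base62 short codes vs. the required URL volume
def base62_keyspace(length):
    return 62 ** length

five_year_urls = 100_000_000 * 12 * 5    # 6 billion URLs at 100M/month

print(base62_keyspace(6))                # → 56800235584
print(base62_keyspace(6) > five_year_urls)  # → True: 6 chars already suffice
```

Seven characters push the keyspace to roughly 3.5 trillion codes, leaving ample headroom for collisions and reserved aliases.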

Capacity Estimation

Storage Requirements

class URLShortenerCapacity:
    def __init__(self):
        self.urls_per_month = 100_000_000   # from the scale requirement above
        self.years_retention = 5
        self.avg_url_size = 500             # bytes per record (URL + metadata)

    def calculate_storage(self):
        total_urls = self.urls_per_month * 12 * self.years_retention
        storage_needed = total_urls * self.avg_url_size  # bytes
        return {
            'total_urls': total_urls,                    # 6 billion URLs
            'storage_tb': storage_needed / (1024**4)     # ~2.7 TB
        }

Traffic Estimation

Write QPS: 100M / (30 * 24 * 3600) = ~40 URLs/second
Read QPS: 40 * 100 = 4000 redirects/second
Peak traffic: 2x average = 8000 QPS
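
The back-of-envelope numbers above can be reproduced in code (same assumptions: 100M writes/month, 100:1 read:write ratio, 2x peak factor):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600           # 2,592,000 seconds

write_qps = 100_000_000 / SECONDS_PER_MONTH  # ~38.6, rounded to ~40 above
read_qps = write_qps * 100                   # 100:1 read:write ratio
peak_read_qps = read_qps * 2                 # 2x average at peak

print(round(write_qps, 1), round(read_qps), round(peak_read_qps))
```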

Database Design

Schema Design

CREATE TABLE urls (
    id BIGINT PRIMARY KEY,
    short_url VARCHAR(7) UNIQUE NOT NULL,
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP,
    expires_at TIMESTAMP,
    click_count INT DEFAULT 0
);

-- The UNIQUE constraint on short_url already creates the index used for lookups
CREATE INDEX idx_user_id ON urls(user_id);

Sharding Strategy

import hashlib

# Hash-based sharding via modulo (note: adding a shard remaps most keys)
class URLSharding:
    def __init__(self, num_shards=10):
        self.num_shards = num_shards

    def get_shard(self, short_url):
        hash_value = hashlib.md5(short_url.encode()).hexdigest()
        return int(hash_value, 16) % self.num_shards
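
The modulo scheme above remaps most keys whenever a shard is added. A true consistent-hash ring limits that movement to roughly 1/N of the keys. A minimal sketch using the standard bisect module (the shard names are hypothetical):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Place vnodes virtual points per shard on the ring for even balance
        self.ring = []
        for shard in shards:
            for i in range(vnodes):
                point = self._hash(f"{shard}#{i}")
                self.ring.append((point, shard))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, short_url):
        # First ring point clockwise from the key's hash (wrapping around)
        idx = bisect.bisect(self.points, self._hash(short_url)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(10)])
print(ring.get_shard("abc123"))  # deterministic shard for this key
```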

URL Encoding Algorithm

Base62 Encoding

class Base62Encoder:
    def __init__(self):
        self.alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        self.base = len(self.alphabet)

    def encode(self, number):
        if number == 0:
            return self.alphabet[0]

        result = []
        while number > 0:
            result.append(self.alphabet[number % self.base])
            number //= self.base

        return ''.join(reversed(result))

    def decode(self, encoded):
        number = 0
        for char in encoded:
            number = number * self.base + self.alphabet.index(char)
        return number
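
A round-trip check confirms the encoding is reversible; here is a self-contained functional version of the same algorithm for illustration:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def b62_encode(number):
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        digits.append(ALPHABET[number % 62])
        number //= 62
    return ''.join(reversed(digits))

def b62_decode(encoded):
    number = 0
    for char in encoded:
        number = number * 62 + ALPHABET.index(char)
    return number

print(b62_encode(125))              # → 'cb'
print(b62_decode(b62_encode(125)))  # → 125
```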

Counter-based vs Hash-based

import hashlib

# Counter-based approach: guarantees uniqueness, but the counter is a single
# point of contention and must be distributed (or pre-allocated in ranges) in practice
class CounterBasedGenerator:
    def __init__(self):
        self.counter = 0
        self.encoder = Base62Encoder()

    def generate_short_url(self):
        self.counter += 1
        return self.encoder.encode(self.counter)

# Hash-based approach: stateless, but truncating the hash to 8 hex chars
# can collide for different URLs, so collisions must be detected and retried
class HashBasedGenerator:
    def __init__(self):
        self.encoder = Base62Encoder()

    def generate_short_url(self, long_url):
        hash_value = hashlib.md5(long_url.encode()).hexdigest()
        number = int(hash_value[:8], 16)
        return self.encoder.encode(number)
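
Because the truncated hash can collide, a production hash-based generator typically retries with a salt until it finds a free code. A minimal sketch, where the in-memory `taken` set stands in for a uniqueness check against the urls table (the hex prefix stands in for a Base62-encoded code):

```python
import hashlib

def short_code(long_url, salt=""):
    # In the real system the truncated hash would be Base62-encoded;
    # a hex prefix is enough to illustrate collision handling
    return hashlib.md5((long_url + salt).encode()).hexdigest()[:7]

def generate_with_retry(long_url, taken, max_attempts=10):
    # 'taken' stands in for a uniqueness check against the database
    for attempt in range(max_attempts):
        code = short_code(long_url, salt=str(attempt) if attempt else "")
        if code not in taken:
            taken.add(code)
            return code
    raise RuntimeError("could not find a free short code")

taken = set()
a = generate_with_retry("https://example.com/page", taken)
b = generate_with_retry("https://example.com/page", taken)
print(a, b)  # second call collides on attempt 0, so the salted retry yields a new code
```

Whether the same long URL should map to one code or many is itself a design choice; the sketch above assumes every shorten request gets its own code.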

System Architecture

High-Level Design

Client → Load Balancer → API Gateway → Application Servers → Database
                                    → Cache (Redis)
                                    → Analytics Service

API Design

POST /api/v1/shorten
{
  "long_url": "https://example.com/very/long/url",
  "custom_alias": "my-link",
  "expires_at": "2024-12-31T23:59:59Z"
}

Response:
{
  "short_url": "https://bit.ly/abc123",
  "long_url": "https://example.com/very/long/url",
  "created_at": "2024-01-15T10:30:00Z"
}

GET /abc123 → 302 Redirect to original URL
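
The redirect path can be sketched as a framework-agnostic lookup function returning an HTTP status and Location value (the in-memory `store` is a stand-in for the cache/database lookup):

```python
def resolve_redirect(short_code, store):
    """Return (status, location) for a redirect request."""
    long_url = store.get(short_code)
    if long_url is None:
        return 404, None
    # 302 (temporary) keeps clicks flowing through the service for analytics;
    # 301 (permanent) would let browsers cache the redirect and skip the server
    return 302, long_url

store = {"abc123": "https://example.com/very/long/url"}
print(resolve_redirect("abc123", store))   # → (302, 'https://example.com/very/long/url')
print(resolve_redirect("missing", store))  # → (404, None)
```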

Caching Strategy

Multi-Level Caching

from redis import Redis  # third-party client: pip install redis

class URLCache:
    def __init__(self):
        self.local_cache = {}  # Application-level cache (bound it in production, e.g. LRU)
        self.redis = Redis(decode_responses=True)  # Distributed cache

    def get_url(self, short_url):
        # Check local cache first
        if short_url in self.local_cache:
            return self.local_cache[short_url]

        # Check Redis cache
        long_url = self.redis.get(f"url:{short_url}")
        if long_url:
            self.local_cache[short_url] = long_url
            return long_url

        # Fall back to the database (and populate the caches on the way out)
        return self.get_from_database(short_url)
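
The plain dict used as the local cache grows without bound; an OrderedDict-based LRU keeps it capped. A minimal sketch:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("a", "1"); cache.put("b", "2")
cache.get("a")          # touch 'a' so 'b' becomes least recently used
cache.put("c", "3")     # evicts 'b'
print(cache.get("b"))   # → None
```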

Analytics and Monitoring

Click Tracking

import queue
from datetime import datetime

class ClickTracker:
    def __init__(self):
        self.queue = queue.Queue()  # drained by a background worker

    def track_click(self, short_url, request_info):
        # Enqueue asynchronously so the redirect response is not blocked
        self.queue.put({
            'short_url': short_url,
            'timestamp': datetime.utcnow(),
            'ip': request_info.remote_addr,
            'user_agent': request_info.user_agent,
            'referer': request_info.referer
        })
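
The queue needs a consumer; a daemon thread draining it is a common pattern. A minimal sketch, where the `sink` list stands in for a write to the analytics store:

```python
import queue
import threading

class ClickWorker:
    def __init__(self, click_queue, sink):
        self.queue = click_queue
        self.sink = sink  # stand-in for the analytics store
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            event = self.queue.get()
            if event is None:      # sentinel to stop the worker
                break
            self.sink.append(event)
            self.queue.task_done()

q = queue.Queue()
sink = []
worker = ClickWorker(q, sink)
q.put({'short_url': 'abc123'})
q.put(None)           # stop the worker
worker.thread.join()
print(len(sink))      # → 1
```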

Real-time Analytics

# Stream processing for real-time analytics
class AnalyticsProcessor:
    def process_click_stream(self, click_event):
        # Update counters
        self.increment_click_count(click_event['short_url'])

        # Geographic analysis
        location = self.geo_lookup(click_event['ip'])
        self.track_geographic_stats(click_event['short_url'], location)

        # Device analysis
        device_info = self.parse_user_agent(click_event['user_agent'])
        self.track_device_stats(click_event['short_url'], device_info)

Scalability Considerations

Read Replicas

import random

class DatabaseRouter:
    def __init__(self):
        self.master = MasterDB()
        self.replicas = [ReplicaDB1(), ReplicaDB2(), ReplicaDB3()]

    def write(self, query):
        return self.master.execute(query)

    def read(self, query):
        replica = random.choice(self.replicas)
        return replica.execute(query)

CDN Integration

- Cache popular short URLs at CDN edge
- Reduce latency for global users
- Handle 302 redirects at edge locations

Security Considerations

Rate Limiting

class RateLimiter:
    def is_allowed(self, user_ip, action):
        if action == 'create':
            return self.check_limit(user_ip, action, max_requests=10, window=3600)
        elif action == 'redirect':
            return self.check_limit(user_ip, action, max_requests=1000, window=3600)
        return False
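
check_limit can be implemented as a fixed-window counter. An in-memory sketch; a real deployment would keep the counters in Redis (INCR plus EXPIRE) so all app servers share state:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self):
        self.counters = defaultdict(int)  # (ip, action, window_start) -> count

    def check_limit(self, user_ip, action, max_requests, window):
        window_start = int(time.time()) // window  # current window index
        key = (user_ip, action, window_start)
        self.counters[key] += 1
        return self.counters[key] <= max_requests

limiter = FixedWindowLimiter()
results = [limiter.check_limit("1.2.3.4", "create", max_requests=3, window=3600)
           for _ in range(4)]
print(results)  # → [True, True, True, False]
```

Fixed windows allow short bursts at window boundaries; a sliding-window or token-bucket variant smooths that out at the cost of more bookkeeping.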

Spam Prevention

class SpamDetector:
    def is_spam(self, long_url, user_info):
        # Check against known malicious domains
        if self.is_malicious_domain(long_url):
            return True

        # Check user behavior patterns
        if self.detect_suspicious_pattern(user_info):
            return True

        return False
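
The malicious-domain check can start as a simple hostname blocklist using urllib.parse (the blocked domains below are placeholders):

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"malware.example", "phish.example"}  # placeholder blocklist

def is_malicious_domain(long_url):
    host = (urlparse(long_url).hostname or "").lower()
    # Match the blocked domain itself and any of its subdomains
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

print(is_malicious_domain("https://phish.example/login"))    # → True
print(is_malicious_domain("https://sub.malware.example/x"))  # → True
print(is_malicious_domain("https://example.com/safe"))       # → False
```

In production this set would be backed by a threat-intelligence feed such as a safe-browsing service rather than a hard-coded list.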

Performance Optimization

Database Optimization

-- Partitioning by creation date (requires the parent table to be declared
-- PARTITION BY RANGE (created_at))
CREATE TABLE urls_2024_01 PARTITION OF urls
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Aggregated view for analytics queries (point these at read replicas)
CREATE VIEW url_analytics AS
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS urls_created,
    SUM(click_count) AS total_clicks
FROM urls
GROUP BY DATE(created_at);

Disaster Recovery

Backup Strategy

- Daily full database backups
- Real-time replication to secondary region
- Point-in-time recovery capability
- Regular backup restoration testing

Next Steps

This section will be expanded with:

- Advanced caching patterns
- Global distribution strategies
- Custom domain support
- API versioning strategies
- Microservices decomposition