Database - Comprehensive Interview Questions

This document provides a comprehensive list of database interview questions, ranging from fundamental concepts to advanced, real-world scenarios. It is designed to be a one-stop shop for preparing for technical interviews.


📝 I. Fundamental Concepts

1. General

Q: What is a database? Why is it necessary?

- Definition: An organized, structured collection of data.
- Necessity: Ensures data integrity, consistency, and security, and allows for concurrent access, efficient querying, and reliable backup/recovery.

Q: What is the difference between SQL and NoSQL?

| Aspect | SQL (Relational) | NoSQL (Non-relational) |
| --- | --- | --- |
| Schema | Predefined, fixed (e.g., `CREATE TABLE`) | Dynamic, flexible (e.g., JSON documents) |
| Data Model | Tables with rows and columns | Key-value, document, column-family, graph |
| Scalability | Primarily vertical (scale up: more CPU/RAM) | Primarily horizontal (scale out: more servers) |
| Consistency | Strong, ACID-compliant by default | Tunable (eventual vs. strong) |
| Joins | Native, efficient JOIN operations | Limited or no joins (client-side or `$lookup`) |
| Use Cases | Financial systems, transactional data, analytics | Big data, real-time apps, content management |
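As a minimal sketch of the schema contrast, the snippet below stores the same record as a relational row (using Python's built-in `sqlite3`) and as a JSON document; the `users` table and its columns are hypothetical:

```python
import sqlite3, json

# Relational: the schema is declared up front, and every row must fit it
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")
row = db.execute("SELECT name, email FROM users WHERE id = 1").fetchone()

# Document: the schema lives in the data itself, so fields can vary
# per record; adding "tags" here needs no migration
doc = json.dumps({"_id": 1, "name": "Ada", "email": "ada@example.com",
                  "tags": ["admin", "beta"]})
print(row, doc)
```

The flexibility of the document model is exactly what makes joins and cross-record constraints harder: nothing forces two documents to share a shape.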

2. ACID Properties

Q: What are the ACID properties? Explain each.

- Atomicity: "All or nothing." A transaction must complete fully or not at all; if one part fails, the entire transaction is rolled back.
- Consistency: A transaction must bring the database from one valid state to another, preserving all integrity constraints.
- Isolation: Concurrent transactions should not interfere with each other; the result of multiple concurrent transactions is the same as if they were executed sequentially.
- Durability: Once a transaction is committed, its changes are permanent, even in the case of a system failure.
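Atomicity is easy to demonstrate with SQLite from Python: the sketch below (hypothetical `accounts` table) attempts a two-step transfer in which the second step violates a CHECK constraint, and the whole transaction rolls back:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
db.commit()  # setup committed before the demo transaction

try:
    with db:  # one transaction: commits on success, rolls back on error
        db.execute("UPDATE accounts SET balance = balance + 80 WHERE id = 1")
        # this debit would drive account 2 to -30, violating the CHECK
        db.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the failed debit undoes the already-applied credit too

balances = dict(db.execute("SELECT id, balance FROM accounts"))
print(balances)  # both rows unchanged: {1: 100, 2: 50}
```

Note that the first UPDATE had already succeeded inside the transaction; atomicity is what guarantees it is undone along with the failing one.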

3. Normalization

Q: What is database normalization and why is it important?

- Definition: The process of organizing columns and tables in a relational database to minimize data redundancy.
- Importance: Reduces redundancy, prevents data anomalies (insert, update, and delete anomalies), and improves data integrity.

Q: Explain 1NF, 2NF, and 3NF.

- 1NF (First Normal Form): Every table cell holds an atomic value (no repeating groups or multi-valued columns).
- 2NF (Second Normal Form): Must be in 1NF, and every non-key attribute must be fully functionally dependent on the entire primary key (removes partial dependencies).
- 3NF (Third Normal Form): Must be in 2NF, and every attribute must depend only on the primary key, not on other non-key attributes (removes transitive dependencies).
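A small 3NF sketch (hypothetical `customers`/`orders` schema, shown via `sqlite3`): storing a customer's city on every order row would be a transitive dependency (city depends on the customer, not on the order key), so the normalized design keeps it in one place:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# 3NF: every non-key column depends only on its own table's key.
# The city lives once per customer, never per order.
db.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date TEXT NOT NULL
);
""")
db.execute("INSERT INTO customers VALUES (1, 'Ada', 'London')")
db.executemany("INSERT INTO orders VALUES (?, 1, ?)",
               [(10, '2024-01-01'), (11, '2024-02-01')])

# One UPDATE fixes the city everywhere: no update anomaly
db.execute("UPDATE customers SET city = 'Paris' WHERE customer_id = 1")
rows = db.execute("""SELECT o.order_id, c.city
                     FROM orders o JOIN customers c USING (customer_id)
                     ORDER BY o.order_id""").fetchall()
print(rows)  # [(10, 'Paris'), (11, 'Paris')]
```

In the denormalized version, the same change would require updating every order row for that customer and risk leaving some stale.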


🔧 II. Core Concepts & Internals

1. Indexing

Q: What is an index and how does it work?

- Definition: A data structure (commonly a B-Tree) that improves the speed of data retrieval operations on a database table.
- How it works: It maintains a sorted copy of one or more columns with pointers back to the original table rows, which lets the database find matching data much faster than scanning the entire table (a full table scan).
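The effect is visible in SQLite's query plans. In this sketch (hypothetical `events` table), the same query goes from a full-table scan to an index search once the index exists:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
db.executemany("INSERT INTO events (user_id, payload) VALUES (?, 'x')",
               [(i % 100,) for i in range(1000)])

query = "SELECT * FROM events WHERE user_id = 42"
before = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()

db.execute("CREATE INDEX idx_events_user ON events(user_id)")
after = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][-1])  # a SCAN: every row is examined
print(after[0][-1])   # a SEARCH ... USING INDEX idx_events_user
```

The last element of each plan row is SQLite's human-readable description of the access path; the exact wording varies slightly between SQLite versions.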

Q: What is the difference between a Clustered and a Non-Clustered Index?

| Feature | Clustered Index | Non-Clustered Index |
| --- | --- | --- |
| Physical Order | Determines the physical order of data in the table. | Stored in a structure separate from the data rows. |
| Quantity | Only one per table. | Many per table. |
| Structure | The leaf nodes of the index are the data rows themselves. | The leaf nodes contain pointers to the data rows. |
| Performance | Faster for range queries and lookups on the clustered key. | Speeds up queries on other columns, but may need an extra lookup to fetch the row unless the index covers the query. |

Q: What is a Covering Index?

- A covering index is a non-clustered index that includes all the columns required to satisfy a query (both in the WHERE clause and the SELECT list).
- It is extremely efficient because the database can answer the query by reading only the index, without ever touching the actual table data (an "index-only scan").
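SQLite reports this case explicitly in its query plan. In the sketch below (hypothetical `orders` table), the query touches only `customer_id` and `total`, both of which are in the index, so the plan says "COVERING INDEX":

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, notes TEXT)""")
db.execute("CREATE INDEX idx_cust_total ON orders(customer_id, total)")

# Every column the query needs lives in idx_cust_total, so the
# table itself is never read
plan = db.execute("""EXPLAIN QUERY PLAN
    SELECT total FROM orders WHERE customer_id = 7""").fetchall()
print(plan[0][-1])  # a SEARCH ... USING COVERING INDEX idx_cust_total
```

Adding `notes` to the SELECT list would break the covering property and force a lookup back into the table for each matching row.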

Q: When should you not use an index?

- On tables with very high write (INSERT/UPDATE/DELETE) and low read frequency.
- On very small tables where a full table scan is already fast.
- On columns with low cardinality (few unique values, e.g., a gender column).

2. Transactions & Concurrency

Q: What are the common concurrency problems? Explain them.

- Dirty Read: A transaction reads data that another transaction has modified but not yet committed.
- Non-Repeatable Read: A transaction reads the same row twice and gets different values because another transaction modified the row between the two reads.
- Phantom Read: A transaction runs the same query twice and gets a different set of rows because another transaction inserted or deleted rows matching the query's WHERE clause.

Q: How do isolation levels prevent these problems?

| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
| --- | --- | --- | --- |
| Read Uncommitted | Allowed | Allowed | Allowed |
| Read Committed | Prevented | Allowed | Allowed |
| Repeatable Read | Prevented | Prevented | Allowed |
| Serializable | Prevented | Prevented | Prevented |
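Both a non-repeatable read and its prevention can be demonstrated with two SQLite connections in WAL mode (a sketch, not how the SQL-standard levels are configured in a server database): autocommit reads each form their own tiny transaction, so a concurrent write shows through, while an explicit read transaction pins a snapshot:

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
c1 = sqlite3.connect(path, isolation_level=None)  # autocommit mode
c1.execute("PRAGMA journal_mode=WAL")  # WAL: readers get snapshots
c1.execute("CREATE TABLE acct (id INTEGER PRIMARY KEY, bal INTEGER)")
c1.execute("INSERT INTO acct VALUES (1, 100)")
c2 = sqlite3.connect(path, isolation_level=None)

# Non-repeatable read: each SELECT is its own transaction
r1 = c1.execute("SELECT bal FROM acct WHERE id = 1").fetchone()[0]
c2.execute("UPDATE acct SET bal = 200 WHERE id = 1")  # commits in between
r2 = c1.execute("SELECT bal FROM acct WHERE id = 1").fetchone()[0]
print(r1, r2)  # 100 200: the value changed under us

# Repeatable read: an open transaction reads from one snapshot
c1.execute("BEGIN")
s1 = c1.execute("SELECT bal FROM acct WHERE id = 1").fetchone()[0]
c2.execute("UPDATE acct SET bal = 300 WHERE id = 1")  # writer not blocked
s2 = c1.execute("SELECT bal FROM acct WHERE id = 1").fetchone()[0]
c1.execute("COMMIT")
print(s1, s2)  # 200 200: the snapshot shields the second read
c1.close(); c2.close()
```

Server databases expose the same idea through `SET TRANSACTION ISOLATION LEVEL ...` rather than journal modes.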

3. Database Internals

Q: Explain the difference between InnoDB and MyISAM storage engines in MySQL.

| Feature | InnoDB | MyISAM |
| --- | --- | --- |
| Transactions | Yes (ACID-compliant) | No |
| Locking | Row-level locking | Table-level locking |
| Foreign Keys | Yes | No |
| Crash Recovery | Yes (via transaction log) | No |
| Use Case | General-purpose, transactional (OLTP) | Read-heavy, simple workloads (e.g., data warehousing) |

Q: How does a query optimizer work? What is an EXPLAIN plan?

- Query Optimizer: The component of the RDBMS that determines the most efficient way to execute a SQL query. It weighs factors such as available indexes, table statistics, and join strategies to produce an execution plan.
- EXPLAIN Plan: The output of the query optimizer. It shows the steps the database will take to execute the query, such as which indexes it will use, the order of joins, and the estimated number of rows to be examined. It is the primary tool for diagnosing and optimizing slow queries.
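Reading a plan for a join shows the optimizer's choices directly. In this sketch (hypothetical `customers`/`orders` schema), SQLite picks the primary key for the filtered table and the secondary index for the correlated lookup:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
plan = db.execute("""EXPLAIN QUERY PLAN
    SELECT c.name, o.total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 1""").fetchall()
for row in plan:
    print(row[-1])  # one line per table access, in chosen join order
```

Each line names the table, the access method (SCAN vs. SEARCH), and the index used, which is exactly the information needed to diagnose a slow join.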

Q: What is MVCC (Multi-Version Concurrency Control)?

- Definition: A concurrency control method used by databases such as PostgreSQL and MySQL (InnoDB) to avoid locking issues.
- How it works: When a row is updated, the database doesn't overwrite the old data. Instead, it creates a new version of the row, and each transaction reads from a "snapshot" of the database taken at a certain point in time.
- Benefit: Readers don't block writers, and writers don't block readers. This significantly improves concurrency compared to traditional read/write locking.
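A toy version of the idea (nothing like a real engine's implementation, just the visibility rule): every write appends a new version stamped with its transaction id, and a reader sees the newest version at or before its own snapshot:

```python
class MVCCStore:
    """Toy MVCC: versions are kept per key, newest last."""

    def __init__(self):
        self.versions = {}  # key -> list of (txn_id, value)

    def write(self, txn_id, key, value):
        # A write never overwrites: it appends a new version
        self.versions.setdefault(key, []).append((txn_id, value))

    def read(self, snapshot_txn_id, key):
        # Visible = committed at or before this reader's snapshot
        visible = [v for t, v in self.versions.get(key, [])
                   if t <= snapshot_txn_id]
        return visible[-1] if visible else None

store = MVCCStore()
store.write(txn_id=1, key="balance", value=100)
store.write(txn_id=5, key="balance", value=80)   # a later writer
print(store.read(3, "balance"))  # 100: the old snapshot is undisturbed
print(store.read(9, "balance"))  # 80: a newer snapshot sees the update
```

Real engines add commit status, garbage collection of dead versions (e.g., PostgreSQL's VACUUM), and per-row metadata, but the visibility check is the core of why readers and writers don't block each other.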


🚀 III. Advanced Topics & System Design

1. Scaling Strategies

Q: What is the difference between Vertical and Horizontal Scaling?

- Vertical Scaling (Scale Up): Increasing the resources of a single server (e.g., adding more CPU, RAM, or faster SSDs). It is simple but has physical limits and can be expensive.
- Horizontal Scaling (Scale Out): Adding more servers to a pool of resources. It is more complex but offers better fault tolerance and can scale almost indefinitely.

Q: What is database partitioning? Explain different strategies.

- Definition: Splitting a large table into smaller, more manageable pieces (partitions) that the SQL query layer still treats as a single table.
- Strategies:
  - Range Partitioning: Partitions are based on a range of values (e.g., by date).
  - List Partitioning: Partitions are based on a list of discrete values (e.g., by region or country).
  - Hash Partitioning: A hash function applied to a column determines which partition a row belongs to, ensuring an even data distribution.

Q: How does partitioning differ from sharding?

- Partitioning is a feature within a single database instance: the database is aware of the partitions and routes queries automatically.
- Sharding splits data across multiple, independent database instances: the application logic is typically responsible for knowing which shard holds which data. Sharding is a form of horizontal scaling.
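The "application routes the query" part of sharding can be sketched as follows, with independent in-memory SQLite connections standing in for separate database servers:

```python
import sqlite3, zlib

# Each shard is an independent instance with its own copy of the schema
SHARDS = [sqlite3.connect(":memory:") for _ in range(3)]
for shard in SHARDS:
    shard.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def shard_for(user_id):
    # The application, not the database, decides which instance to hit
    return SHARDS[zlib.crc32(str(user_id).encode()) % len(SHARDS)]

shard_for(17).execute("INSERT INTO users VALUES (17, 'Ada')")
row = shard_for(17).execute("SELECT name FROM users WHERE id = 17").fetchone()
print(row)  # ('Ada',)
```

Note what this costs: a query that doesn't include the shard key (`user_id`) would have to fan out to every shard, which is why choosing the shard key is the central design decision.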

2. Database Comparisons

Q: When would you choose PostgreSQL over MySQL?

- PostgreSQL:
  - When you need to handle complex queries and more advanced data types (e.g., JSONB, or PostGIS for geospatial data).
  - When you require stricter SQL compliance and extensibility (custom functions and types).
  - For data warehousing and analytical workloads where complex joins are common.
- MySQL:
  - When you prioritize simplicity, ease of use, and a large community.
  - For read-heavy web applications where performance and replication are key.
  - When you are building on top of a popular open-source stack like LAMP.

Q: Compare the consistency models of MongoDB and a traditional SQL database.

- SQL Database: Strong consistency by default (ACID). A read is guaranteed to see the result of the most recent committed write, which can come at the cost of higher latency or lower availability in distributed systems.
- MongoDB: Tunable consistency.
  - A single-server deployment provides strong consistency.
  - In a replica set, reads from secondary nodes are eventually consistent by default, but strong consistency can be achieved by directing reads to the primary. This flexibility lets developers trade consistency for lower latency and higher availability based on the application's needs.

3. System Design Scenarios

Q: How would you design the database for a social media feed?

- Considerations: Read-heavy workload, real-time updates, personalized content.
- Approach: a hybrid of SQL and NoSQL.
  - SQL (e.g., PostgreSQL) for user data, relationships (follows), and post metadata (user_id, timestamp). This maintains data integrity.
  - NoSQL (e.g., Redis) for the feed itself: when a user posts, a background job pushes the post ID onto the Redis lists (feeds) of all their followers.
  - Fan-out on write: this approach pre-computes the feeds, so when a user opens the app you just read a simple list from Redis, which is extremely fast.
- Trade-offs: This uses more storage but provides very fast reads. For celebrities with millions of followers, a "fan-out on read" approach might be used instead.
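Fan-out on write fits in a few lines; here plain Python structures stand in for the follow graph (SQL) and the per-user feed lists (Redis), and the usernames are invented:

```python
from collections import defaultdict, deque

followers = {"alice": ["bob", "carol"]}          # who follows alice
feeds = defaultdict(lambda: deque(maxlen=100))   # capped per-user feeds

def publish(author, post_id):
    # Fan-out on write: push the new post into every follower's feed now,
    # so reading a feed later is a single cheap list fetch
    for user in followers.get(author, []):
        feeds[user].appendleft(post_id)

publish("alice", "post-1")
publish("alice", "post-2")
print(list(feeds["bob"]))  # ['post-2', 'post-1']: newest first
```

The `maxlen` cap mirrors the common practice of storing only the most recent N feed entries per user and rebuilding older history from the SQL side on demand.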

Q: How would you handle a "hotspot" in a sharded database?

- Definition: A hotspot is a single shard that receives a disproportionately high share of the traffic, creating a bottleneck.
- Solutions:
  - Re-sharding: Change the sharding key or the sharding strategy. For example, if sharding by customer_id and one customer is generating all the traffic, you might need to shard by a different key like order_id.
  - Isolate the hotspot: Move the "hot" data to its own dedicated shard, or even its own dedicated database server.
  - Application-level caching: Heavily cache the data that is causing the hotspot to reduce the load on the database.
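The caching option can be sketched as a toy read-through cache (the dict `DB` stands in for the hot shard, and the counter shows how much load the cache absorbs):

```python
db_reads = {"count": 0}                     # load actually reaching the shard
DB = {"hot-customer": {"name": "MegaCorp"}} # stand-in for the hot shard
cache = {}

def get(key):
    # Read-through: serve from cache, fall back to the shard on a miss
    if key in cache:
        return cache[key]
    db_reads["count"] += 1  # only cache misses touch the database
    cache[key] = DB[key]
    return cache[key]

for _ in range(1000):
    get("hot-customer")
print(db_reads["count"])  # 1: the other 999 reads never hit the shard
```

The hard part a real deployment adds is invalidation: the cache entry must be dropped or refreshed when the underlying row changes, or readers will see stale data.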


This guide covers a wide range of topics essential for database interviews. Good luck!