Database Indexes and Transactions: What actually happens under the hood
You've used indexes. You've wrapped queries in transactions. But if someone asked you why a missing index makes your transaction lock half the table, would you have a clear answer?
Most backend engineers treat databases as black boxes. You add an index when a query is slow, you use transactions when you need atomicity, and you move on. That works until it doesn't: until your production database deadlocks at 3 AM, or a simple UPDATE takes 30 seconds because it's locking rows it shouldn't be touching.
Let's fix that. We'll go from the bottom up.
What an index actually is
Imagine a table called bookings with 10 million rows. You run:
SELECT * FROM bookings WHERE hotel_id = 42;

Without an index, MySQL does a full table scan. It reads every single row, checks if hotel_id equals 42, and keeps going until it's been through all 10 million. That's O(n). Slow, expensive, and it gets worse as the table grows.
An index is a separate data structure that keeps the values of one or more columns sorted, with pointers back to the actual rows on disk. In MySQL with InnoDB, this structure is a B+ Tree.
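You can watch the switch happen with EXPLAIN. A sketch against the bookings table above (the index name idx_hotel_id is one I'm choosing here):

```sql
-- Before the index exists, EXPLAIN reports type = ALL: a full table scan
EXPLAIN SELECT * FROM bookings WHERE hotel_id = 42;

-- Create the index
CREATE INDEX idx_hotel_id ON bookings (hotel_id);

-- Run EXPLAIN again: type = ref, key = idx_hotel_id, and the
-- estimated rows drops from ~10 million to the rows matching hotel 42
EXPLAIN SELECT * FROM bookings WHERE hotel_id = 42;
```

Checking EXPLAIN before and after is the fastest way to confirm that an index you added is actually being used.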
How the B+ Tree works
Think of it as a tree where each node holds multiple keys in sorted order. The internal nodes only exist to direct the search. The leaf nodes at the bottom contain the actual values and pointers to rows.
              [50, 100]                    ← root node
             /    |    \
      [10,30]  [60,80]  [120,150]          ← internal nodes
       / | \    / | \     / | \
     leaves with the actual values         ← leaf nodes (doubly linked)

When you search for hotel_id = 42, the engine starts at the root. 42 is less than 50, so it goes left. Then 42 is greater than 30, so it goes to the right child. It reaches the leaf node, finds the value, and follows the pointer to the real row on disk.
That's O(log n). With 10 million rows, instead of 10 million comparisons you do about 3 or 4 node reads. The difference is massive.
One important detail: the leaf nodes of a B+ Tree are linked to each other in a doubly linked list. This is what makes range queries like WHERE price BETWEEN 100 AND 200 so efficient. Once you land on the first matching leaf, you just walk the linked list to the right until you're out of range. No need to go back up the tree.
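A range query over an indexed column shows the leaf walk in EXPLAIN. A sketch, assuming a price column on bookings and an index name of my choosing:

```sql
CREATE INDEX idx_price ON bookings (price);

-- EXPLAIN reports type = range, key = idx_price: one descent to the
-- first leaf entry >= 100, then a walk along the leaf linked list
-- until the values exceed 200. No second descent needed.
EXPLAIN SELECT * FROM bookings WHERE price BETWEEN 100 AND 200;
```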
Clustered vs secondary indexes
InnoDB has two kinds of indexes, and understanding the difference matters.
The clustered index is the table itself. In InnoDB, the entire table is physically organized as a B+ Tree sorted by the primary key. The leaf nodes of the clustered index ARE the actual data rows. Every InnoDB table has exactly one clustered index. If you don't define a primary key, InnoDB creates a hidden one internally.
A secondary index is any other index you create. Its leaf nodes store the indexed column value plus the primary key. When MySQL finds what it's looking for in a secondary index, it then has to do a second lookup in the clustered index to get the full row. This is called a bookmark lookup or double lookup.
CREATE INDEX idx_hotel ON bookings(hotel_id);
-- MySQL finds hotel_id=42 in the secondary index → gets the PKs →
-- goes to the clustered index for each PK to fetch the complete row

There's one exception to this double lookup: covering indexes. If your secondary index includes all the columns that the query needs, MySQL can answer the query entirely from the index without ever touching the clustered index. This is the fastest possible read path.
CREATE INDEX idx_cover ON bookings(hotel_id, check_in, price);
SELECT check_in, price FROM bookings WHERE hotel_id = 42;
-- Resolved entirely from the secondary index. No double lookup.

The trade-offs of indexes
Reads get dramatically faster. You go from scanning the entire table to a few tree hops. Range queries benefit from the linked leaf structure. Covering indexes skip the table entirely.
But writes pay the price. Every INSERT or DELETE has to touch the clustered index AND every secondary index on the table; an UPDATE touches the clustered index plus every secondary index that covers a column it changed. That means rebalancing B+ Trees, potentially splitting nodes, and moving data around on disk. If you have 5 secondary indexes on a table, a single INSERT results in 6 write operations. More indexes also means more disk space, since each index is essentially a partial copy of your data.
The practical rule: index columns that appear frequently in WHERE clauses and JOINs, but don't go overboard. Every index you add speeds up reads and slows down writes. Find the balance for your workload.
Cardinality: not all indexes are equal
Cardinality is how many distinct values a column has. A role column with 4 possible values (admin, user, moderator, guest) has cardinality 4. An email column has cardinality close to the total number of rows.
An index is more useful when the column has high cardinality. If you index role on a table with 1 million rows, each value in the index points to roughly 250,000 rows. MySQL finds the "admin" branch in the tree quickly, but then has to process a quarter million rows. Still better than a full scan, but not very selective.
Index email instead, and each value points to exactly 1 row. That's a perfect lookup.
Whether the column uses a MySQL ENUM type or a VARCHAR that happens to have few values doesn't matter for the index. What matters is the actual cardinality. An ENUM is smaller on disk (stored internally as an integer) and prevents garbage values, but the index behavior is the same. If someone fills a VARCHAR column with thousands of distinct strings, the cardinality goes up and the index becomes more selective, which is actually good for the index but bad for your data model.
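You can see MySQL's own cardinality estimates with SHOW INDEX. A sketch, assuming a users table with indexes on both role and email:

```sql
-- Refresh the index statistics first
ANALYZE TABLE users;

-- The Cardinality column is an ESTIMATE of distinct values per index:
-- the role index will show something near 4, the email index something
-- near the total row count. Higher cardinality = more selective index.
SHOW INDEX FROM users;
```

The optimizer uses these estimates to decide whether an index is worth using at all, so stale statistics can lead to surprising query plans.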
Composite indexes and the leftmost prefix rule
You can create an index on multiple columns:
CREATE INDEX idx ON orders(user_id, status);

This builds a B+ Tree that sorts first by user_id, and within each user_id by status. This is ideal for queries that filter by user_id and optionally also by status.
But there's a critical rule: the leftmost prefix rule. The index can only be used efficiently if your query uses the columns starting from the left. A query that filters only by status without user_id cannot use this index efficiently, because status is the second column. It's like trying to look up someone in a phone book by first name without knowing their last name.
-- Can use the index (user_id, status):
WHERE user_id = 5
WHERE user_id = 5 AND status = 'accepted'
-- Cannot use it efficiently:
WHERE status = 'accepted'

Transactions: what they are and how they work
A transaction is a group of operations that execute atomically: all succeed or all fail. The classic example is a bank transfer.
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

If anything goes wrong between the two UPDATEs, the entire transaction rolls back and both accounts stay unchanged. No money disappears into the void.
InnoDB implements this with two mechanisms under the hood.
First, MVCC (Multi-Version Concurrency Control). Every row in InnoDB has hidden fields: a transaction ID (which transaction last modified it) and a pointer to the undo log (where previous versions of the row are stored). When a transaction reads data, it doesn't see the current version of the row. It sees a consistent snapshot based on when the transaction started. This is what allows reads to not block writes and vice versa.
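A two-session sketch of MVCC under the default REPEATABLE READ, using the accounts table from earlier (the balances are made up for illustration):

```sql
-- Session A
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;  -- say it returns 500

-- Session B (a separate connection), meanwhile:
UPDATE accounts SET balance = 600 WHERE id = 1;  -- autocommits

-- Session A again: still sees 500, read from its snapshot via the undo log.
-- Session B's committed write is invisible until A's transaction ends.
SELECT balance FROM accounts WHERE id = 1;
COMMIT;

-- After COMMIT the snapshot is gone; a fresh read sees 600.
SELECT balance FROM accounts WHERE id = 1;
```

Note that session B's write never blocked session A's reads, and vice versa: that's the point of MVCC.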
Second, WAL (Write-Ahead Logging). Before modifying data on disk, InnoDB writes the changes to a sequential redo log first. If the server crashes mid-COMMIT, it replays the redo log on restart to recover the state. This guarantees durability.
How locking actually works
This is the part that confuses most people. InnoDB locks at the row level, not the table level. But the details matter.
When you do UPDATE accounts SET balance = 100 WHERE id = 5, InnoDB puts an exclusive lock (X lock) on just that one row. Other transactions can read and write other rows without any issue.
But here's the key question: how does InnoDB know which rows to lock?
The answer: it doesn't plan ahead. The engine doesn't first figure out which rows match your WHERE clause and then lock them. It locks rows AS it scans, in real time.
With an index (e.g. primary key): The engine navigates the B+ Tree, going down 3 or 4 nodes to reach the leaf containing id = 5. The intermediate nodes are NOT locked with row locks. They're just navigation pointers. Only when it arrives at the actual row does it place a lock. This is instant and surgical.
With a secondary index: Say you run UPDATE bookings SET price = 200 WHERE status = 'pending' and there's an index on status. The engine navigates the secondary index to the "pending" branch, finds the matching entries (say 10 rows), and locks each one as it finds it. First match, lock. Second match, lock. The tree navigation nodes are not row-locked.
Without an index: This is where things get dangerous. Without an index, MySQL has no map. It has to do a full table scan, and InnoDB locks every row it examines, not just the ones that match. If the table has 1000 rows and only 10 are 'pending', the engine goes row by row, locking each one as it checks it. Under READ COMMITTED, locks on non-matching rows are released right after the check; under the default REPEATABLE READ, they can be held until the transaction ends. Either way, any other transaction trying to write to those rows has to wait.
A missing index doesn't just make your query slow. Inside a transaction, it makes your query lock rows it doesn't even care about, causing contention across the entire table. Indexes aren't just about read performance. They're about lock precision.
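You can watch the contention with two sessions. A sketch, assuming bookings has NO index on status and that row id = 7 has status = 'confirmed' (both assumptions for illustration):

```sql
-- Session A: no index on status, so the scan locks rows it merely examines
START TRANSACTION;
UPDATE bookings SET price = 200 WHERE status = 'pending';

-- Session B: updates a row that is NOT 'pending' and should be unrelated.
-- Under the default REPEATABLE READ, this blocks until session A commits,
-- because A's full scan locked that row anyway.
UPDATE bookings SET price = 150 WHERE id = 7;

-- Session A
COMMIT;  -- session B's UPDATE is now free to proceed
```

Add an index on status and rerun the experiment: session B goes through immediately, because session A only locks the rows it actually matched.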
Why intermediate tree nodes aren't locked
The internal nodes of a B+ Tree are just routing keys. They don't contain actual row data (no status, no balance, nothing from your business domain). They exist only to tell the engine "go left or go right." Locking them as rows would be meaningless.
InnoDB does protect the tree structure during modifications using internal latches. These are extremely brief locks (microseconds) that prevent the tree from becoming corrupted during node splits or rebalancing. But they're transparent to you and to your transactions. They're not the same as row locks, and they don't cause contention in any practical sense.
So the flow is: navigating the tree uses latches (microseconds, invisible). Reaching the actual row uses a row lock (held until COMMIT).
The lock types you should know about
InnoDB has more granular lock types than just "lock the row":
A record lock locks a specific index record. A gap lock locks the space between two index values, preventing other transactions from inserting into that gap. A next-key lock combines both: it locks the record AND the gap before it. This is the default lock type in REPEATABLE READ.
-- Table with IDs: 10, 20, 30
-- Transaction A:
SELECT * FROM t WHERE id BETWEEN 15 AND 25 FOR UPDATE;
-- InnoDB places next-key locks on:
-- gap (10, 20] and gap (20, 30]
-- This locks row id=20 AND the gaps around it
-- If Transaction B tries to INSERT id=17, it blocks

Gap locks exist to prevent phantom reads: situations where a second query in the same transaction returns rows that didn't exist during the first query.
Isolation levels in 30 seconds
READ UNCOMMITTED lets you see uncommitted data from other transactions. Almost never used.
READ COMMITTED shows only committed data, but if you read the same query twice you might get different results.
REPEATABLE READ (the MySQL default) takes a snapshot at the start of the transaction. All reads are consistent against that snapshot. Gap locks prevent phantom reads.
SERIALIZABLE is maximum consistency, minimum performance. Everything gets serialized. Shared locks on everything you read.
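You can set the level per session or per transaction. A sketch (the @@transaction_isolation variable is the MySQL 8.0 name):

```sql
-- For the rest of this session
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- For just the next transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
-- ... plain SELECTs here take shared locks ...
COMMIT;

-- Check the level currently in effect
SELECT @@transaction_isolation;
```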
How JOINs benefit from indexes
JOINs combine data from multiple tables. Without indexes, they become extremely expensive.
SELECT u.name, o.id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.role = 'admin' AND o.status = 'accepted';

MySQL executes this in stages. First it filters users where role = 'admin'. If there's an index on users.role, it finds them quickly. Then for each admin user, it looks up their orders by orders.user_id. If there's an index on orders.user_id (there will be if it's a foreign key, since InnoDB creates indexes for foreign keys automatically), it goes straight to the right rows. Then it filters by orders.status.
You can't create an index that spans two tables. Each index lives in its table. But you can create composite indexes within a single table. An index like CREATE INDEX idx ON orders(user_id, status) would be ideal here because MySQL can find the orders for a specific user_id AND filter by status in a single tree traversal.
Foreign keys, by the way, are essentially constraints plus indexes. InnoDB automatically creates an index on the foreign key column if one doesn't already exist, because it needs to verify referential integrity on every insert and delete. Without that index, those integrity checks would be full table scans.
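You can verify the automatic index with SHOW CREATE TABLE. A sketch with assumed column definitions for the orders table:

```sql
CREATE TABLE orders (
  id      BIGINT PRIMARY KEY,
  user_id BIGINT NOT NULL,
  status  VARCHAR(20),
  total   DECIMAL(10,2),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

-- The output includes a KEY on user_id that InnoDB added on its own
-- to back the foreign key constraint, even though we never asked for one.
SHOW CREATE TABLE orders;
```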
Transaction trade-offs
Keep transactions as short as possible. The longer a transaction stays open, the more locks it holds, and the more it blocks other transactions. Long transactions also grow the undo log because InnoDB has to maintain old row versions for MVCC snapshots.
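MySQL exposes every open transaction in information_schema, which makes long-runners easy to spot. A sketch:

```sql
-- Transactions that have been open for more than 60 seconds, oldest first
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 60 SECOND
ORDER BY trx_started;
```

A transaction that has sat open for minutes is usually a bug (a forgotten COMMIT, an application holding a connection mid-transaction), and it silently pins undo log history and locks the whole time.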
Deadlocks happen when Transaction A locks row 1 and waits for row 2, while Transaction B locks row 2 and waits for row 1. InnoDB detects this automatically and rolls back one of the two transactions. To minimize deadlocks, try to access rows in a consistent order across transactions.
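The consistent-order rule in practice, using the accounts example: if every transfer locks the lower account id first, two concurrent transfers between the same pair of accounts can never hold each other's rows in opposite order. A sketch:

```sql
-- Transfer 100 from account 2 to account 1.
-- Even though money flows 2 → 1, we still lock id 1 FIRST,
-- because every transaction agrees to lock in ascending id order.
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;  -- lower id first
SELECT balance FROM accounts WHERE id = 2 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 2;
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
COMMIT;
```

When a deadlock does happen anyway, SHOW ENGINE INNODB STATUS prints the two transactions involved in its LATEST DETECTED DEADLOCK section.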
Without explicit START TRANSACTION, MySQL treats each statement as its own transaction (autocommit mode). The database handles locks and WAL internally for each individual statement. Explicit transactions are for when YOU need multiple operations to be atomic at the business level.
The full picture
Indexes and transactions are not separate topics. They're deeply connected.
An index makes reads faster by turning full table scans into tree lookups. But it also makes transactions safer by allowing locks to be precise. Without an index, a transactional UPDATE can end up locking rows it doesn't care about, causing contention and deadlocks across the table.
When you design your schema, think about both dimensions. What queries will run against this table? Those need indexes for speed. What writes will happen inside transactions? Those need indexes for lock precision. The columns in your WHERE clauses, JOIN conditions, and foreign keys are where indexes matter most.
The cost is on the write side. Every index you add is another tree that needs updating on every INSERT, UPDATE, and DELETE. More indexes means slower writes and more disk. Find the balance for your workload, and let your query patterns guide you.
- Indexes are B+ Trees that turn O(n) scans into O(log n) lookups.
- The clustered index IS the table, sorted by primary key. Secondary indexes store values + PKs.
- Indexes speed up reads but slow down writes. Each index is an extra write per mutation.
- Transactions lock rows AS they scan, not before. With an index, locks are surgical. Without one, they can lock the whole table.
- Tree navigation nodes use brief internal latches, not row locks. Only the final rows get locked until COMMIT.
- Foreign keys automatically create indexes. JOINs benefit heavily from indexes on join columns.
- Composite indexes follow the leftmost prefix rule. Column order matters.
- Keep transactions short. Missing indexes in transactional queries cause widespread locking and deadlocks.