Why indices are good/bad
With databases, you have the option of putting an index on one or more columns, and it might not be apparent (apart from the quite inaccurate comparison with the index in a book) what that actually does.
Why indices are good
To start with, I’m going to describe the worst-case scenario, and it’s pretty simple: imagine you have a million rows which are in no particular order as far as the column you’re looking for is concerned. Even if reading each row is very quick (say 1 ms or a thousandth of a second), something is going to have to read through every single row to find the ones that match, and that adds up quickly (to a horrifying 1000 seconds in this case), so there’s a need to make that operation as fast as possible.
A simple approach to this problem would be to have a separate list of all the values (in sorted order) for the column, pointing to the corresponding row number, much like an index in a book. As this is shorter than the actual table, it’s faster to navigate (let’s say 0.1 ms per row in this case), but you do still have to read through, on average, half the rows to find your starting point, and that’s still a lot of time (500,000 x 0.1ms = 50 seconds).
To make that better, you could use an optimised seek technique, such as a binary search. To do that, you start by picking the point in the middle, work out whether that’s lower or higher than the value you’re looking for, based on that result consider the current point to be the top or bottom of your new search range, then pick the point in the middle again and continue. Since the list is sorted, that would produce a consistent result, and even if jumping back and forth was very slow (say 100 ms or a tenth of a second), it only takes a small number of iterations since you’re halving the number of items to check at each step (in this example, 20 * 100ms = 2 seconds). So that would be great for reading, and that is conceptually quite similar to how database servers read their indices.
Unfortunately, such an approach would be terrible for write performance, because adding or removing a row would on average require re-writing half the index (on the order of 50 seconds each time in this case), so in reality something a bit more complex is used: a b-tree. Instead of a straight list, what you basically get is, recursively, and index of indices, so the first entry might say at what location in the tree you’d find the index for 1-500,000 and the next entry might say where to find 500,001-1,000,000. If you followed one of those, you might find an entry saying where to find 1-250,000 and another saying where to find 250,001-500,000. This way you can find the entry you’re looking for in a fairly short amount of time, and deleting or inserting entries would only involve writing to the same number of entries as you would read, in the common case (on the order of 2 seconds in this example).
Getting the results in the right order, and gathering them together
An index can do more than speed up your search: what if you want some results sorted by a particular column? The database server can’t simply remember which order to present them in. To give you a sorted list, it actually needs to construct it by making a temporary table including all of the results found, and then sorting that. This basically means it needs to read each result row several times (for comparison with other result rows) and write each result row at least once. Depending on the number of result rows in question, this can be a very time-consuming operation.
Grouping by a particular column is a very similar problem: the database server can’t (well, can’t always) just remember the current output state, which means that it really wants the results to be sorted so that it can deal with them one output-row at a time.
So, if an index isn’t being used to generate the result list (or if it’s the same index, or if it would be less efficient to use it), the database server will use an index for sorting, simply reading through the index in order (taking an amount of time comparable to reading through the whole result set once) and potentially sending the client each row as it is found (or accumulating data and occasionally sending the client a row in the case of a GROUP BY).
Why indices are bad
Having an index isn’t a uniformly good thing, or you’d simply have them turned on for all columns all the time.
Firstly, if you have very few rows it might actually take longer to read the index than to read every row; of course, that would be a very small performance penalty in general, so not a big problem.
Secondly, more indices means more data to write each time, thus slower inserts (if every column had an index, you could expect well over double write time).
Thirdly, if the database server picks the wrong index, performance could move much closer to the time it would take with no indices (or even much more!). Consider if you’d put an index on a column which only ever has the value 1 or 0, and where 99% of the rows have the value 0 for that column. If you only generally ask for rows with the value 1, that’s great; if you ask for rows with the value 0 along with some other constraint, however, the database server will have to gather the complete list of matching rows, possibly construct a temporary table containing all of those rows (taking roughly as long as copying the entire table), and read each row looking for the other constraint involved. This is particularly a risk where you have an index on a column with little variation and have no index on another column which varies widely.
How to make indices good
There are plenty of times where a normal index just wouldn’t help that much. Consider:
SELECT b,c, COUNT(*) FROM my_table WHERE a=1 GROUP BY c,b ORDER BY b;
That’s the kind of query which (if you had an index on each column in the query) would make the database server start having to really guess which index to use – and none of them would be likely to be great. The server would want to, in order, limit the rows to consider (WHERE a=1), sort for grouping (GROUP BY c,b) and then re-sort again (ORDER BY b) – so what you’d really want is to have a sub-index for each one. That isn’t practical, but what if you had an index of each column, put together, in order?
If you had a joint index on (a,b,c), the server would be able to search for the first point where a=1, the rows are already pre-grouped which makes it very easy to gather the data for each result row, they’re already ordered by b also, and it can just read forwards in the index until a=1 is no longer true.
Query optimiser
While you’re adding indices to make your queries faster, the database server has to decide which (if any) to use. It’ll note which indices relate to the columns in the query, note how long each index key is, possibly even read part of some indices to see how many rows the result set would be narrowed down to, or how many rows would be expected in the output set. In the end, if you give it a choice of keys, it’ll usually do a good job of picking the right one, so you don’t need to worry too much about having the perfect key specification. If you want to ask MySQL how it intends to handle a query, you can use the EXPLAIN keyword, e.g.:
mysql> EXPLAIN SELECT * FROM my_table WHERE a=1;
+—-+————-+———-+——+—————+——+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+———-+——+—————+——+———+——+——+————-+
| 1 | SIMPLE | my_table | ALL | k1,k2 | NULL | NULL | NULL | 25 | Using where |
+—-+————-+———-+——+—————+——+———+——+——+————-+
1 row in set (0.00 sec)
mysql>
As a rule of thumb, anything in the “Extra” box is an indication that the query can’t be handled very efficiently, although the main thing to note is the “rows” count: if that’s in the thousands (or more!) then you probably could do with better indices.
Tags: SQL






Reseller Hosting
Feature-rich reseller hosting with unlimited domains, space and bandwidth. Enjoy full customisation and exclusive discounts with our reseller hosting package.