The other day a discussion in #mysql on freenode developed concering possible issues arising from using auto increment on primary key fields due to the danger of hotspots. Since I am currently preparing a new talk first to be held at php|works on SQL performance tuning I thought that the topic would make for a perfect warming up post on the general topic.
I guess before diving into an actual discussion of the topic we first need to define what a hotspot in RDBMS lingo is. "SQL Performance Tuning" defines it as follows:
"A page in either the index or data file that every job wants to access at the same time".
Now this sentence tells us that hotspots are concerned with concurrency issues. So this indicates to us that the entire topic is probably not much of an issue when we have few users accessing the database at the same time. As a matter of fact the entire issue will likely matter to the select few of you who have a database that is getting pounded quite severely.
Defining a page
Since not every reader will know exact that a page is either here is a definiton taken from the same above mentioned book:
"A fixed-size hopper that stores rows of data or index keys; a minimal unit for disk I/O, Depending on the DBMS, a page is also called a data block, a block, a blocking unit, a control interval, or a fow group."
As a matter of fact what some people see as a hotspot, others will see as a cache hit. Something most people would agree is a good thing (tm)! However trying to playing with the same page. The key in the above definiton of page lies in the plural "rows". If a page contains multiple rows and its the smallest unit of I/O it also means that in terms of I/O this is the most finely grained thing we can get a lock on (notice I said "in terms of I/O" becaue obviously DBMS can do stuff like row level locking). So if one process gets an exclusive lock on a page other processes who are also trying to get an upate or exclusive lock will have to wait!
Enter: clustered keys
A common such scenario is an INSERT statement. Usually the new row will be inserted on the last page. Multiple INSERT's are therefore likely to collide. There is an exception to this likelyhood and that is a clustered index. Interestingly enough MySQL puts a clustered index on primary key fields. However before we proceed and keeping in the theme of defining not so common RDBMS lingo, the following defines a clustered index:
"An index that the DBMS uses to determine the order of data rows, according to values in one or more columns, called the cluster key."
Note: Since MySQL usually rebuilds the entire table when an ALTER TABLE is done, MySQL will also reorder the table if modifications are made on the primary key.
Clustered keys are not our saviour
So if a clustered index is used, its not readily obvious where new data is stored. It depends entire on the values used for the cluster key of the to be inserted row. However when we are using autoincrement on our primary key column we generate sequential integers. Suddenly it becomes obvious once again where data is stored: on the last page! This sends us back to the situation where concurrent INSERTs will get slowed down due to locking.
But clustered keys are not our enemy either
Before we proceed I would like to briefly explain the rational of making primary keys clustered keys. The advantage of using a clustered key come into play if we are likely to perform any kind of range searches on the columns inside the clusteed key, like we do when ORDER BY or GROUP BY. The assumption that MySQL makes is that this kind of stuff is quite likely to happen on a primary key. Other DBMS make this behaviour optional like PostGreSQL for example. I am sure eventually MySQL will get this ability as well.
Dispelling the myth that unique identifiers are sequential by nature
Ok lets get back to the topic at hand: Having lots of INSERTs on a table with a autoincrement primary key column might lead to hotspots. Now this would be a very depressing blog post if I would not atleast mention a few ways out of this situation. Unfortunately though to some extend the only solution is to battle one of the fundamental advantages of a clustered key: the fact that things get neatly ordered.
The solution usually consists of making the generated unique identifiers non sequential. Yes its ok to do so although my good friend Arjen has given in to the swarms of users who insist that identifiers should be generated sequentially. However this does not defeat the purpose of a clustered key entirely, because even doing range searches does not imply that you always want the data returned in the order that they were inserted. In other words depending on the scenario we may still benefit from the increase performance of range searches, even if we do not get the side effect that data is ordered by the time it was inserted.
The bad "solution"
So how do we get non sequential unique ids? One method would be to use a global id generator. This means we use a single table with an autoincrement field which generates the unique identifiers for all tables. This solution is far from ideal though. Since MySQL does not have true sequences yet we will only get this data in a rather clumsy way (updating the table and fetching the value). Even worse it will only really generate somewhat non squential identifiers if we insert data into lots of different tables before inserting into the same table again. However since the DBMS might internally find ways to handle things in parallel, we might still end up with page locks. So as a matter of fact lets immediatly forget this solution.
However keeping with the idea of a global generator we could instead use something a bit more random, yet still unique. For example we could use several pieces of information like the current date and time along with some other data. This does not gurantee a truely unique identifier however and since we are talking heavy traffic here we might run into issues. Making the giving idenfifier wide enough should reduce the risk though.
A nifty solution
A pretty nifty trick is to store the sequentially generated identifer in reverse. This however does not work quite as cleanly as one could hope in MySQL due to the lack of true sequences again. So in this case we would once again have to emulate sequences with a table containing a single autoincrementing column. Oracle on the other hand provides the ability to return sequences in reverse order out of the box.
A good solution
The final solution is to simply add another piece of information into the identifier. This could be a piece of information that would be stored into the row like a username in a user table. However since everybody knows its not a good idea to change identifiers since that messes with referntial integrity and screws with our indexes. So in case this is done the field should be duplicated. The duplicate should only be used inside the primary key and should remain fixed.
Puh, this ended up being a much longer blog post than I expected. I guess I will just post it and read through it tomorrow to spot if I made any mistakes or missed points that needed more details. Since I do not yet have comments enabled on my blog I would appreciate any comments by mail.