I have been quite vocal at every opportunity about how much surrogate keys suck. However, I must admit that, as vocal as I may be, I have not really been good at avoiding them myself. So briefly, what are surrogate keys? They are usually those meaningless integer identifiers you see everywhere. Yes, like in the URL to this very blog post. People have come to love them, though, especially people who use tools to generate SQL for them. And general wisdom tells people that everything is faster with integers. I keep hearing the same story over and over from self-proclaimed experts in MySQL, PostgreSQL, or whatever RDBMS they are flying: index reads, joins and of course development, since most tools are optimized (or should we rather say oversimplified) towards integer surrogate keys. And I keep repeating the same complaints: that surrogate keys are usually not faster in any meaningful way, that they can sometimes actually be slower due to unnecessary joins, and that they are meaningless to end users and developers alike (always fun trying to read log files of queries full of surrogate keys).
Now I have something new for the list: surrogate keys are also more error prone in development. Today I discovered a major security (and functional) bug that went unnoticed for several weeks (fortunately only in a development version). Essentially, I was using the wrong surrogate key to do checks inside the ACL layer. Unfortunately, this wrong surrogate key made OK sense on the development machine (in testing, most surrogate keys tend to fall within a very limited range, as all generators start at 1). With natural keys, chances are this bug would have been spotted much earlier: natural keys tend to look very different for different types of data, so this incorrect use of an identifier would simply have produced no results when filtering by it. That would have been quickly spotted and fixed. With my surrogate keys it made no difference that the identifier 1 was of type A, even though I was filtering by type B, where I should also have used 1. Think about that, guys and girls, as you happily plug away at your code leveraging this awesome DBAL/ORM combo that only does surrogate keys for identifiers.
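To make the failure mode concrete, here is a minimal sketch using SQLite (the table and column names are made up for illustration, not taken from my actual ACL layer). Two unrelated entity types both have a row with id 1, so filtering by the wrong identifier still matches; with a natural key the same mistake visibly returns nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two unrelated entity types; both id sequences start at 1, as they
# typically do on a freshly seeded development database.
cur.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE acl (project_id INTEGER, permission TEXT)")
cur.execute("INSERT INTO projects VALUES (1, 'secret-project')")
cur.execute("INSERT INTO acl VALUES (1, 'read')")

user_id = 1  # BUG: this is a *user* id, not a project id

# The buggy check: filtering the ACL by the wrong surrogate key still
# matches, because user 1 and project 1 share the same meaningless value.
rows = cur.execute(
    "SELECT permission FROM acl WHERE project_id = ?", (user_id,)
).fetchall()
print(rows)  # non-empty: the bug passes silently

# The same mistake with a natural key produces zero rows, so it is caught fast.
cur.execute("CREATE TABLE acl_nat (project_name TEXT, permission TEXT)")
cur.execute("INSERT INTO acl_nat VALUES ('secret-project', 'read')")
user_name = "alice"  # same bug: wrong *kind* of identifier
rows_nat = cur.execute(
    "SELECT permission FROM acl_nat WHERE project_name = ?", (user_name,)
).fetchall()
print(rows_nat)  # empty: the filter visibly fails
```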
I think surrogate keys are a great thing. What if you use the customer number as the primary key on the customers table, and a few months later it's decided that instead of 6 characters the field has to be 7 characters? Most RDBMSs don't allow the primary key to be altered, which means that in such cases the table has to be dropped and recreated. If you use surrogate keys, no problems like that will occur.
Well, happy usage of natural keys to you. Until someone decides that the natural key should change.
To avoid problems with synthetic (surrogate) keys, you could of course set up foreign key relationships between tables.
I find synthetic keys to be a "natural" way to mark a row as unique, and something that will remain the same should the natural keys need to change throughout the life of the schema. With a schema with a large number of tables with complex relationships, I really don't see a cleaner way of maintaining integrity other than the use of surrogate keys, foreign key relationships, and transaction processing.
The only time I would use a natural key directly is when a table has more of a definitional role than a data role, for instance, in a table that defines all the states in the union, I would use the 2-letter state code as the key, and it makes perfect sense to do so. But if you are talking about a membership table, say, with millions of entries, it makes more sense to have a synthetic integer key to mark each row as unique, especially since natural keys can collide, such as the names of the members or even email addresses!
All in all, it all comes down to what makes the best sense given the circumstances.
Also, there seems to be a lot of "penny-wise, pound-foolish" arguments with regards to performance. Performance is a complicated topic and goes far beyond whether integers as keys are "faster" than natural keys on some limited benchmark. It is all too easy to blog about "X is faster than Y", but in the real world many parameters must be examined. X may actually be slightly slower on a benchmark, but may be faster in overall practice under real-world load with the appropriate tweaks to my.cnf and the like.
A good primary key needs to be unique and immutable. It's usually not hard to find a unique natural key, but in my experience it's not that easy to find one that is immutable enough. That said, I always first try to look for a good natural key that is unique and immutable, for the reasons you pointed out, but I rarely find one that I consider immutable, and then I take the surrogate key. I agree that it's a failure to always go with surrogate keys in all cases without even investigating the possibilities, since a good natural key really has advantages (you already named them several times).
The main thing that bugs me about surrogate keys is when they're put on a many-to-many table.
Id int PRIMARY KEY, FooId int, BarId int
The most pointless thing ever.
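A composite primary key over the two foreign key columns already makes each link row unique, so the extra Id column buys nothing. A quick sketch with SQLite (the foo/bar names follow the snippet above; everything else is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE bar (id INTEGER PRIMARY KEY)")

# No extra surrogate Id column: the (foo_id, bar_id) pair itself is the key,
# so duplicate links are rejected by the primary key itself, with no need
# for a separate UNIQUE index on top.
conn.execute("""
    CREATE TABLE foo_bar (
        foo_id INTEGER NOT NULL REFERENCES foo(id),
        bar_id INTEGER NOT NULL REFERENCES bar(id),
        PRIMARY KEY (foo_id, bar_id)
    )
""")
conn.execute("INSERT INTO foo (id) VALUES (1)")
conn.execute("INSERT INTO bar (id) VALUES (1)")
conn.execute("INSERT INTO foo_bar VALUES (1, 1)")

try:
    conn.execute("INSERT INTO foo_bar VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True: the composite key did the job
```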
In my experience the problem with surrogate keys is when you start putting them in code to query by. It is one thing to use them as a unique identifier for updates, but you should never see 'select * from foo where type_id = 9' or some such foolishness.
The point of being immutable falls apart when you find them in code. Any select you issue from code should be mutable. What I mean by that is that you should never see a hard-coded reference to a value that the database made up, and that could be a different number if deployed again to a different DB. So, my point to my development team is: never use autonumber IDs in code, and only use 'natural' keys as FKs when it makes sense (or is natural). Also, 'natural' keys need to be properly optimized: all the same length, etc.
So I agree with the state example. That is perfectly natural. But being absolutely one or the other only shows a lack of experience.
"The main thing that bugs me about surrage keys is when they're put on a many-to-many table.
The most pointless thing ever."
I can see why you arrive at that conclusion, but I don't agree. In many cases, what starts as 'just' an intersection table grows attributes of its own. Before you know it, it is an entity of its own. Business requirements change, and suddenly you find you actually need to reference a row from that table. So what to do?
Personally, I can see why natural keys are appealing, and I used to use them all the time, but gradually I found it is so much more practical to obtain and maintain a consistent design by simply and systematically always adding generated technical keys. Especially when you have composite natural keys and foreign keys, stuff gets out of control very easily when we suddenly find out that the natural key was actually too narrow or in general when we have to somehow change it.
And yes, the natural keys do get proper UNIQUE constraints. To me that is no contradiction or duplication at all. I use the technical primary key to build foreign keys. The uniqueness constraint for the natural key is simply a business rule that needs to be enforced, but in itself that is a separate issue from identifying the object. The extra performance hit of another index is not really as much of an issue as maintaining uniqueness.
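The pattern described above can be sketched in a few lines of SQLite (the members table and the email column are hypothetical examples, not from the post): the technical key identifies the row and carries the foreign keys, while a UNIQUE constraint separately enforces the business rule on the natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate primary key for identification and foreign keys; the UNIQUE
# constraint on the natural key enforces the business rule independently.
conn.execute("""
    CREATE TABLE members (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL UNIQUE
    )
""")
conn.execute("INSERT INTO members (email) VALUES ('a@example.com')")

try:
    conn.execute("INSERT INTO members (email) VALUES ('a@example.com')")
    natural_key_enforced = False
except sqlite3.IntegrityError:
    natural_key_enforced = True
print(natural_key_enforced)  # True: the duplicate natural key is rejected
```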
There is currently one case where I would consider natural keys, and that is when the underlying database uses a clustered index. If the queries are such that we can define an efficient key that benefits from the ordered row storage, then I'd be tempted to use that rather than a technical key.
This always seems to be the central point people make about why surrogate keys are better.
Whoever said that a key cannot change? Whoever said that a primary key must be immutable?
CONSTRAINT fk_foo FOREIGN KEY (bar) REFERENCES baz(bar) ON UPDATE CASCADE
I can change baz.bar all day long without any ill effects whatsoever.
So what is the real argument?
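A quick sanity check of the cascade claim, sketched with SQLite (the foo/baz/bar names follow the constraint quoted above; note SQLite only enforces foreign keys when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE baz (bar TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE foo (
        bar TEXT REFERENCES baz(bar) ON UPDATE CASCADE
    )
""")
conn.execute("INSERT INTO baz (bar) VALUES ('old-key')")
conn.execute("INSERT INTO foo (bar) VALUES ('old-key')")

# Change the referenced natural key; the update cascades to every child row.
conn.execute("UPDATE baz SET bar = 'new-key'")
child = conn.execute("SELECT bar FROM foo").fetchone()[0]
print(child)  # 'new-key'
```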
"I can change baz.bar all day long without any ill effects whatsoever."
Well, I think there are two things to consider.
1) Changing data. If you have cascading foreign key constraints, you could change the value of one of the columns in the referenced key, and the change would cascade and propagate through to all the foreign key children (and the children of the children, and so on and so forth).
The disadvantage here is performance. This cascade will take time, and may lock up the system. In fact if the system is busy, the update of the key may never get a chance to complete because it times out waiting to obtain a lock to do the rather large update.
Of course for many (not too big and not too busy) systems this won't be that much of a problem, but the fact is that this is less scalable than an immutable key.
2) Structural changes. Sometimes it turns out that the structure of the key is wrong. This may be because of a design flaw, but it is more likely to occur because of a new business requirement.
For example, you start with an inventory of DVDs for your DVD rental shop, and the key may be the DVD title. Then you grow and acquire a second shop. So now we need the key of the shop as well as the key of the DVD to identify an item in the inventory.
Of course, the tables with foreign keys to the inventory table, let's say DVD rentals, will now also need to undergo a structural change (and the children thereof, and so on and so forth).
So, again we see the, IMO, rather undesirable effect that a seemingly local change affects other parts of the system, possibly many other parts.
You could argue that a decent modelling tool will take responsibility for this extra work; I, however, prefer to argue that a technical, immutable key would have been quicker to design in the first place and would never have caused any of these problems.
(of course at the expense of a more difficult debugging process like Lukas mentioned in the original post)