ramblings on PHP, SQL, the web, politics, ultimate frisbee and what else is on in my life
back

The ugly duckling rollback

So today I asked a question in #mysql and got no reply, so I asked the same question in #postgresql and got some interesting replies. The question was whether it's "good practice" to rely on the RDBMS to detect constraint violations, or if one should try to avoid this by running relevant SELECTs before any writes.

Now the answer I got immediately was that everybody felt it was good practice to avoid ROLLBACKs, so they all favored SELECTs. I was not so convinced initially. The SELECT approach requires that you maintain additional code that essentially duplicates the schema definition. I was also quite bothered by the idea that you might have to lock the table in order to prevent concurrent requests from interfering and causing the constraint violation to still come and bite you.

Say for example you have a table that tracks which users are friends with which other users. You create a relationship table and put a unique constraint on it so that every user-friend combination is allowed only once. Now if you have a web application that lets users store who their friends should be, you obviously cannot prevent people from submitting the same request twice. So you cannot really prevent, at the GUI level, users from trying to add someone as a friend who is in fact already their friend.

So what could the code look like:

  1. SELECT 1 FROM friends WHERE user_id = ? AND friend_id = ?
  2. IF result -> return an error message
  3. ELSE INSERT INTO friends (user_id, friend_id) VALUES (?, ?)

But obviously some concurrent request could sneak in between the SELECT and the INSERT and do the insert first (think impatient users with a trigger-happy F5 key). So either you need to LOCK the entire table or still deal with a potential constraint violation. Actually, you should always be prepared to handle some kind of error coming back.
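The insert-and-handle-the-error alternative can be sketched like this; a minimal illustration using Python's sqlite3 module (the discussion is about MySQL/PostgreSQL, so the driver and exception class here just stand in for whatever your stack provides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE friends (
    user_id INTEGER NOT NULL,
    friend_id INTEGER NOT NULL,
    UNIQUE (user_id, friend_id)
)""")

def add_friend(user_id, friend_id):
    # Attempt the INSERT directly and let the unique constraint
    # catch duplicates -- no prior SELECT, no table lock needed.
    try:
        with conn:
            conn.execute(
                "INSERT INTO friends (user_id, friend_id) VALUES (?, ?)",
                (user_id, friend_id))
        return True
    except sqlite3.IntegrityError:
        return False  # already friends

print(add_friend(1, 2))  # True
print(add_friend(1, 2))  # False: duplicate caught by the constraint
```

No race window exists here: even if two requests attempt the same insert at once, the database guarantees exactly one succeeds.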

So why would you still want to do the SELECT approach? Well, one reason is that you might think it's a good idea to do any check twice, once in the app and once in the database. Just to be sure. Another reason is that a constraint violation is just nasty; it's an ugly duckling you want to avoid. Everybody in #postgresql agreed on that point.

But what about single query transactions? What about using SAVEPOINTs before running the query that could cause a constraint violation? Suddenly the picture did not seem so clear anymore.
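The SAVEPOINT idea can be sketched as follows, again using Python's sqlite3 purely for illustration: guard only the risky INSERT with a savepoint, so a constraint violation does not force a rollback of the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions manually
conn.execute("""CREATE TABLE friends (
    user_id INTEGER, friend_id INTEGER,
    UNIQUE (user_id, friend_id))""")

conn.execute("BEGIN")
conn.execute("INSERT INTO friends VALUES (1, 2)")  # earlier work in the transaction

# Guard only the risky statement with a SAVEPOINT.
conn.execute("SAVEPOINT add_friend")
try:
    conn.execute("INSERT INTO friends VALUES (1, 2)")  # duplicate
    conn.execute("RELEASE SAVEPOINT add_friend")
except sqlite3.IntegrityError:
    # Roll back only to the savepoint; the earlier insert survives.
    conn.execute("ROLLBACK TO SAVEPOINT add_friend")
    conn.execute("RELEASE SAVEPOINT add_friend")

conn.execute("COMMIT")
count = conn.execute("SELECT COUNT(*) FROM friends").fetchone()[0]
print(count)  # 1
```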

Speaking of SAVEPOINTs, I brought up the nested transaction emulation I added to MDB2 a while back. I wonder why this kind of logic is not a standard feature of RDBMS that support SAVEPOINTs?

At any rate, since nobody cared to reply to my last post *nudge* where I asked for replies, I will just leave readers to decide for themselves what they feel is best :)

Comments



Re: The ugly duckling rollback

If I can guarantee the constraint is going to work, I'd plow straight ahead with the INSERT and handle the SQL error.

But why not combine them into one atomic SQL statement... something like

INSERT INTO friends(user_id, friend_id)
SELECT ?, ? WHERE NOT EXISTS(SELECT 1 FROM friends WHERE user_id = ? AND friend_id = ?)

And check the affected row count to see whether it succeeded or not.
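A runnable sketch of this conditional-insert pattern, using Python's sqlite3 for illustration (note that two concurrent transactions could still both pass the NOT EXISTS check, so keeping the unique constraint as a backstop is still wise):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE friends (
    user_id INTEGER, friend_id INTEGER,
    UNIQUE (user_id, friend_id))""")

SQL = """INSERT INTO friends (user_id, friend_id)
         SELECT ?, ? WHERE NOT EXISTS
             (SELECT 1 FROM friends WHERE user_id = ? AND friend_id = ?)"""

def add_friend(user_id, friend_id):
    with conn:
        cur = conn.execute(SQL, (user_id, friend_id, user_id, friend_id))
    # rowcount is 1 if the row was inserted, 0 if it already existed.
    return cur.rowcount == 1

print(add_friend(1, 2))  # True
print(add_friend(1, 2))  # False
```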

Re: The ugly duckling rollback

On MySQL, this would do the trick:

replace into friends (user_id, friend_id) values (?, ?)

Of course, the bad side of that is that a new insert will occur and the old row will be deleted, so there is IO there that is not needed. If there is an autoincrement involved, you will increment it.

Alternatively, you could use the insert ... on duplicate key update syntax. MySQL will not actually do the update in the example below because the data is the same. This will save the disk IO.

insert into friends (user_id, friend_id) values (?, ?)
on duplicate key update
user_id=values(user_id), friend_id=values(friend_id);
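MySQL's REPLACE and ON DUPLICATE KEY UPDATE are MySQL-specific; as a runnable stand-in, here is SQLite's closest analogue, INSERT OR IGNORE (PostgreSQL's equivalent would be INSERT ... ON CONFLICT DO NOTHING), shown via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE friends (
    user_id INTEGER, friend_id INTEGER,
    UNIQUE (user_id, friend_id))""")

with conn:
    conn.execute("INSERT OR IGNORE INTO friends VALUES (1, 2)")
    conn.execute("INSERT OR IGNORE INTO friends VALUES (1, 2)")  # silently skipped

count = conn.execute("SELECT COUNT(*) FROM friends").fetchone()[0]
print(count)  # 1
```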

Re: The ugly duckling rollback

I'm surprised the postgres folks recommended the SELECT first approach. It may have been just who was in the IRC channel at the time... I wonder if you'd get the same response if you posted to the postgres mailing list?

Personally, I agree with JPW in that I perform the INSERT and deal with any errors that come back. If you do a SELECT, you have to ensure isolation from other requests, as you pointed out, so in postgres this means wrapping the SELECT and INSERT inside a transaction, and you may end up having to perform an explicit ROLLBACK anyway. Secondly, your application needs to be designed to robustly handle DB layer errors no matter what, so you may as well use them to your advantage. Thirdly, as you pointed out, the SELECT first approach is duplication of information that is already in your DB schema/constraints.

I don't see any good justification to avoid triggering a constraint violation. That's the reason they exist. It's not like it hurts the database server when one occurs. And any locks or resources that have to be gathered to check the constraint are going to be the same as the ones you'd gather in your manual check anyway...

Re: The ugly duckling rollback

PostgreSQL originally was going to implement nested transactions, but there were some corner cases / intricacies that we ran across that were deemed too crazy to solve. Savepoints ended up winning out because they are SQL standard compliant, and it was determined there was really no additional functionality possible with nested transactions that couldn't also be achieved with savepoints. For those curious, check the PostgreSQL archives from around the time of the savepoint implementation (the PostgreSQL 8.0 development cycle).

Ugly Ducklings?

Another reason is that a constraint violation is just nasty; it's an ugly duckling you want to avoid.

Why are constraint violations nasty, ugly, and to be avoided? They exist to solve the very problem you describe, and in the vast majority of cases, much more efficiently than the query-first-then-insert approach.

Why the aversion to constraints?

Re: The ugly duckling rollback

I agree with willie, not sure why there is an aversion to constraints, and yet a willingness to use LOCKs!?

Surely LOCKs are evil, and should be avoided if possible.

Re: The ugly duckling rollback

I personally prefer to use an additional select because I've noticed that some databases return very general and/or language-specific error messages, so we cannot rely on them and display them as-is to the end user. If we have two unique constraints on different fields in a table and then only get "insert failed", what do we have to display to the user? So it is better to check the database first and then insert.

Re: The ugly duckling rollback

I'd go for the approach of inserting and then catching the error (you'll have to catch errors anyway, it's good style). It's faster and requires less code. If using a select first, you'd still have to deal with the issue that the user could submit the data in another transaction at the very time the insert happens, so you'd still have to catch the error...

Alex@Net pointed out that some databases return general/language-specific error messages and/or there may be multiple constraints of which only one could've failed: well, on the one hand there are error numbers (which of course can be general too, but not language-specific), and on the other you could still do the selects after the insert to determine what went wrong.

A rollback in a larger transaction couldn't be avoided either way.

Re: The ugly duckling rollback

Also several portability layers offer mapping of error codes if SQLSTATE error codes are not natively supported.