So everybody and their dog hopefully knows about SQL injection attacks these days. Most people should have also heard someone telling them that using prepared statements is the magic super fix to all of these issues. People slightly more in the know will have read that prepared statements lead to all sorts of issues. Some of which can be fixed with hacks (or eventually at the source). Some of which can only be solved of the source also exist of course. Some others can only be fixed with certain assumptions (like using the first set of data for generating the query plan), which might break the original use of prepared statements.
Original use? I guess their original intended use had nothing to do with SQL injection attacks. I can only guess since I was not doing database when they were invented. Heck maybe I was not even born back then. My guess is that prepared statements were invented back when people had long running connections and fairly static data that at most grow slowly over time. So it did not matter to them that their query plans would remain static with prepared statements. That additional round trip to the database to prepare the query probably also did not bother them, since they only did that round trip once in a few hours (or maybe even days?). Data entered the system mostly from known friendly people. Prepared statements were good and users were happy.
Well today everybody with a clue knows that data moves fast. That applications servers are a sure fire way to make your life miserable since every little stupid mistake can bring your entire happy system down, requiring hours to rebuild your caches before operation is back to normal. Shared nothing architectures rule the web world. So connections have a short life time. Even if we ever get connection pooling, we will get worried about connections going haywire in subtle ways as they are rarely created from scratch again (the same reason why application servers are a bad idea comes to haunt us again with prepared statements). So it seems like all the reasons why prepared statements were created are no longer relevant.
Oddly enough they are more popular today than a few years ago, when people were still learning the lesson that application servers suck. The reason is that contrary to back then, today most users are potentially evil, or at least stupid enough to accidentally be evil. Or whatever .. at any rate we want to protect against SQL injection or breaking queries because the user entered a single quote where he was not supposed to. And we have learned the lesson that making it easy to code defensively is the name of the game. So we began to advocate prepared statements again. And the masses flocked to them.
However we were only advocating prepared statements because it seemed like the best tool for the job. But we realized our faultiness, so we added buggy (some more than others) client side prepared statement emulation. But the database guys did not go back to the drawing board to solve the real issue: To have some way to send the SQL statement separated from the data in a single roundtrip.
So how about it? How about inventing a syntax that will do this for us? I am sure the guys on the board for SQL2008 (does anyone even know what they are doing .. I would not even know where to link to .. neither does wikipedia) have much more useless things to cook up, so I think it will be left to the more ad hoc OSS style (LIMIT, REPLACE, ON DUPLICATE KEY UPDATE, ENUM) of extending the SQL standard. Well lets hope that this time we can maybe do this together in order to ensure that we do not end up with a situation like with LIMIT where we had some tiny weeny syntax differences in the different OSS databases. So far MySQL has been the company to drive these kinds of extensions, but my hope is that maybe with the OSDB consortium (maybe one day more than "Just a web page") this could be done at a more universal level?
Perhaps the solution is to use a client library which emulates prepared statements?
PHP's PDO, Perl's DBI and JDBC all have the capability to emulate them, which reduces the server-side overhead (i.e. to none), round trips, but retains protection from SQL injection.
This is a link to a long discussion that we had when I brought this up in my journal. If you read through the discussion we found a number of good solutions along the way:
At this point I am in favor of the bytecode solution, which keeps the caching out of the server.
PHP PDO's emulated prepared statements have problems. There are differences when using :name versus the native ? syntax of MySQL, for instance. Binding a string with a \0 with :name, and the value gets truncated, whereas with ? is just fine, is one difference.
@Brian: I will need to catch up on that discussion. But from a quick skim it seemed to revolve around parsing and caching for the most part. That is not the issue that is of any concern for me atm. For now I would be entirely happy if I the protocol could be extended (I guess your very last comment already point out this need) to be able to send a data structure with the query and all the data pieces separated. This could all then be assembled on the server. SQL injection solved. No need to rely on magic parsers/emulation on the client that will never be fool proof and add overhead. It should really not be hard to implement.
Caching just the parsed query with no caching of an actual query plan should also be doable for most RDBMS without any of the nasty side effects people are suffering from when they switch to prepared statements. I guess doing this would be hard for MySQL since the parsing and optimizer stages are not cleanly separated from what I have heard.
I seem to remember in a dim & distant past one of the Microsoft APIs ( RDO iirc, precursor to ADO) allowed you to create a prepared query without having to use the database until you executed it.
I agree with the commenter just before me; are you assuming that all prepared statements work like MySQL prepared statements?
Also, there are plenty of ways to have persistent connections and connection pooling with MySQL. MySQL's own Connector/J has connection pooling!
Honestly the problem is in how MySQL deals with prepared statements and the binding of variables. Other DBMS' don't have quite the same issues with prepared statements that MySQL does. You may want to change your post to reflect that....
Actually most RDBMS have exactly these issues. Or rather each have some of the issues listed above. For example MySQL does not even cache execution plans. So thats a big issue that MySQL does not have (though a lot of old timers would probably complain that MySQL is doing false advertising here, since most people assume that this is the key point of PS). Oracle delays generation of the query plan until the first execution (but then it does cache the query plan) since 9i IIRC. PostgreSQL differentiates between named and unnamed prepared statements that each behave slightly different.
But when it all comes down to it, all usually require 2 roundtrips to the database, where in fact the statement handle will usually only be used once in shared nothing stateless web applications. This is my main concern with prepared statements. If MSSQL server already provides some API to get SQL injection protection without 2 round trips, then all the merrier (note I never used MSSQL). Also I think I made it sufficiently clear that I am not impressed with application servers and that I imagine that connection pooling and friends will bring with it similar problems.
The primary purpose of prepared statements is to separate the overhead of syntax checking from statement execution. This is useful, for example, if you execute a statement many times in a loop. It's an ancient axiom of optimization: move the loop-invariant work outside of loop.
The other benefit of PS, to defer parameters until execute time and make it easier to resist SQL Injection, is a separate issue. There's no reason why an RDBMS couldn't allow an "execute-immediate" mode while taking parameters separately. I think this is best handled on the server, not with "emulated prepare" in the client.
But let's stick to the performance aspect of PS, which is your focus.
You seem to be assuming that making two round trips instead of one is a deal-breaker for performance. You don't need two round trips to execute a prepared statement -- you need 1+N round trips, where N is the number of times you execute the statement. By parsing the statement once, eliminating this overhead from each execution, the total cost of executing the statement N times is reduced.
How large does N have to be so that PS are efficient? It depends on how the cost of a round trip compares to the cost of the syntax parsing. In the old days, when CPU resources were at a premium, parsing was a bottleneck. Now that CPU's are so powerful, it may be that parsing is cheaper, and the round-trip is now the bottleneck.
But regardless of this, the cost of the first round-trip for the prepare fades into insignificance if your loop has enough iterations. The breakpoint at which PS is more efficient may have changed since the old days, but this is not the same thing as PS becoming "no longer relevant," as you claim.
In any case, you need to back up your claims with performance measurements. You (and Brian Aker) are advocating some pretty huge architectural changes, with no quantitative evidence to support these changes.
You also lack something important that many proposals lack: a definition of success criteria. In other words, how much do you expect to improve performance with the changes you are calling for, and how will you go about measuring the improvement so you know the change achieved the expected improvement?
Bill you totally misunderstood my point. For one my argument is that not at all about performance. My point is that PS were invented to solve a performance problem, that todays web applications do not have. They instead of the problem of highly dynamic queries that need to deal with user provided input. As such the old performance tweaks tend to hurt performance into todays world. Worse yet certain limitations (ok this is mainly relevant to MySQL) even have limitations using the old tweaks aka PS.
So I am asking for a solution that actually addresses the problem people need to have solved: SQL injection. Security should be easy. This solution should work with a single round trip and no limitations in allowed SQL constructs. I am not really of the opinion that the parsing should be done in the client, but it is an option. But I do agree with your point on Brain's blog that I rather see this done in the server.
Now for performance measurements, yes I did not provide any. I think there have been enough benchmarks to illustrate that the DB round trips are what hurts performance of todays applications the most. This includes the most trivial queries. So to me you should have instead asked me about validating my claim that most prepared statements cannot be reused in a stateless platform like LAMP. Here I am just going to claim I know. But of course someone could go through the trouble of actually examining a large number of applications. Probably checking server logs could be sufficient to see how often any given query is executed for each established connection.