
Database schema evolution

I gave a talk on this a while ago and I thought I should revisit the topic once more. I still do not have a final solution, but some things have become clearer in my head. At least I feel it would be a good idea to write my thoughts down, in an attempt to see what level of clarity I have attained on this topic. With this disclaimer out of the way, let's dive into the topic.

Introduction

There are two different approaches, which from now on I will call the "migrations" approach (I chose the name "migrations" since it's essentially what Rails migrations are all about, even though Rails did not necessarily invent the approach) and the "MDB2_Schema" approach. The migrations approach boils down to this: instead of applying your schema changes directly to your database, you write a script that makes the necessary changes (optionally with another script to undo the given change). The MDB2_Schema approach, on the other hand, is basically about defining your schema and using some smart logic to infer the necessary changes by comparing two versions of that definition. So let's get into the details of the pros and cons.

Migrations

Advantages

With the migrations approach you get all the flexibility in the world. You can mix and match DDL and DML. You can choose to write portable SQL (potentially leveraging the API of a portability layer) or very RDBMS-specific SQL. Best of all, since you write your scripts in a specific order, you are guaranteed that applying the changes in the same order on the deployment system will handle all dependencies (like having to create a parent table before pointing to it via a foreign key from another table).
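To make this a bit more concrete, here is a rough sketch of what such a change script could look like in PHP on top of MDB2. The up()/down() structure and the file naming are made-up conventions for illustration, not any particular tool's API; the MDB2 calls themselves are the ones you would expect:

    <?php
    // 003_add_user_email.php -- hypothetical change script, version 3
    require_once 'MDB2.php';

    // up: apply the change; DDL and DML can be freely mixed,
    // and RDBMS-specific SQL (here MySQL's CONCAT) is no problem
    function up($db)
    {
        $db->exec('ALTER TABLE users ADD email VARCHAR(255)');
        $db->exec("UPDATE users SET email = CONCAT(login, '@example.com')");
    }

    // down: the matching undo script, needed to support downgrades
    function down($db)
    {
        $db->exec('ALTER TABLE users DROP COLUMN email');
    }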

Disadvantages

The drawback is that you have to write a lot of code even for common tasks, although you can ease this a little with some helper functions. Moreover, you have to execute every change script in order, one after another. This can slow down deployment and can even cause problems when undoing changes where a later version drops some data. It can be mitigated by doing a full dump at incremental stages and using those dumps for new installs, but that does not really solve the problem of undoing changes. Finally, there is all the additional work of writing the undo scripts in order to support downgrading, and those scripts are likely to get very little testing.
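Deployment then means running every outstanding script in order, typically via a small runner that remembers the last applied version. A minimal sketch, assuming a hand-created schema_version bookkeeping table and numerically prefixed SQL files (both invented conventions):

    <?php
    require_once 'MDB2.php';

    $db = MDB2::connect('mysql://user:pass@localhost/app');
    if (PEAR::isError($db)) {
        die($db->getMessage());
    }

    // one-row bookkeeping table: CREATE TABLE schema_version (version INT)
    $current = (int) $db->queryOne('SELECT version FROM schema_version');

    $files = glob('migrations/*.sql');
    sort($files); // zero-padded prefixes (001_, 002_, ...) sort into order

    foreach ($files as $file) {
        $version = (int) basename($file); // the leading digits of the filename
        if ($version <= $current) {
            continue; // already applied
        }
        // run each statement of the script, in order (naive split on ';')
        $statements = array_filter(array_map('trim', explode(';', file_get_contents($file))));
        foreach ($statements as $sql) {
            $db->exec($sql);
        }
        $db->exec('UPDATE schema_version SET version = ' . $version);
    }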

MDB2_Schema

Advantages

This approach shines in pretty much all the areas where migrations suck. Since you are just comparing schema definitions, you can jump directly from one version to another, skipping the intermediate version steps. You also do not need to write any code: you just update your schema file and you are done. As a result you obviously do not need to write downgrade scripts either. Portability is ensured since the actual SQL is generated by a portability layer.
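As an example, a trimmed-down schema file in the MDB2_Schema XML format might look roughly like this:

    <database>
     <name>app</name>
     <create>true</create>
     <table>
      <name>users</name>
      <declaration>
       <field>
        <name>id</name>
        <type>integer</type>
        <notnull>true</notnull>
       </field>
       <field>
        <name>email</name>
        <type>text</type>
        <length>255</length>
       </field>
      </declaration>
     </table>
    </database>

Bringing a database in line with it is then a single call that diffs the new definition against the previously applied one (file names here are just placeholders):

    <?php
    require_once 'MDB2/Schema.php';

    $schema = MDB2_Schema::factory('mysql://user:pass@localhost/app');

    // compares schema.xml against the last applied definition and
    // generates only the statements needed to get from one to the other
    $result = $schema->updateDatabase('schema.xml', 'schema.before.xml');
    if (PEAR::isError($result)) {
        die($result->getMessage());
    }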

Disadvantages

There is currently no way to properly handle DML. While there is support for basic DML statements, there is no way to control the ordering of statements, so you cannot ensure that your DDL and DML are executed in a specific order. The handling of things like foreign key creation will hopefully be solved as part of the Google SoC this summer. Hopefully as part of this we might also see a way to attach certain DML to run before and after specific DDL. This would probably require adding a schema-level version tag. You could then say: execute this DML (or script) before doing the alterations on table X when upgrading from version A to B. No idea yet how that would really play out.

Today the only option is to write a script that is executed before and after the automatic DDL generation. A final problem with this approach is that you are obviously limited to the DDL the MDB2_Schema format supports, though I think we should look into adding the ability to inject RDBMS-specific plain-text partial SQL statements in various places of the generated SQL.
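A sketch of that pre/post workaround, where pre.sql and post.sql are invented names for the hand-written DML (and where the db property exposing the underlying MDB2 connection is how I would wire it up, not a documented recipe):

    <?php
    require_once 'MDB2/Schema.php';

    $schema = MDB2_Schema::factory('mysql://user:pass@localhost/app');
    $db = $schema->db; // the underlying MDB2 connection

    // hand-written DML that must run before the automatic DDL diff
    // (one statement per file, to keep the sketch simple)
    $db->exec(file_get_contents('pre.sql'));

    // the automatic part: diff the old definition against the new one
    $schema->updateDatabase('schema.xml', 'schema.before.xml');

    // hand-written DML that must run after the DDL,
    // e.g. backfilling a freshly added column
    $db->exec(file_get_contents('post.sql'));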

Conclusion

Both approaches have their advantages and their disadvantages. The interesting aspect to me is that both approaches can pretty much do everything; they just require different amounts of time for different use cases in practice. So if you are writing a lot of RDBMS-specific SQL, I would probably lean towards the migrations approach. If, however, most of your schema evolution is about adding tables or columns, and not about making deep changes to existing tables that require DML, I would go with an MDB2_Schema-like approach.

Comments



Re: Database schema evolution

Great article! At my employer we use an extended version of the migrations approach, with validation and good support for branching. For production deployment we can create deployment packages that can be tested on a staging environment.
I think it is important for database migration tools to support branching and good (unit) testing. That can be quite a hassle.

Re: Database schema evolution

@Robin: Any chance you can share the source with us?

Re: Database schema evolution

@Robin:

How did you achieve branching support?
AFAIK RoR-style migrations don't play too well with branches, do they?

Re: Database schema evolution

@Lukas: I have discussed releasing the project as open source with my employer. There is a good chance that in the not-so-far future we'll make it available to the community. We've decided to wait until we consider it 'stable'.
Unfortunately there is little time available to do what would be necessary to make it a great open source project. Drop me a mail at 'r dot schuil at gmail dot com' and I'll ping you if, when and where it will be available for download.

@Martin: it gave me quite a headache to figure out how we could allow branching of the database schema. We had to, given the number of projects going on simultaneously. After a lot of puzzling, the team came up with an approach that makes heavy use of Subversion for its versioning and storage. With our previous tool we also had the problem that there was no clear procedure for deprecating old changefiles. The number of changefiles quickly grew to more than 1000, with a total size of over 300MB, slowing down our checkouts.
It's a bit too much to go into the details here now (sorry, I need to go on holiday ;)), but I'll try to find some time to write a short article here or on my personal blog that answers your question. Please drop me a mail so I can contact you.