PHPCR on Doctrine DBAL

ramblings on PHP, SQL, the web, politics, ultimate frisbee and what else is on in my life

link by Lukas Kahwe Smith @ 2012-08-06 08:08 CEST

So I have noticed that people don't like it when I talk about all the cool stuff Jackrabbit can do. Many people are still scared of running Java stuff in production which I guess is to be expected since PHP shops tend to .. guess what .. PHP. So in this post I just want to talk about all the cool features we have ready to use in the pure PHP Doctrine DBAL based implementation of PHPCR. Just to say it again: PHP, no Java. So first up the implementation with all its features works with MySQL, PostgreSQL and SQLite. Given that we started with MySQL we ended up relying on few specific MySQL behaviors. These are all gone now, so adding another RDBMS is likely just a half days work, maybe a day if you look at the code base for the first time, then again the relevant code to edit are just a few places in two classes (Jackalope\Transport\DoctrineDBAL\Client and Jackalope\Transport\DoctrineDBAL\Query\QOMWalker). At any rate the implementation essentially gives you a tree document store on top of an RDBMS with support for references and so called node types allowing you to optionally add constraints to your document structure. Documents can have any number of properties which can choose from a wide set of datatypes like string, integer, date, url and binary. You also get both an OO and string based query API.

Lets briefly step back and discuss why the features I just mentioned are even defined in PHPCR. First up I think the fact that this API gives you native support for tree structures is very important. If all you are dealing with is a couple of pages in a flat structure then likely your needs are so trivial that you likely don't need a CMS to begin with though you might still choose to use a CMS due to expected future needs. But if you get to a half a dozen or more pages, quickly you will want to structure the pages into a tree structure. Furthermore a page itself might even be constructed of a tree structure of elements. Now you might be thinking great, but I want to reuse content pieces all over the site without having to duplicate the content. For this purpose PHPCR supports the concept of WEAKREFERENCEs and REFERENCEs. The difference is that the later also checks for referential integrity. So you can simple references documents to reuse documents inside your tree structure. This is supported in the implementation by all of the above listed RDBMS.

Now the next big feature is node types. In the opening paragraph I called this solution a "document store". I intentionally didn't call it a NoSQL database because while there is a quite a hype around them, followed by quite a bit of backlash. But what the term implies is more or less irrelevant to what people associate with NoSQL database. If CouchDB uses SQL or not isn't what makes it unique from an RDBMS, instead the fact that it doesn't have the concept of a fixed number of columns per table into which every document needs to fit in is what differentiates it. MongoDB came up with their own query syntax, but it could just as well been SQL. Well actually it gets worse, the departure from tables and columns is then also coined "schema less", which is even more confusing. MongoDB has the concept of collections, that is a schema. Most other "NoSQL" solutions do not require this, so at most one could say that they have no explicit schema, but they certainly have an implicit schema, ie. the structure of their real world content defines a de-facto schema. But actually many solutions even provide ways to define an explicit schema. For example in CouchDB you can add validation rules and you can define views etc. In PHPCR this feature is provided via node types. This optional feature can be used to prevent the content structure to get out of control. It can also be used to make it possible for tools introspecting the data structure to be able to rely on the availability of certain properties.

Lets talk about how to use this in practical terms. Your CMS might define a "page" node type. As children you might only allow nodes of type "widget" or type "media". You might define additional node types that extend from "media" like "image", "video" etc. This way people could use a generic browser on top of their content repository but could still be prevented from breaking the application by setting up a bogus content structure. Furthermore the type "media" would also define that certain properties must be set like "mime type", "creation time" etc. It is now also possible to write queries that filter the entire data structure for documents of type "media", or type "video" etc. As any resulting document will have those required properties the results can now also be used when listing the results. There are also predefined node types, for example to model a file system called nt:folder and nt:file. A structure using these node types could be exposed as a file system for example via WebDav. How about allowing your graphic designers to modify images without even touch a web browser? This would be one example of a "generic browser" I mentioned above in case you were wondering. On top of this you can define additional rules via mix-ins that can be assigned and removed from an existing node which also support inheritance.

Speaking of queries, above I talked about being able to create a tree structure of documents with references. This tree structure can be traversed by simply fetching a document by its path and then following down or up the tree structure or by following references or referrers. This is a very convenient want fast way to get content out of the repository. This is what I call the node traversal API. It only has very limited filtering capabilities. For example when iterating (of course we are using PHP iterators here) over children you can use a glob like syntax to limit the children. When getting the list of referrers you can limit things by just the WEAKREFERENCES or just the REFERENCES.

However when I speak of queries what I really mean is what I call the full text search API. I guess full text search API is also not accurate, since it also supports equality, greater/smaller filtering. It also supports operations on the tree structure like filtering out anything that isn't a decedent of a specific node. In theory the API would also support JOINs but we haven't implement that yet. Actually we haven't implemented full text search yet either. My experience with native full text search has been a long journey of disappointment. Its also very hard to implement a generic solution. As a result we are planning to implement a custom inverted index solution. As far as I know Drupal currently uses a similar approach. But with only a bit of code anyone also implement a solution that uses their chosen RDBMS native full text search.

This brings me to the biggest reason why PHPCR is a game changer. Once you have a solid API that covers all common use cases of managing content, you can start messing with the implementation without worry of breaking code on top of it. For example one of the things any CMS will want to do is leverage some cache to reduce database lookups, like if a certain page exits or not. This is of course implemented in CMS solutions like Drupal etc. However to improve performance a developer might choose to side step the standard API and write a custom SQL query. Once you do that your module will no longer work with the recently added support for MongoDB. Worse yet you can seriously screw up the cache unless you manually ensure to keep the cache in sync with your changes. Now with PHPCR instead of side stepping the API all you have to do is to adjust the implementation underneath so that any layer that is placed on top of PHPCR isn't side stepped.

So for example if you want to use Sphinx or Zend_Search_Lucene for search then simply adjust the PHPCR implementation to use that when executing queries. In the same way you can address other performance or software setup needs. Let that sink in for a second. Ever suffered the EAV misery that is Magento? EAV is definitely a viable approach to dealing with unstructured content in an RDBMS, but it comes with severe limitations that mean it cannot be used in many cases. If Magento's content storage API would have been compatible with PHPCR all you would need to do in this case is replace their implementation with another PHPCR implementation that would use an alternative approach. Heck you could even go and port Magento to document store of your choice without having to touch any of the Magento business logic. All it would require is an adjustment to the configuration and using the standard PHPCR XML based export/import format to move data from one implementation to another. As an aside PHPCR also supports UUID's to help in moving data between production, staging and development machines, another major issue in Magento. Alas Magento isn't using PHPCR and so switching Magento away from EAV is a pipe dream unless you essentially rewrite a large chunk of the code and break compatibility in the process. This is obviously a pitty because in the grand scheme of things Magento is among the PHP shop solutions that sucks the least. I mean that as a compliment. At Liip we use Magento quite a bit.

So to summarize all the things I mentioned above, minus full text support (which cryptocompress is working on atm) and WebDav support, are available today with PHPCR on top of Doctrine DBAL. Once we get full text support the main thing left from PHPCR missing is versioning support. For those that prefer working with plain PHP objects rather than with node instances, there is PHPCR ODM that provides developers an API similar to Doctrine ORM but of course with integration of all the above noted features. On top of also provides native i18n support. But that might be the topic of a future blog post. On a final note, if you want to learn more about the philosophy behind PHPCR stay tuned to future posts or take the short cut and just read chapter three of the JCR specification. I found it to be a surprisingly easy read. But again if you want to stay out of the Java world, just ask your questions here and you will get an answer :)

Comments

Re: PHPCR on Doctrine DBAL

link by Jarvis @ 2012-08-09 17:08 CEST

Link directly to an online version of JSR 283 (Content Repository) Chapter 3.

The links that you have to the JCR specification don't work. Even searching on the JCR site they link to the same non-working link. I have reported both your link and their own link to them.

Re: PHPCR on Doctrine DBAL

link by donnut @ 2012-08-10 12:08 CEST

Very convincing post. I was in doubt what dbs to use for a project, but I think this is the way to go! Any idea of the performance penalty?

Thanks

Re: PHPCR on Doctrine DBAL

link by Lukas Kahwe Smith @ 2012-08-13 14:08 CEST

I have not doen benchmarks yet. In general for a lookup by path PHPCR on Doctrine DBAL should perform very well since its an index supported lookup in a single table. All that then needs to be done is to read the xml from a single column. There is no need for a join or other operations. This architecture should also make it easy to stick caching in front that is automatically invalidated on data changes.

Where we still have severe performance problems is for any search queries as these currently rely on XPath filtering inside the SQL queries for which none of the supported RDBMS support indexes properly except for maybe PostgreSQL. For this we intended to implement the above mentioned secondary indexes.

Re: PHPCR on Doctrine DBAL

link by mz @ 2012-08-18 22:08 CEST

"Many people are still scared of running Java stuff in production which I guess is to be expected since PHP shops tend to .. guess what .. PHP.".

The thing is if you're going to use Java for your project, then why do you need PHP at all? Get spring framework and it'll
solve all your problems better than any PHP framework.
For me, it still looks like someone's taking Ferrari's engine and setting it up on a bicycle. Cool but akward and pointless.

Re: PHPCR on Doctrine DBAL

link by Lukas Kahwe Smith @ 2012-08-20 08:08 CEST

Well your argument is a bit like saying because MySQL is written in C/C++ all apps using it should best be written in C/C++. But sure, Spring seems to be a fairly decent framework. Then again, I have been training 2 teams in Symfony2 that used Spring before and both picked up Symfony2 really quickly and appreciated the higher speed of development .. even though IDE's for PHP/Symfony2 are not as mature as for Spring.

But there might even be other reasons to use PHP instead of Java. The obvious one is not having Java expertise. More practical is that often times people want to interface with other PHP apps and SDKs. Finally even if you have a server running Java for the content repository you might prefer the shared nothing architecture on PHP on your frontends.

planets

links

buttons