ramblings on PHP, SQL, the web, politics, ultimate frisbee and what else is on in my life

DeltaQuery with Solr DIH

Recently it seems like pretty much all projects here at Liip are using Solr in some form or another. Never thought that all my PHP projects would end up relying on Java. This is just a short blog post to document an approach I came up with to make delta updates more efficient and at the same time easier to maintain than the "standard" approach explain in the docs. The Solr DIH handles importing documents from external data sources. In my case I have about 5GB of data in an RDBMS that gets updates every night. The standard approach in Solr is to define one query for the initial import and a second query to fetch the IDs of documents that have changed and a third query to fetch the data that changed. Especially if you expect a large number of changes this isn't very efficient (*). Furthermore if both in the initial import and the delta case you have the same SELECT list, its tedious to maintain 3 queries where two are almost identical and one still very similar.

So take the example from the docs:

<entity name="item" pk="ID"
  query="SELECT * FROM item"
  deltaImportQuery="SELECT * FROM item
    WHERE id = '${dataimporter.delta.id}'"
  deltaQuery="SELECT id FROM item
    WHERE last_modified > '${dataimporter.last_index_time}'">

This can be rewritten as follows:

<entity name="item" pk="ID"
  query="SELECT * FROM item
    WHERE '${dataimporter.request.clean}' != 'false'
      OR last_modified > '${dataimporter.last_index_time}'">

When doing a normal full import solr defaults the clean to true (watch out you can override this default in the solrconfig.xml so you might want to always set it explicitly to true). As a result the first part of the WHERE condition will be 'true' != 'false' which any decent RDBMS will figure out will lead to the entire all always evaluating to true aka reading the entire table:

Now when doing a delta import you do not use the delta import, but instead you do a normal full-import but with the clear GET parameter set to false:

In this case the first part of the WHERE will be 'false' != 'false' which is obviously always false and any RDBMS should optimize that away and just evaluate the second condition.

In this case it means obviously that in case you also want to use deletedPkQuery then when running the delta-import command is still necessary.

(*) There might be situations where separate queries might be more efficient. Consider an example where the query that fetches the document data is a very complex join and the RDBMS doesn't do a great job with coming up with a good query plan and you can determine a relatively small number of ID's that change every night with a SELECT without a JOIN, then it might be more efficient to do one query to fetch the ID's and then one query per fetched ID that does the complex join.


Re: DeltaQuery with Solr DIH

This is nice. Please add this to the Solr wiki.

The way he describes in the wiki was from before the variable substitution got fancy.

Re: DeltaQuery with Solr DIH

Ok, I added a note to the delta command section linking to a new page on the wiki with the details.

Before you can post a comment please solve the following captcha.

your name: