ramblings on PHP, SQL, the web, politics, ultimate frisbee and what else is on in my life
back

Transforming end user queries to Solr

A bit less than a year ago I last did a presentation about a telephone book application where we used SQL to do some fairly advanced filtering over about 30 tables of data. The app generated SQL statements that filled pages, the more terms the more pages, but on a 10k dataset it still came back within a few milliseconds, thanks to a ton of indexing and denormalization tricks (SQL Server is a lot more powerful here than MySQL) I had applied. Now in a more recent project I am dealing with 10M+ dataset running on MySQL and so decided to learn about Solr. Wow, that thing is amazing and way more flexible in terms of query language than I expected. As a result I do not see it any more for just projects that are too big for an RDBMS, but more as the way to do search in general. I have mentioned resolutionfinder.org a few times (used to be called UN-informed.org). Solr is a key piece there and more importantly I am looking to expand the use of Solr query language quite a bit. Actually for those who know, you can already do a lot more powerful queries, something Liip will be investing some more time to make more accessible to end users with some UI tweaks planned in July. But in this blog post I want to talk about a prototype class I threw together (Look ma', I'm using git!) by working ezcSearch to help me in parsing and transforming end user queries into complex Solr queries.

Right now we have 3 input boxes on the resolutionfinder.org start page. The first one lets user input terms which are searched in the document title and the clause content. Here I recently exposed some of the syntax supported by the dismax handler like "+" to require and "-" to prohibit terms. Next is an input box that doesn't directly affect the search filtering. What it does is use auto suggest to build up a list of tag to filter on. The third input box actually is a separate search form entirely which is used for searching for specific documents by document code. My vision is to bring these all together into one input box. So if someone wants to do a fulltext search they could do something like the following:

"security council" +africa tag:malaria code:A/RES/*

This query would be parsed and result in looking for documents that must contain "africa" in the document title or clause content (scoring those that also contain "security council" higher), which are tagged with "malaria" and who's document code started with "A/RES/". This requires quite a bit of transformation. Actually the final queries I want to construct looks like this:

q=tag_ids:23 AND document_code_prefix:(A/RES/) AND (_query_:"{!dismax qf='content document_title' pf='content document_title' mm=0 v=$qq}")&qq="security council" +africa

So what happened here? As you can see I transformed "tag" to "tag_ids" and "code" to "document_code_prefix". These are the internal fields I use in the Solr index. There is actually also a "document_code" field, which I would use if the code would not end with a "*". The "document_code_prefix" just copies the "document_code" and applies an ngram filter to make prefix searches super fast. Also you may notice that instead of "malaria" I am filtering for 23, this is because right now I just index the tag id's and not the names. The next bit is taking all the un-fielded filters and apply a dismax search handler on them. In order to not require fancy escaping I use "v=$qq" to define that the actual search terms are passed in the "qq" GET parameter.

With the above mentioned class this sort of thing becomes fairly simple. I am using the symfony sfSolrPlugin, which provides query construction tools via the sfLuceneCriteria object. All I have to do is use the parser to parse the end user query into tokens and then use a custom term class that uses the sfLuceneCriteria instance to handle the serialization.


<?php
$criteria = new sfLuceneCriteria;
$stack = array(new phpSolrQueryTermCustom('AND', $criteria));
$parser = new phpSolrQueryParser($factory);

$tokens = $parser->parse($q, $stack);
$terms = $parser->processTerms($tokens);
var_dump($q);
var_dump($terms->serialize());
?>

You can check out the examples to get some more details about what is possible. Following is the code to the example on github, but here is a link to the actual term class and action class used in production.


<?php
class phpSolrQueryTermCustom extends phpSolrQueryTerm {
    protected $criteria;

    public function __construct($op = '', $criteria = null)
    {
        $this->op = $op;
        $this->criteria = $criteria;
    }

    protected function processTerm($term) {
        $prefix = $term->prefix;
        $field = $term->field;
        $term = (string)$term;
        $op = $this->op;
        $op = rtrim(" $op ").' ';
        if (!empty($field)) {
            switch ($field) {
            case 'code':
                $field = substr($term, -1, 1) === '*' ? 'document_code_prefix' : 'document_code';
                $term = $field === 'document_code_prefix' ? substr($term, 0, -1) : $term;
                break;
            case 'tag':
                $field = 'tag_ids';
/*
                $q = Doctrine_Query::create()
                    ->select('t.id')
                    ->from('Tag t')
                    ->where('t.name = ?', array($term));
                $term = $q->execute(array(), Doctrine_Core::HYDRATE_SINGLE_SCALAR);
*/
                $term = rand(1, 100);
                break;
            default:
                throw new Exception('Unsupported field: '.$field);
            }
            $this->criteria->addField($prefix.$field, $term, $op, true);
            return;
        }
        $term = parent::processTerm($term);
        return $prefix.$term;
    }

    public function serialize()
    {
        $terms = $this->terms;
        $op = $this->op;
        $op = rtrim(" $op ").' ';
        $dismax = array();
        foreach ($terms as $key => $term) {
            if ($term instanceOf phpSolrQueryTerm) {
                $term = $this->processTerm($term);
                if (!is_null($term)) {
                    $dismax[] = $term;
                }
            } else {
                $dismax[] = parent::processTerm($term);
            }
        }
        if (!empty($dismax)) {
            $dismax = implode(' ', $dismax);
            $subcritieria = new sfLuceneCriteria;
            $subcritieria->add('_query_:"{!dismax qf=\'content document_title\' pf=\'content document_title\' mm=0 v=$qq}"', $op, true);
            $this->criteria->add($subcritieria, 'AND');
            $this->criteria->addParam('qq', $dismax);
        }

        $q = $this->criteria->getParams();
        $q['q'] = array($this->criteria->getQuery());
        return $q;
    }
}
?>

Comments



Re: Transforming end user queries to Solr

Nice article, I'm starting to do the same kind of queries.

I also tried using lucene, but since in my webhost the indexing/optimizing process would die inexplicably and they didn't help a bit, i turned over to mysql Full Text Indexes.

I didn't delve into the expressiveness in Lucene, but what I saw is that for small sets of data (20K lines), basic text query's in an unoptimized lucene index was taking over 3s to process on by notebook (which is a lot).

It seems that for small sets of data MySQL is way faster.

Anyone can comment on that?

Lucene v.s. MySQL FullText indexes
in terms of performance, expressiveness, scale?

P.S. I'm just starting to define the search query architecture, but for now it's just a textbox, so I'm only supporting very simple queries.

Thanks in advance.

Re: Transforming end user queries to Solr

MySQL full text index sucks big time. In many cases you are better of doing a LIKE '%foo%'.

At any rate .. Solr usually replies within a few milliseconds even with a few million documents. However the first requests take longer since index needs to loads up the index structure into memory which might explain why you think it was slow.