Advanced Filter Caching in Solr

10 02 2012


Advanced Filter Caching is a relatively new feature in Solr, available in version 3.4 and above. It gives precise control over how Solr handles filter queries in order to maximize performance, including whether a filter is cached, the order in which filters are evaluated, and post filtering.

Filter Queries in Solr

Adding a filter expressed as a query to a Solr request is a snap… simply add an additional fq parameter for each filter query.

http://localhost:8983/solr/select?
   q=cars
   &fq=color:red
   &fq=model:Corvette
   &fq=year:[2005 TO *]

By default, Solr resolves all of the filters *before* the main query. Each filter query is looked up individually in Solr’s filterCache (which is pretty advanced itself, supporting concurrent lookups, different eviction policies such as LRU or LFU, and auto-warming). Caching each filter query separately accelerates Solr’s query throughput by greatly improving cache hit rates since many types of filters tend to be reused across different requests.
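If you build requests like this from a client, the only wrinkle is that fq repeats once per filter. Here is a minimal Python sketch of that; the helper name build_select_url is made up for illustration, and the host and path are simply the ones from the example above.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, filters):
    """Build a select URL with one fq parameter per filter query.

    Hypothetical helper for illustration; it is not part of any Solr client.
    """
    # A list of pairs (rather than a dict) lets the fq key repeat.
    params = [("q", query)] + [("fq", f) for f in filters]
    return base_url + "?" + urlencode(params)

url = build_select_url(
    "http://localhost:8983/solr/select",
    "cars",
    ["color:red", "model:Corvette", "year:[2005 TO *]"],
)
print(url)
```

Because each fq is sent separately, each one can be cached and reused on its own, which is the point of the paragraph above.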

To Cache or not to Cache

The new advanced filter control API adds the ability to *not* cache a filter. Some filters may see almost no reuse across different requests, and not attempting to cache them can lead to a smaller, more effective filterCache with a higher hit rate.

To tell Solr not to cache a filter, we use the same powerful local params DSL that adds metadata to query parameters and is used to specify different types of query syntaxes and query parsers. For a normal query that does not have any localParam metadata, simply prepend a local param of cache=false. For example:
&fq={!cache=false}year:[2005 TO *]

To add cache=false to a filter query that already has localParams, simply add it right in with the rest of the params. For example, if we want to use Solr’s native spatial abilities to restrict our matches to locations within 50 km of Stanford, our filter query would look like:
&fq={!geofilt sfield=location pt=37.42,-122.17 d=50}

It’s easy to modify this filter to tell Solr not to cache it by adding cache=false in with the rest of the local parameters:
&fq={!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false}

Leapfrog anyone?

When a filter isn’t generated up front and cached, it’s executed in parallel with the main query. First, the filter is asked about the first document id that it matches. The query is then asked about the first document that is equal to or greater than that document. The filter is then asked about the first document that is equal to or greater than that. The filter and the query play this game of leapfrog until they land on the same document and it’s declared a match, after which the document is collected and scored.
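The leapfrog game is easier to see in code. The following is a simplified Python model, not Solr's actual implementation: each side is just a sorted list of document ids with an advance(target) operation, and the class and function names are invented for this sketch.

```python
from bisect import bisect_left

NO_MORE_DOCS = 2**31 - 1  # sentinel meaning "iterator exhausted"

class DocIdIterator:
    """Toy stand-in for a query or filter: a sorted list of doc ids."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)

    def advance(self, target):
        """Return the first doc id >= target, or NO_MORE_DOCS."""
        i = bisect_left(self.doc_ids, target)
        return self.doc_ids[i] if i < len(self.doc_ids) else NO_MORE_DOCS

def leapfrog(query, filt):
    """Collect the doc ids matched by both the query and the filter."""
    matches = []
    doc = filt.advance(0)              # ask the filter for its first match
    while doc != NO_MORE_DOCS:
        q_doc = query.advance(doc)     # query jumps to >= the filter's doc
        if q_doc == NO_MORE_DOCS:
            break
        if q_doc == doc:               # both landed on the same doc: a match
            matches.append(doc)
            doc = filt.advance(doc + 1)
        else:                          # filter jumps to >= the query's doc
            doc = filt.advance(q_doc)
    return matches

print(leapfrog(DocIdIterator([1, 3, 5, 8, 13, 21]),
               DocIdIterator([2, 3, 8, 9, 21, 34])))
```

Neither side ever materializes its full set of matches, which is why this works well for filters that are not worth caching.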

How much is that filter?

Advanced filtering adds even more fine grained control by introducing the notion of cost. If there are multiple non-cached filters in a request, filters with a lower cost will be checked before those with a higher cost.

&fq={!cache=false cost=10}year:[2005 TO *]
&fq={!geofilt cache=false cost=20}
&pt=37.42,-122.17
&sfield=location
&d=50

In the example above, the filter based on year has a lower cost and will thus always be checked before the spatial filter.

As an aside, notice how spatial queries will use global spatial request parameters if they are not specified locally. This can make it even easier to construct requests containing spatial functions.

Expensive Filters

Some filters are slow enough that you don’t even want to run them in parallel with the query and other filters, even if they are consulted last, since asking them “what is the next doc you match on or after this given doc” is so expensive. For these types of filters, you really want to ask them “do you match this doc” only after the query and all other filters have been consulted. Solr has special support for this called “post filtering”.

Post filtering is triggered by filters that have a cost>=100 and have explicit support for it. If there are multiple post filters in a single request, they will be ordered by cost.
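To make the rules concrete, here is a rough Python model of how a request's filters get sorted into the three buckets described so far: cached filters, non-cached filters checked in cost order, and post filters. The Filter class and plan_filters function are assumptions made up for this sketch; Solr's real logic lives in its searcher code and is more involved.

```python
from dataclasses import dataclass

@dataclass
class Filter:
    """Toy description of one fq parameter (not a Solr class)."""
    name: str
    cache: bool = True                  # cache=false turns this off
    cost: int = 0
    supports_post_filter: bool = False  # e.g. frange

def plan_filters(filters):
    """Split filters into (cached, leapfrog, post), each in evaluation order."""
    cached = [f for f in filters if f.cache]
    # Non-cached filters are consulted from lowest cost to highest.
    uncached = sorted((f for f in filters if not f.cache), key=lambda f: f.cost)
    # Only filters with cost >= 100 AND explicit support become post filters.
    post = [f for f in uncached if f.cost >= 100 and f.supports_post_filter]
    leapfrog = [f for f in uncached if f not in post]
    return cached, leapfrog, post

cached, leapfrog, post = plan_filters([
    Filter("frange", cache=False, cost=200, supports_post_filter=True),
    Filter("geofilt", cache=False, cost=20),
    Filter("color:red"),
    Filter("year", cache=False, cost=10),
])
print([f.name for f in cached], [f.name for f in leapfrog], [f.name for f in post])
```

Note that a high cost alone is not enough in this model: a non-cached filter without post filter support simply ends up last among the regular non-cached filters.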

The frange qparser has post filter support and allows powerful queries specifying ranges over arbitrarily complex function queries.

For example, if we wanted to take the log of popularity, divide it by the square root of the distance, and filter out documents with a result less than 5, we could run this as a post filter using frange:

&fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))
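As a quick sanity check of what that filter keeps and drops, here is the same arithmetic in Python. This assumes Solr's log function is base 10, that geodist() returns kilometers, and that the lower bound l is inclusive (frange's default); the function names are just for illustration.

```python
import math

def frange_value(popularity, distance_km):
    """div(log(popularity), sqrt(geodist())) with log taken as base 10."""
    return math.log10(popularity) / math.sqrt(distance_km)

def passes_frange(value, lower=5.0):
    """Model of {!frange l=5}: keep values at or above the lower bound."""
    return value >= lower

near = frange_value(1_000_000, 1.0)  # very popular, 1 km away
far = frange_value(1_000_000, 4.0)   # same popularity, 4 km away
print(near, passes_frange(near))
print(far, passes_frange(far))
```

The point of running this as a post filter is that the function is only evaluated for documents that already survived the query and every cheaper filter.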

Post filtering support for the spatial filter queries bbox and geofilt has just been added to Solr 4.0 too. To execute our previous un-cached spatial filter as a post filter, simply modify its cost to be 100 or greater:

&fq={!geofilt cache=false cost=150}
&pt=37.42,-122.17
&sfield=location
&d=50

Custom Filters

If you have expensive custom logic you’d like to add as a post filter (say per-document custom security ACLs), you can implement your own QParserPlugin that returns Query objects that implement Solr’s PostFilter interface. You can set the default cost or hardcode a cost higher than 100 if you want to only support post filtering. Then, you can use your custom parser as you would any other builtin query type via fq={!myqueryparser} and Solr will handle the rest!

Try it out!

In conclusion, hopefully this gives more insight into just one of many factors working under the hood to make Solr so fast.
To try out the latest functionality, you can always get a nightly build of trunk. Feedback is always appreciated!





MurmurHash3 for Java

15 09 2011

Background

I needed a really good hash function for the distributed indexing we’re implementing for Solr. Since it will be used for partitioning documents, it needed to be really high quality (well distributed) since we don’t want uneven shards. It also needs to be cross-platform, so a client could calculate this hash value themselves if desired, to predict which node has a given document.

MurmurHash3

MurmurHash3 is one of the most popular new hash functions these days, being both really fast and of high quality. Unfortunately it’s written in C++, and a quick google did not yield any suitable high quality port. So I took 15 minutes (it’s small!) to port the 32 bit version, since it should be faster than the other versions for small keys like document ids. It works in 32 bit chunks and produces a 32 bit hash, which is more than enough for partitioning documents by hash code.
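For readers who want to see how small the 32 bit variant really is, here is a Python sketch of MurmurHash3 x86_32 written from the reference algorithm. It is not the Java port itself, and the shard_for helper at the end is a hypothetical illustration of partitioning by hash, not Solr's actual routing code.

```python
MASK32 = 0xFFFFFFFF

def _rotl32(x, r):
    """Rotate a 32 bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32 bit MurmurHash3 of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h1 = seed & MASK32
    length = len(data)
    rounded_end = length & ~0x3  # length rounded down to a multiple of 4

    # Body: mix in one little-endian 32 bit chunk at a time.
    for i in range(0, rounded_end, 4):
        k1 = int.from_bytes(data[i:i + 4], "little")
        k1 = (k1 * c1) & MASK32
        k1 = _rotl32(k1, 15)
        k1 = (k1 * c2) & MASK32
        h1 ^= k1
        h1 = _rotl32(h1, 13)
        h1 = (h1 * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded_end:]
    if tail:
        k1 = int.from_bytes(tail, "little")
        k1 = (k1 * c1) & MASK32
        k1 = _rotl32(k1, 15)
        k1 = (k1 * c2) & MASK32
        h1 ^= k1

    # Finalization: force all bits of the hash to avalanche.
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return h1

def shard_for(doc_id: str, num_shards: int) -> int:
    """Hypothetical helper: pick a shard by hashing the document id."""
    return murmurhash3_x86_32(doc_id.encode("utf-8")) % num_shards

print(hex(murmurhash3_x86_32(b"SOLR1000")), shard_for("SOLR1000", 4))
```

Since the function depends only on the bytes and the seed, a client in any language can compute the same value and predict where a document lives.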

MurmurHash3-java

It would be nice to save others from having to do the same thing. Since stuff like this is small enough, I simply put it in the public domain and uploaded it to GitHub. This way anyone can just copy the file or the function into their project and avoid extra dependencies and license hassles.

Here’s the code, copy away!





Solr’s Realtime Get

7 09 2011

Solr took another step toward increasing its NoSQL datastore capabilities, with the addition of realtime get.

Background

As readers probably know, Lucene/Solr search works off of point-in-time snapshots of the index. After changes have been made to the index, a commit (or a new Near Real Time softCommit) needs to be done before those changes are visible. Even with Solr’s new NRT (Near Real Time) capabilities, it’s probably not advisable to reopen the searcher more than once a second. However, there are some use cases that require the absolute latest version of a document, as opposed to just a very recent version. This is where Solr’s new realtime get comes to the rescue: the latest version of a document can be retrieved without reopening the searcher and risking disruption to other normal search traffic.

The Realtime-Get API

The realtime get handler is registered at the /get URL. As an example, a request like
http://localhost:8983/solr/get?id=SOLR1000&fl=id,name&wt=json
returns a response like

{"doc":{"id":"SOLR1000","name":"Solr, the Enterprise Search Server"}}

Notice that the optional fl (field list) parameter works as normal, allowing you to select the fields you want returned.

There’s also a realtime get component that can be inserted into any request handler, including the standard request handler.

How it works

The realtime get feature uses transaction logging to keep track of uncommitted updates to the index. When a get request for a document is received, this log is checked first, and the document is retrieved from there if found. If it’s not found, then the latest opened searcher is used to retrieve the document. Checking the log is super fast, and IO reads from the log are fully concurrent for maximum scalability.
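The lookup order can be sketched in a few lines of Python. This is a toy model only: the class and method names are invented, plain dicts stand in for the transaction log and the searcher's view of the index, and deletes are ignored.

```python
class RealtimeGetModel:
    """Toy model of 'check the log first, then the latest searcher'."""

    def __init__(self):
        self.searcher_view = {}  # docs visible to the last opened searcher
        self.update_log = {}     # uncommitted updates, keyed by id

    def add(self, doc):
        """Record an update; it is not searchable until commit()."""
        self.update_log[doc["id"]] = doc

    def commit(self):
        """Make logged updates visible to searches and clear the log."""
        self.searcher_view.update(self.update_log)
        self.update_log.clear()

    def search_by_id(self, doc_id):
        """What a normal search sees: only committed documents."""
        return self.searcher_view.get(doc_id)

    def realtime_get(self, doc_id):
        """What /get sees: the log wins, the searcher is the fallback."""
        if doc_id in self.update_log:
            return self.update_log[doc_id]
        return self.searcher_view.get(doc_id)

index = RealtimeGetModel()
index.add({"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"})
print(index.search_by_id("SOLR1000"))  # not yet visible to search
print(index.realtime_get("SOLR1000"))  # already visible to realtime get
```

After a commit, both paths return the same document, and the get no longer needs the log at all.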

Try it out

Download a recent nightly build of Solr 4.0-dev and follow the Quick Start guide  on the Solr wiki.  Feedback on the solr-user mailing list is always appreciated!





Solr relevancy function queries

10 03 2011

Lucene’s default ranking function uses factors such as tf, idf, and norm to help calculate relevancy scores.
Solr has now exposed these factors as function queries.

  • docfreq(field,term) returns the number of documents that contain the term in the field.
  • termfreq(field,term) returns the number of times the term appears in the field for that document.
  • idf(field,term) returns the inverse document frequency for the given term, using the Similarity for the field.
  • tf(field,term) returns the term frequency factor for the given term, using the Similarity for the field.
  • norm(field) returns the “norm” stored in the index, the product of the index time boost and the length normalization factor.
  • maxdoc() returns the number of documents in the index, including those that are marked as deleted but have not yet been purged.
  • numdocs() returns the number of documents in the index, not including those that are marked as deleted but have not yet been purged.

We can use these new functions to develop and test custom ranking functions!  For example, if we wanted simple tf*idf for a given term, we could issue the following function query (if you have solr’s example server running with exampledocs indexed, just click on the following link):

http://localhost:8983/solr/select/?fl=score,id&defType=func&q=mul(tf(text,memory),idf(text,memory))
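To get a feel for the numbers such a query produces, here is the same product computed by hand in Python. It assumes Lucene's classic DefaultSimilarity formulas, where tf is the square root of the raw frequency and idf is 1 + ln(numDocs / (docFreq + 1)); a field configured with a different Similarity would give different values.

```python
import math

def tf(freq):
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency, assuming the classic default formula."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(freq, doc_freq, num_docs):
    """Equivalent of mul(tf(f,t),idf(f,t)) for one document."""
    return tf(freq) * idf(doc_freq, num_docs)

# A term that appears 4 times in a document and in 9 of 1000 documents.
print(tf_idf(freq=4, doc_freq=9, num_docs=1000))
```

Rare terms (small doc_freq) push idf up, while the square root keeps a term repeated many times from dominating the score.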

To avoid repeating the term we are using (text,memory) we can pull the field and term out into other query parameters:

http://localhost:8983/solr/select/?fl=score,id&defType=func&q=mul(tf($f,$t),idf($f,$t))&f=text&t=memory

Utilizing Solr’s new ability to sort by arbitrary function queries, we could now sort a query by the number of times a specific term appears in each document.  The following query searches for documents matching “DDR”, but then sorts by the number of times “memory” appears in the text field.

http://localhost:8983/solr/select/?fl=score,id&q=DDR&sort=termfreq(text,memory) desc

We could also utilize the “norm” function to sort by the longest field first.  This assumes there were no index time boosts and thus the norm is just the standard length normalization factor.

http://localhost:8983/solr/select/?fl=score,id&q=DDR&sort=norm(text) asc

Given Solr’s plethora of function queries (including the new spatial queries that return distance between points), the possibilities are almost endless.  To try this out,  you’ll need a recent nightly build of Solr 4.0-dev, or LucidWorks Enterprise, our commercial version of Solr.





Solr Result Grouping / Field Collapsing Improvements

17 12 2010

I previously introduced Solr’s Result Grouping, also called Field Collapsing, that limits the number of documents shown for each “group”, normally defined as the unique values in a field or function query.

Since then, there have been a number of bug fixes, performance improvements, and feature enhancements. You’ll need a recent nightly build of Solr 4.0-dev, or the newly released LucidWorks Enterprise v1.6, our commercial version of Solr.

Feature Enhancements

One improvement is the ability to group by query via the group.query parameter. This functionality is very similar to facet.query, except that it retrieves the top documents that match the query, not just the count. This has many potential uses, including always getting the top documents for specific groups, or defining custom groups such as price ranges.

Another useful capability is the addition of the group.main parameter. Setting this to true causes the results of the first grouping command to be used as the main result list in a flattened response format that legacy clients will be able to handle.

For example, the grouped response format normally returns highly structured results under “grouped”.
…&q=solr+memory&group=true&group.field=manu_exact


 "grouped":{
  "manu_exact":{
   "matches":6,
   "groups":[{
     "groupValue":"Apache Software Foundation",
     "doclist":{"numFound":1,"start":0,"docs":[
       {
        "id":"SOLR1000",
        "name":"Solr, the Enterprise Search Server",
        "manu":"Apache Software Foundation"}]
     }},
    {
     "groupValue":"Corsair Microsystems Inc.",
     "doclist":{"numFound":2,"start":0,"docs":[
       {
        "id":"VS1GB400C3",
        "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
        "manu":"Corsair Microsystems Inc."}]
     }},
[...]

If we add group.main=true to the request, then we get back a much more familiar looking response (i.e. it looks like a normal non-grouped response):
…&q=solr+memory&group=true&group.field=manu_exact&group.main=true


 "response":{"numFound":6,"start":0,"docs":[
   {
    "id":"SOLR1000",
    "name":"Solr, the Enterprise Search Server",
    "manu":"Apache Software Foundation"},
   {
    "id":"VS1GB400C3",
    "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
    "manu":"Corsair Microsystems Inc."},

One can also use the group.format=simple parameter to select this simplified flattened response within the normal “grouped” section of the response.
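Conceptually, the flattening that group.main performs looks something like the Python sketch below. It is only a model of the response shapes shown above (the function name is made up), useful if you want to do the same thing on the client side with an already grouped response.

```python
def flatten_grouped(grouped, field):
    """Turn a 'grouped' section into a flat, response-like dict."""
    command = grouped[field]
    # Concatenate each group's documents, preserving group order.
    docs = [doc
            for group in command["groups"]
            for doc in group["doclist"]["docs"]]
    return {"numFound": command["matches"], "start": 0, "docs": docs}

grouped = {
    "manu_exact": {
        "matches": 6,
        "groups": [
            {"groupValue": "Apache Software Foundation",
             "doclist": {"numFound": 1, "start": 0,
                         "docs": [{"id": "SOLR1000"}]}},
            {"groupValue": "Corsair Microsystems Inc.",
             "doclist": {"numFound": 2, "start": 0,
                         "docs": [{"id": "VS1GB400C3"}]}},
        ],
    }
}
flat = flatten_grouped(grouped, "manu_exact")
print(flat)
```

Legacy clients that only know how to read a flat list of docs can consume this shape without any knowledge of grouping.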

Other recent enhancements include support for debugging explain, highlighting, faceting, and the ability to handle missing values in the grouping field by treating all documents without a value as being in the “null” group.

Performance Enhancements

There have been a number of performance enhancements, including an improvement to the short circuiting logic… cutting off low ranking documents earlier in the process. This important optimization resulted in a speedup of about 9x for collapsing on certain fields!

Collapsing on string fields was further optimized with specialized code that worked on ord values instead of the string values. This doubled the performance yet again!

Please see the Solr Wiki for further documentation on all of result grouping’s capabilities and parameters.





Indexing JSON in Solr 3.1

8 12 2010

Solr has been able to produce JSON results for a long time, by adding wt=json to any query. A new capability has recently been added to allow indexing in JSON, as well as issuing other update commands such as deletes and commits.

All of the functionality that was available through XML update commands can now be given in JSON.
For example, you can index a document like so:


$ curl http://localhost:8983/solr/update/json -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "ISBN:978-0641723445",
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "cat" : ["book","hardcover"],
      "genre_s" : "fantasy",
      "pages_i" : 384,
      "price" : 12.50,
      "inStock" : true,
      "popularity" : 10
    }
  }
}'

Of course, if you want the doc to be visible, you must do a commit. This could have been done by adding a commit=true parameter to the URL in the previous command, or we could have added a commit command within the JSON itself. This time we’ll issue a separate commit command.


curl "http://localhost:8983/solr/update/json?commit=true"
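The same update can be prepared from Python using only the standard library. The sketch below builds the request without sending it, and puts the commit command inside the JSON body as mentioned above; the helper name build_update_request is an assumption for illustration, and the URL is the one from the curl examples.

```python
import json
from urllib.request import Request

def build_update_request(doc, commit=False,
                         url="http://localhost:8983/solr/update/json"):
    """Build (but do not send) a JSON update request for one document."""
    commands = {"add": {"doc": doc}}
    if commit:
        commands["commit"] = {}  # commit command inside the JSON itself
    body = json.dumps(commands).encode("utf-8")
    # Supplying data makes this a POST request.
    return Request(url, data=body,
                   headers={"Content-type": "application/json"})

request = build_update_request(
    {"id": "ISBN:978-0641723445",
     "title": "The Lightning Thief",
     "cat": ["book", "hardcover"],
     "pages_i": 384},
    commit=True,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a Solr server running, passing the request to urllib.request.urlopen would send it.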

And now, we can query the Solr index and verify the document has been correctly added (requesting the results in JSON of course!)

http://localhost:8983/solr/select?wt=json&indent=true&q=title:lightning

There’s more documentation on the Solr Wiki.
To use this functionality, you’ll need to use LucidWorks Enterprise (our commercial version of Solr), or a recent Solr 3.1-dev or 4.0-dev nightly build.





Solr Result Grouping / Field Collapsing

16 09 2010

Result Grouping, also called Field Collapsing, has been committed to Solr!
This functionality limits the number of documents for each “group”, usually defined by the unique values in a field (just like field faceting).

You can think of it like faceted search, except instead of just getting a count, you get the top documents for that constraint or category. There are tons of potential use cases:

  • For web search, only show 1 or 2 results for a given website by collapsing on a site field.
  • For email search, only show 1 or 2 results for a given email thread.
  • For e-commerce, show the top 3 products for each store category (e.g. “electronics”, “housewares”).
  • Hiding duplicate documents at query time.

In addition to being able to group by the values of a field, you can also group by the values of a function query. Given that geo search works as a function query, this also opens up possibilities for showing top query matches within 1 mile, between 1 and 2 miles, etc.
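If it helps to picture what grouping does to a ranked result list, here is a small Python model. It is not how Solr implements it; the function name and the limit parameter are made up for the sketch, and the input is assumed to already be sorted by relevance.

```python
def group_results(ranked_docs, field, limit=1):
    """Keep the top `limit` docs per distinct value of `field`.

    Groups come back in the order their best document was ranked.
    """
    groups = {}  # dicts preserve insertion order
    for doc in ranked_docs:
        key = doc.get(field)  # docs without the field share the None group
        group = groups.setdefault(
            key, {"groupValue": key, "numFound": 0, "docs": []})
        group["numFound"] += 1
        if len(group["docs"]) < limit:
            group["docs"].append(doc)
    return list(groups.values())

ranked = [
    {"id": "a1", "site": "example.com"},
    {"id": "b1", "site": "example.org"},
    {"id": "a2", "site": "example.com"},
    {"id": "a3", "site": "example.com"},
]
for group in group_results(ranked, "site", limit=2):
    print(group["groupValue"], group["numFound"],
          [d["id"] for d in group["docs"]])
```

Swapping the field lookup for a computed value (say, a distance bucket) gives the group-by-function behavior described above.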

Just like faceting, we’ll be adding new functionality and making continual improvements.
Result Grouping is documented on the Solr Wiki, and you will need a recent nightly build of Solr 4.0-dev to try it out (just make sure it’s dated after this post).







