Advanced Filter Caching in Solr

10 02 2012


Note: my blog has moved. Please see: Advanced Filter Caching in Solr


MurmurHash3 for Java

15 09 2011

Note: my blog has moved. Please see: MurmurHash3 for Java

Solr’s Realtime Get

7 09 2011

Note: my blog has moved. Please see: Solr’s Realtime Get

Solr relevancy function queries

10 03 2011

Note: my blog has moved here

Solr Result Grouping / Field Collapsing Improvements

17 12 2010

I previously introduced Solr’s Result Grouping, also called Field Collapsing, that limits the number of documents shown for each “group”, normally defined as the unique values in a field or function query.

Since then, there have been a number of bug fixes, performance improvements, and feature enhancements. You’ll need a recent nightly build of Solr 4.0-dev, or the newly released LucidWorks Enterprise v1.6, our commercial version of Solr.

Feature Enhancements

One improvement is the ability to group by query via the group.query parameter. This functionality is very similar to facet.query, except that it retrieves the top documents that match the query, not just the count. This has many potential uses, including always getting the top documents for specific groups, or defining custom groups such as price ranges.
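For example, a single request can always retrieve the top documents in custom groups by repeating group.query (the price ranges and field name here are purely illustrative):

```
q=*:*&group=true&group.query=price:[0 TO 100]&group.query=price:[100 TO *]
```

Each group.query produces its own group in the response, containing the top documents matching that query.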

Another useful capability is the addition of the group.main parameter. Setting this to true causes the results of the first grouping command to be used as the main result list in a flattened response format that legacy clients will be able to handle.

For example, the grouped response format normally returns highly structured results under “grouped”.

     {"groupValue":"Apache Software Foundation",
      "doclist":{"docs":[{
        "name":"Solr, the Enterprise Search Server",
        "manu":"Apache Software Foundation"}]}},
     {"groupValue":"Corsair Microsystems Inc.",
      "doclist":{"docs":[{
        "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
        "manu":"Corsair Microsystems Inc."}]}}

If we add group.main=true to the request, then we get back a much more familiar looking response (i.e. it looks like a normal non-grouped response):

    {"name":"Solr, the Enterprise Search Server",
     "manu":"Apache Software Foundation"},
    {"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
     "manu":"Corsair Microsystems Inc."},

One can also use the group.format=simple parameter to select this simplified flattened response within the normal “grouped” section of the response.
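A request combining these grouping parameters might look like the following (the field name manu_exact is just an illustration):

```
q=*:*&group=true&group.field=manu_exact&group.format=simple
```

Swapping group.format=simple for group.main=true would instead promote the flattened list to the main result list.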

Other recent enhancements include support for debug/explain output, highlighting, and faceting, as well as the ability to handle missing values in the grouping field by treating all documents without a value as members of a single “null” group.

Performance Enhancements

There have been a number of performance enhancements, including an improvement to the short-circuiting logic that cuts off low-ranking documents earlier in the process. This important optimization resulted in a speedup of about 9x when collapsing on certain fields!

Collapsing on string fields was further optimized with specialized code that works on ordinal (ord) values instead of the string values themselves. This doubled performance yet again!

Please see the Solr Wiki for further documentation on all of result grouping’s capabilities and parameters.

Indexing JSON in Solr 3.1

8 12 2010

Solr has been able to produce JSON results for a long time, by adding wt=json to any query. A new capability has recently been added to allow indexing in JSON, as well as issuing other update commands such as deletes and commits.

All of the functionality that was available through XML update commands can now be given in JSON.
For example, you can index a document like so:

$ curl http://localhost:8983/solr/update/json -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "ISBN:978-0641723445",
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "cat" : ["book","hardcover"],
      "genre_s" : "fantasy",
      "pages_i" : 384,
      "price" : 12.50,
      "inStock" : true,
      "popularity" : 10
    }
  }
}'
Of course, if you want the doc to be visible, you must do a commit. We could have done this by adding a commit=true parameter to the URL in the previous command, or by including a commit command within the JSON itself. This time, we’ll issue a separate commit command:

curl "http://localhost:8983/solr/update/json?commit=true"
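Alternatively, the commit could have been embedded in the JSON body itself, using an empty commit object:

```
curl http://localhost:8983/solr/update/json -H 'Content-type:application/json' -d '{"commit": {}}'
```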

And now, we can query the Solr index and verify the document has been correctly added (requesting the results in JSON of course!)
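A query along these lines would return the newly added document (the choice of query field here is just one option, assuming the example schema):

```
curl "http://localhost:8983/solr/select?q=title:lightning&wt=json"
```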


There’s more documentation on the Solr Wiki.
To use this functionality, you’ll need to use LucidWorks Enterprise (our commercial version of Solr), or a recent Solr 3.1-dev or 4.0-dev nightly build.

Solr Result Grouping / Field Collapsing

16 09 2010

Result Grouping, also called Field Collapsing, has been committed to Solr!
This functionality limits the number of documents for each “group”, usually defined by the unique values in a field (just like field faceting).

You can think of it like faceted search, except instead of just getting a count, you get the top documents for that constraint or category. There are tons of potential use cases:

  • For web search, show only one or two results per website by collapsing on a site field.
  • For email search, show only one or two results per email thread.
  • For e-commerce, show the top three products for each store category (e.g. “electronics”, “housewares”).
  • Hide duplicate documents at query time.

In addition to being able to group by the values of a field, you can also group by the values of a function query. Given that geo search works as a function query, this also opens up possibilities for showing top query matches within 1 mile, between 1 and 2 miles, etc.
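As an illustrative fragment (the field name is an assumption), grouping by a function query could bucket documents into 100-dollar price increments:

```
q=*:*&group=true&group.func=rint(div(price,100))
```

Every document whose function value rounds to the same number falls into the same group.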

Just like faceting, we’ll be adding new functionality and making continual improvements.
Result Grouping is documented on the Solr Wiki, and you will need a recent
nightly build of Solr 4.0-dev to try it out (just make sure it’s dated after this post).

CSV output for Solr

29 07 2010

Solr has been able to slurp in CSV for quite some time, and now I’ve finally got around to adding the ability to output query results in CSV also. The output format matches what the CSV loader can slurp.

Adding a simple wt=csv to a query request will cause the docs to be written in a CSV format that can be loaded into something like Excel.
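The rows below could have been produced by a request along these lines (the query and field list are assumptions for illustration):

```
http://localhost:8983/solr/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv
```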


id,cat,name,popularity,price,score
IW-02,"electronics,connector",iPod & iPod Mini USB 2.0 Cable,1,11.5,0.98867977
F8V7067-APL-KIT,"electronics,connector",Belkin Mobile Power Cord for iPod w/ Dock,1,19.95,0.6523595
MA147LL/A,"electronics,music",Apple 60 GB iPod with Video Playback Black,10,399.0,0.2446348

CSV formats tend to vary, so there are a number of parameters that allow you to customize the output. For example, setting csv.escape=\ and csv.separator=%09 (a URL-encoded tab character) will use a tab separator and backslash escaping to match the default CSV format that MySQL uses.


score	id
0.98867977	IW-02
0.6523595	F8V7067-APL-KIT
0.2446348	MA147LL/A
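As a quick client-side sketch (not part of Solr itself), Python’s csv module can read this tab-separated, backslash-escaped output; the sample data below is copied from the rows above:

```python
import csv
import io

# Sample Solr output produced with csv.separator=%09 (tab) and csv.escape=\
data = (
    "score\tid\n"
    "0.98867977\tIW-02\n"
    "0.6523595\tF8V7067-APL-KIT\n"
    "0.2446348\tMA147LL/A\n"
)

# QUOTE_NONE plus a backslash escapechar matches the MySQL-style dialect
reader = csv.reader(
    io.StringIO(data),
    delimiter="\t",
    escapechar="\\",
    quoting=csv.QUOTE_NONE,
)
header, *rows = list(reader)
print(header)   # ['score', 'id']
print(rows[0])  # ['0.98867977', 'IW-02']
```

In a real client, the io.StringIO wrapper would simply be replaced by the HTTP response stream.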

The CSVResponseWriter is documented on the Solr Wiki, but you will need a recent
nightly build (Solr 3.1-dev or Solr 4.0-dev) to try it out.

Ranges over Functions in Solr 1.4

6 07 2009

Solr 1.4 contains a new feature that allows range queries or range filters over arbitrary functions. It’s implemented as a standard Solr QParser plugin, and is thus easily available for use anywhere that accepts the standard Solr query syntax, by specifying the frange query type. Here’s an example of a filter specifying the lower and upper bounds for a function:

fq={!frange l=0 u=2.2}log(sum(user_ranking,editor_ranking))

The other interesting use for frange is trading off memory for speed when doing range queries on any type of single-valued field. For example, one can use frange on a string field, provided that each document has at most one value in the field and that numeric functions are avoided.

For example, here is a filter that only allows authors between martin and rowling, specified using a standard range query:
fq=author_last_name:[martin TO rowling]

And the same filter using a function range query (frange):
fq={!frange l=martin u=rowling}author_last_name

This can lead to significant performance improvements for range queries with many terms between the endpoints, at the cost of memory to hold the un-inverted form of the field in memory (i.e. a FieldCache entry – same as would be used for sorting). If the field in question is already being used for sorting or other function queries, there won’t be any additional memory overhead.

The following chart shows the results of a test of frange queries vs standard range queries on a string field with 200,000 unique values. For example, frange was 14 times faster when executing a range query / range filter that covered 20% of the terms in the field. For narrower ranges that matched less than 5% of the values, the traditional range query performed better.

Percent of terms covered   Fastest implementation   Speedup (times faster)
100%                       frange                   43.32
20%                        frange                   14.25
10%                        frange                   8.07
5%                         frange                   1.337
1%                         normal range query       3.59

Of course, Solr 1.4 also contains the new TrieRange functionality that will generally have the best time/space profile for range queries over numeric fields.

Filtered query performance increases for Solr 1.4

27 05 2009

One of the many performance improvements in the upcoming Solr 1.4 release involves filtering. Solr 1.4 filters are faster (anywhere from 30% to 80% faster at calculating intersections, depending on configuration), smaller (taking 40% less memory), and more efficiently applied to the query during a search.

In previous Solr releases, filters were applied after the main query and thus had little impact on overall query performance. Filters are now checked in parallel with the query, so the fewer documents a filter matches, the greater the speedup.

Example: adding a filter that matched 10% of a large index resulted in a 300% performance increase for a dismax query of three words on a single field with a proximity boost.

Related issues: