Sunday, June 14, 2015

Solr soft commit Gotcha - OOM

Without frequent hard commits, an intense indexing rate combined with Solr soft commits can lead to an out-of-memory error.


Our Solr collection stores browsing history, with a requirement that new documents become searchable within 30 seconds. Because we have multiple writer processes, we rely on auto soft commit and auto hard commit to avoid a storm of explicit commits.

An OOM during an intense indexing session caught us by surprise. A quick heap dump inspection revealed a fat RAMDirectory in the heap. Why? Soft commits use Lucene NRT, which keeps newly indexed data in RAM. That memory is freed only once a hard commit persists the data to disk, ensuring durability.

The fault was in our auto hard commit policy, which was time-based only (maxTime=10min). If you index fast enough within those 10 minutes, you'll run out of memory.
We fixed that by adding a maxDocs=50,000 limit, where maxDocs is calculated by:
[size of doc] X [num of docs] <= [memory we want to spend per core]
500 bytes     X 50,000        <= 25 MB

We're currently running 8 shard replicas, so the max memory usage for NRT would be 8 X 25 MB = 200 MB.
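
The arithmetic above can be sketched as a small helper. This is only an illustration: the function names are made up for this post, the 500-byte figure is our rough average document size, and the real heap footprint of a buffered document in Lucene will differ from its raw size, so treat the result as a starting point to measure against.

```python
MB = 1_000_000  # decimal megabytes, matching the 25 MB figure above


def max_docs_for_budget(avg_doc_bytes: int, budget_bytes: int) -> int:
    """Largest maxDocs such that avg_doc_bytes * maxDocs <= budget_bytes."""
    return budget_bytes // avg_doc_bytes


def total_nrt_bytes(avg_doc_bytes: int, max_docs: int, cores: int) -> int:
    """Worst-case buffered bytes if every core fills up before a hard commit."""
    return avg_doc_bytes * max_docs * cores


# Numbers from this post: 500-byte docs, 25 MB per core, 8 shard replicas.
max_docs = max_docs_for_budget(500, 25 * MB)
total = total_nrt_bytes(500, max_docs, 8)
print(max_docs, total // MB)
```

With the numbers from this post, the helper gives a maxDocs of 50,000 and a worst case of 200 MB across the 8 replicas.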


Conclusion

When soft committing, make sure to limit heap usage by specifying both maxDocs and maxTime limits in your auto hard commit policy.
This is one of the factors that will affect your Solr memory usage.
Happy soft committing.


Thursday, June 11, 2015

Cost of fields in Lucene index


How much does it cost to have 100K unique fields in your Lucene index?

Motivation for many fields

How would your index schema look if you wanted to let users search over their private data? Here are two options:
  1. Having two fields: 'content' and 'user'.
    When Bob searches for a Banana, the query is: content:banana AND user:bob.
    When Alice searches for an Apple, the query is: content:apple AND user:alice.
  2. A field per user: '[username]-content'
    When Bob searches for a Banana, the query is: bob-content:banana.
    When Alice searches for an Apple, the query is: alice-content:apple.
With option (1) the index has just two fields.
With option (2) the index would have a field per user.
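
To make the two options concrete, here is a minimal sketch that builds the query strings shown above and counts how many distinct fields each schema needs. The helper names are hypothetical, and no query escaping is done, so this is for illustration only.

```python
def shared_field_query(user: str, term: str) -> str:
    """Option 1: one 'content' field, filtered by a 'user' field."""
    return f"content:{term} AND user:{user}"


def per_user_field_query(user: str, term: str) -> str:
    """Option 2: a dedicated '[username]-content' field per user."""
    return f"{user}-content:{term}"


def field_count(users: list, per_user_fields: bool) -> int:
    """Number of distinct index fields each schema needs."""
    return len(set(users)) if per_user_fields else 2


users = ["bob", "alice"]
print(shared_field_query("bob", "banana"))
print(per_user_field_query("alice", "apple"))
print(field_count(users, per_user_fields=False), field_count(users, per_user_fields=True))
```

With option (2) the field count grows linearly with the number of users, which is what the experiment below measures.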

How much does it cost?

I experimented by indexing: 100K docs, 1M unique terms, 10 fields/doc, fields hold 0-10 terms.
With two variations that differ by number of unique fields:
  1. 10 unique fields
  2. 100K unique fields. 
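
For readers who want to reproduce the shape of this data set, here is a rough sketch of a generator for it. It is not the code from the Gist: the function, the field and term naming, and the uniform random choices are all assumptions made for illustration, and the documents would still need to be fed to an actual Lucene indexer.

```python
import random


def synthetic_docs(num_docs, field_pool_size, vocab_size,
                   fields_per_doc=10, max_terms=10, seed=0):
    """Yield docs as {field_name: [terms]}; field names come from a fixed pool."""
    rng = random.Random(seed)
    for _ in range(num_docs):
        # Pick distinct fields for this doc out of the pool of unique fields.
        field_ids = rng.sample(range(field_pool_size), fields_per_doc)
        yield {
            f"field{fid}": [f"term{rng.randrange(vocab_size)}"
                            for _ in range(rng.randint(0, max_terms))]
            for fid in field_ids
        }


# Small-scale versions of the two variations (the post used 100K docs, 1M terms).
few = list(synthetic_docs(100, field_pool_size=10, vocab_size=1000))
many = list(synthetic_docs(100, field_pool_size=100_000, vocab_size=1000))
few_fields = {name for doc in few for name in doc}
many_fields = {name for doc in many for name in doc}
print(len(few_fields), len(many_fields))
```

The only knob that differs between the two variations is field_pool_size; everything else stays the same.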
The results chart below shows that with 100K fields the heap needed to open a reader grows considerably (14 -> 229 MB), the index on disk is about 4x larger (59 -> 253 MB), and indexing takes much longer (6 -> 32 sec).
These results were obtained with Lucene v3.0.3 (yeah super old, I know).



Conclusion

While Lucene can be looked at as a sparse-column NoSQL DB, fields don't come for free. Most troubling is the increase in heap size. Measure before using.

See source code in Gist.