Yet Another Data Scientist: Cost of fields in Lucene index

How much does it cost to have 100K unique fields in your Lucene index?

Motivation for many fields

How would your index schema look if you wanted to let users search over their private data? Here are two options:

1. Two fields: 'content' and 'user'.
   When Bob searches for a banana, the query is: content:banana AND user:bob.
   When Alice searches for an apple, the query is: content:apple AND user:alice.
2. A field per user: '[username]-content'.
   When Bob searches for a banana, the query is: bob-content:banana.
   When Alice searches for an apple, the query is: alice-content:apple.

With option (1) the index has just two fields.
With option (2) the index would have a field per user.
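
To make the difference concrete, here is a minimal sketch of how the query string would be built under each option. The helper names are made up for illustration, and the sketch assumes user names and terms need no query-syntax escaping:

```python
# Hypothetical helpers (not part of Lucene): build the query string
# for each schema option described above.

def query_shared_fields(user, term):
    """Option (1): one shared 'content' field, filtered by a 'user' field."""
    return f"content:{term} AND user:{user}"


def query_field_per_user(user, term):
    """Option (2): each user gets a dedicated '[username]-content' field."""
    return f"{user}-content:{term}"


print(query_shared_fields("bob", "banana"))     # content:banana AND user:bob
print(query_field_per_user("alice", "apple"))   # alice-content:apple
```

With option (2) the number of distinct field names grows with the number of users, which is exactly what the experiment below measures.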

How much does it cost?

I experimented by indexing 100K docs with 1M unique terms and 10 fields per doc, each field holding 0-10 terms.
I ran two variations that differ in the number of unique fields:

1. 10 unique fields
2. 100K unique fields
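
A rough sketch of how such a synthetic corpus could be generated follows. This is not the code from the Gist. The function name, the 'field<N>' and 'term<N>' naming, and the uniform random choice of fields and terms are all assumptions made for illustration:

```python
import random


def make_docs(num_docs, num_unique_fields, num_unique_terms=1_000_000,
              fields_per_doc=10, max_terms_per_field=10, seed=0):
    """Yield synthetic docs as {field_name: [terms]} dicts.

    Assumption: each doc picks its fields uniformly at random from a pool
    of num_unique_fields names, and each field holds 0..max_terms_per_field
    terms drawn uniformly from the term vocabulary.
    """
    rng = random.Random(seed)
    for _ in range(num_docs):
        # Distinct field ids for this doc; needs num_unique_fields >= fields_per_doc.
        field_ids = rng.sample(range(num_unique_fields), fields_per_doc)
        doc = {}
        for fid in field_ids:
            n_terms = rng.randint(0, max_terms_per_field)  # inclusive bounds
            doc[f"field{fid}"] = [
                f"term{rng.randrange(num_unique_terms)}" for _ in range(n_terms)
            ]
        yield doc


# The two variations differ in the size of the field-name pool.
few_fields = make_docs(num_docs=100_000, num_unique_fields=10)
many_fields = make_docs(num_docs=100_000, num_unique_fields=100_000)
```

Each generated doc would then be turned into a Lucene document and added to the index. Only the field-name pool changes between the two runs.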

The results chart below shows that with 100K fields the amount of heap needed to open a reader grows considerably (14->229MB, about 16x), the index size on disk is about 4x larger (59->253MB), and indexing takes about 5x longer (6->32sec).
These results were obtained with Lucene v3.0.3 (yeah super old, I know).
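
The growth factors follow directly from the numbers above. The sketch below recomputes them and adds a back-of-the-envelope heap cost per extra field. The per-field figure assumes the heap grows linearly with the number of fields and that 1MB is 10^6 bytes, neither of which was measured here:

```python
# Measurements quoted above: (10 unique fields, 100K unique fields).
FEW_FIELDS, MANY_FIELDS = 10, 100_000
results = {
    "reader heap (MB)": (14, 229),
    "index size on disk (MB)": (59, 253),
    "indexing time (sec)": (6, 32),
}


def growth_factor(before, after):
    """How many times larger the 100K-field value is than the 10-field one."""
    return after / before


for name, (before, after) in results.items():
    print(f"{name}: {before} -> {after} (x{growth_factor(before, after):.1f})")

# Rough average heap cost per additional field (assumes linear growth).
heap_before, heap_after = results["reader heap (MB)"]
heap_per_field_bytes = (heap_after - heap_before) * 1_000_000 / (MANY_FIELDS - FEW_FIELDS)
print(f"~{heap_per_field_bytes:.0f} bytes of heap per extra field")
```

Under those assumptions each extra field costs roughly 2KB of reader heap in this setup.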

Conclusion

While Lucene can be viewed as a sparse-column NoSQL DB, fields don't come for free. Most troubling is the increase in heap size. Measure before using.

See the source code in the Gist.

Thursday, June 11, 2015
