Yet Another Data Scientist

How much does it cost to have 100K unique fields in your Lucene index?

Motivation for many fields

What would your index schema look like if you wanted to let users search over their private data? Here are two options:

1. Two fields: 'content' and 'user'.
   When Bob searches for a Banana, the query is: content:banana AND user:bob.
   When Alice searches for an Apple, the query is: content:apple AND user:alice.
2. A field per user: '[username]-content'.
   When Bob searches for a Banana, the query is: bob-content:banana.
   When Alice searches for an Apple, the query is: alice-content:apple.

With option (1) the index has just two fields.
With option (2) the index would have a field per user.
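
To make the difference concrete, here is a minimal sketch of how the query string could be built under each option. The class and method names are made up for illustration, and the inputs are not escaped, so treat it as a sketch rather than production query building.

```java
// Hypothetical helper: builds the query strings shown above for each schema option.
class SchemaOptionQueries {

    // Option (1): one shared 'content' field, filtered by a 'user' field.
    static String sharedFieldQuery(String user, String term) {
        return "content:" + term + " AND user:" + user;
    }

    // Option (2): a dedicated '[username]-content' field per user.
    static String perUserFieldQuery(String user, String term) {
        return user + "-content:" + term;
    }

    public static void main(String[] args) {
        System.out.println(sharedFieldQuery("bob", "banana"));
        System.out.println(perUserFieldQuery("alice", "apple"));
    }
}
```

With option (1) the field names are fixed no matter how many users exist; with option (2) every new user adds a new field name to the index.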

How much does it cost?

I experimented by indexing 100K docs with 1M unique terms and 10 fields per doc, each field holding 0-10 terms.
I ran two variations that differ only in the number of unique fields:

10 unique fields
100K unique fields

The results show that with 100K fields the heap needed to open a reader grows considerably (14 -> 229MB), the index on disk is about 4 times larger (59 -> 253MB), and indexing takes much longer (6 -> 32 sec).
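
For a quick sense of scale, the ratios behind those numbers can be worked out with a few lines of Java. This is only arithmetic on the figures quoted above; the class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper: turns the before/after measurements into growth factors.
class FieldCostRatios {

    // Growth factor when going from 10 unique fields to 100K unique fields.
    static double ratio(double fewFields, double manyFields) {
        return manyFields / fewFields;
    }

    static String describe(String metric, double fewFields, double manyFields) {
        return String.format(Locale.ROOT, "%s: %.1fx", metric, ratio(fewFields, manyFields));
    }

    public static void main(String[] args) {
        System.out.println(describe("heap to open reader", 14, 229)); // heap to open reader: 16.4x
        System.out.println(describe("index size on disk", 59, 253));  // index size on disk: 4.3x
        System.out.println(describe("indexing time", 6, 32));         // indexing time: 5.3x
    }
}
```

The heap growth (roughly 16x) is far larger than the growth in disk size or indexing time.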
These results were obtained with Lucene v3.0.3 (yeah super old, I know).

Conclusion

While Lucene can be viewed as a sparse-column NoSQL DB, fields don't come for free. Most troubling is the increase in heap size. Measure before using.

See source code in Gist.

	import java.io.File;
	import java.io.IOException;
	import java.lang.management.ManagementFactory;
	import java.util.Random;
	import java.util.UUID;

	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.analysis.WhitespaceAnalyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.lucene.document.Field.Index;
	import org.apache.lucene.document.Field.Store;
	import org.apache.lucene.index.CorruptIndexException;
	import org.apache.lucene.index.IndexReader;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.queryParser.QueryParser;
	import org.apache.lucene.search.IndexSearcher;
	import org.apache.lucene.search.Query;
	import org.apache.lucene.search.ScoreDoc;
	import org.apache.lucene.store.FSDirectory;
	import org.apache.lucene.store.LockObtainFailedException;
	import org.apache.lucene.util.Version;

	public class FieldsIndexingMemTest {

	    private static IndexReader ireader;
	    private static IndexSearcher isearcher;
	    private static FSDirectory directory;
	    private static Analyzer analyzer;
	    private static QueryParser parser;
	    private static IndexWriter iwriter;
	    private static Random random = new Random();

	    private enum Mode {
	        fewFields, manyFields
	    }

	    // Switch the active line to run the other variation.
	    private static Mode mode = Mode.fewFields;
	    // private static Mode mode = Mode.manyFields;

	    @SuppressWarnings("deprecation")
	    public static void main(String[] args) throws Exception {
	        // Print the JVM name so the process is easy to find in a monitoring tool.
	        System.out.println(ManagementFactory.getRuntimeMXBean().getName());
	        System.out.println("mode=" + mode);
	        long before = System.currentTimeMillis();
	        printOutMemory("Before starting");

	        // Each mode gets its own index folder so the two runs don't mix.
	        File indexFolder = new File("C:\\temp\\index\\" + mode.toString());
	        directory = FSDirectory.open(indexFolder);
	        analyzer = new WhitespaceAnalyzer();
	        parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);

	        if (indexFolder.exists()) {
	            openWriterOverExistingIndex();
	        } else {
	            createNewIndex();
	        }

	        // Measure the heap cost of opening a reader and running one query.
	        printOutMemory("Before opening reader+searcher+running dummy query");
	        ireader = IndexReader.open(directory);
	        isearcher = new IndexSearcher(ireader);
	        Query query = parser.parse("name:a*");
	        System.out.println(query.rewrite(ireader));
	        ScoreDoc[] hits = isearcher.search(query, null, 1000000).scoreDocs;
	        printOutMemory("After opening reader+searcher+running dummy query");

	        // Pause here so the live process can be inspected before cleanup.
	        printOutMemory("Before closing Lucene objects");
	        System.out.println("Hit enter key to continue...");
	        System.in.read();

	        ireader.close();
	        iwriter.close();
	        directory.close();

	        System.out.println();
	        System.out.println("Done. Runtime duration=" + (System.currentTimeMillis() - before) + "ms");
	        printOutMemory("After closing Lucene objects");
	    }

	    // Requests GC several times, then prints used heap before and after.
	    private static void printOutMemory(String prefixMessage) {
	        Runtime runtime = Runtime.getRuntime();
	        long beforeUsedMemory = runtime.totalMemory() - runtime.freeMemory();
	        for (int i = 0; i < 10; i++) {
	            System.gc();
	        }
	        try {
	            Thread.sleep(500);
	        } catch (InterruptedException e) {
	            e.printStackTrace();
	        }
	        for (int i = 0; i < 10; i++) {
	            System.gc();
	        }
	        long afterUsedMemory = runtime.totalMemory() - runtime.freeMemory();
	        System.out.println(prefixMessage + " - Used memory=" + (afterUsedMemory / 1024 / 1024)
	                + "MB (beforeUsedMemory=" + beforeUsedMemory / 1024 / 1024 + "MB)");
	    }

	    private static void openWriterOverExistingIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
	        System.out.println("Opening existing index");
	        long before = System.currentTimeMillis();
	        iwriter = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
	        System.out.println("open existing index duration=" + (System.currentTimeMillis() - before) + "ms");
	    }

	    private static void createNewIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
	        System.out.println("Creating index from scratch");
	        iwriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

	        // Pool of terms that field values are drawn from.
	        int numUniqueTerms = 100 * 1000;
	        String[] uniqueTerms = new String[numUniqueTerms];
	        for (int i = 0; i < uniqueTerms.length; i++) {
	            uniqueTerms[i] = String.valueOf(i);
	        }

	        // Pool of field names, used only in manyFields mode.
	        int numUniqueFieldNames = 1 * 1000 * 1000;
	        String[] uniqueFieldNames = new String[numUniqueFieldNames];
	        for (int i = 0; i < uniqueFieldNames.length; i++) {
	            uniqueFieldNames[i] = "community_tag_" + UUID.randomUUID().toString();
	        }

	        int numOfDocs = 100 * 1000;
	        for (int i = 0; i < numOfDocs; i++) {
	            addNewDocument(numUniqueTerms, uniqueTerms, numUniqueFieldNames, uniqueFieldNames);
	            if (i % 1000 == 0) {
	                System.out.println("Progress: " + (100 * i / numOfDocs) + "% (wrote " + i + " documents)");
	            }
	        }
	        // release mem
	        uniqueTerms = null;
	        uniqueFieldNames = null;

	        printOutMemory("before commit()");
	        iwriter.commit();
	        printOutMemory("after commit()");
	    }

	    // Adds one document with 10 fields; the field names depend on the mode.
	    private static void addNewDocument(int numUniqueTerms, String[] uniqueTerms, int numUniqueFieldNames, String[] uniqueFieldNames)
	            throws CorruptIndexException, IOException {
	        Document doc = new Document();
	        for (int j = 0; j < 10; j++) {
	            String fieldName = (mode == Mode.fewFields)
	                    ? ("community_tag_" + j)
	                    : uniqueFieldNames[random.nextInt(numUniqueFieldNames)];
	            String fieldValue = getFieldValue(numUniqueTerms, uniqueTerms);
	            doc.add(new Field(fieldName, fieldValue, Store.YES, Index.NOT_ANALYZED_NO_NORMS));
	        }
	        iwriter.addDocument(doc);
	    }

	    // Builds a space-separated value from a random number of random terms.
	    private static String getFieldValue(int numUniqueTerms, String[] uniqueTerms) {
	        int termsInField = random.nextInt(10);
	        StringBuilder sb = new StringBuilder();
	        for (int w = 0; w < termsInField; w++) {
	            sb.append(uniqueTerms[random.nextInt(numUniqueTerms)]).append(" ");
	        }
	        return sb.toString();
	    }
	}

Slow Solr Startup - Disabling the Suggester solved it

My Solr, with 130M documents over 8 shards, takes 20 minutes of heavy CPU to start up.

The log showed big time gaps around the suggester being called:

[2/17/15 10:20:33:657 GMT] 000000ca SolrSuggester I org.apache.solr.spelling.suggest.SolrSuggester reload reload()
[2/17/15 10:20:33:657 GMT] 000000ca SolrSuggester I org.apache.solr.spelling.suggest.SolrSuggester build build()
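
Gaps like these are easier to spot with a small script than by eye. Below is a minimal sketch that parses the bracketed timestamp at the start of each line and reports the largest gap between consecutive lines. The class and method names are made up for illustration, and it assumes every line starts with a timestamp in the format shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: finds the largest time gap between consecutive log lines.
class LogGapFinder {

    // Matches timestamps such as "2/17/15 10:20:33:657" (the " GMT" suffix is cut off first).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yy H:mm:ss:SSS", Locale.ROOT);

    // Extracts the timestamp between '[' and " GMT]" at the start of a log line.
    static LocalDateTime timestampOf(String line) {
        int start = line.indexOf('[') + 1;
        int end = line.indexOf(" GMT]");
        return LocalDateTime.parse(line.substring(start, end), FORMAT);
    }

    // Returns the largest gap between two consecutive lines (zero for fewer than two lines).
    static Duration largestGap(List<String> lines) {
        Duration largest = Duration.ZERO;
        for (int i = 1; i < lines.size(); i++) {
            Duration gap = Duration.between(timestampOf(lines.get(i - 1)), timestampOf(lines.get(i)));
            if (gap.compareTo(largest) > 0) {
                largest = gap;
            }
        }
        return largest;
    }
}
```

Feeding it the lines of the startup log and printing `largestGap(lines)` points at the step where startup stalls.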

The suggester is configured by default in solrconfig.xml over non-existent fields. Despite that, and despite its request handler being marked for lazy startup, it still does a huge amount of work at startup.

Disabling the suggest searchComponent and requestHandler solved the issue.

I assume that the suggester is building a huge FST in memory.

<!--

<searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>      (org.apache.solr.spelling.suggest.fst)
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>     (org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory)
      <str name="field">cat</str>
      <str name="weightField">price</str>
      <str name="suggestAnalyzerFieldType">string</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

-->

Later on I saw that someone went through the same thing before me.

I downloaded Solr 4.10 and saw that the suggester is disabled in the stock solrconfig.xml; it was fixed via SOLR-6679.

Sunday, June 14, 2015

Solr soft commit Gotcha - OOM

Conclusion

Thursday, June 11, 2015

Cost of fields in Lucene index

Thursday, February 19, 2015

Slow Solr Startup - Disabling the Suggester solved it

Thursday, January 1, 2015

JavaTuning.com Vs this blog