Guidebook Engineering Blog Nov 05, 2015 2 minute read
Indexing HTML Documents In Elasticsearch
In preparation for a new “quick search” feature in our CMS, we recently indexed about 6 million documents containing user-submitted text into Elasticsearch. We had indexed about a million documents into our cluster via Elasticsearch’s bulk API before batches of documents began failing to index with `ReadTimeout` errors.
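The bulk API consumes a newline-delimited JSON body of alternating action and source lines. A minimal sketch of building such a payload in Python (the index name and document shape here are hypothetical, not from the post):

```python
import json

def bulk_payload(docs, index="cms-documents"):
    """Serialize documents into the newline-delimited JSON body that
    Elasticsearch's _bulk endpoint expects: one action line followed by
    one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps({"body": doc["body"]}))
    # a bulk body must be terminated by a trailing newline
    return "\n".join(lines) + "\n"
```

In practice a client library (e.g. the official Python client's bulk helper) builds and streams these payloads for you; the point is that each batch carries the raw document text straight into the analysis chain.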
We noticed huge CPU spikes accompanying the `ReadTimeout`s from Elasticsearch. The culprit, as it turned out, was a combination of our user-submitted text and our text analyzer configuration.
Briefly, we were running our strings through the standard tokenizer, lowercasing them, and then ngram tokenizing the results (e.g. “guidebook” becomes “gui”, “guid”, “guide”, “guideb”, etc.). This ngram strategy allows for nice partial matching: a user searching for “guidebook” can enter just “gui” and see results. Unfortunately, the ngram tokenizing became troublesome when users submitted Base64-encoded image files as part of an HTML document.
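The prefix expansion described above can be approximated in a few lines of pure Python; the gram sizes are assumptions for illustration, not the post's actual settings:

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    """Approximate an edge-ngram token filter: emit the prefixes of a
    token from min_gram up to max_gram characters (gram sizes assumed)."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]
```

So `edge_ngrams("guidebook")` yields `"gui"`, `"guid"`, `"guide"`, and so on up to the full word, and each of those grams is indexed as its own term, which is what makes prefix queries cheap at search time.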
The Base64 strings were prohibitively long, and Elasticsearch predictably failed while trying to ngram tokenize these giant files-as-strings.
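To get a feel for why this fails: if the analysis chain emits full ngrams (every substring in a size range, not just prefixes), the token count grows roughly linearly in string length times the number of gram sizes. A back-of-the-envelope sketch, with assumed gram sizes:

```python
def ngram_count(length, min_gram=3, max_gram=10):
    """Number of grams a full-ngram filter (min_gram..max_gram, assumed
    values) emits for a single token of the given length: one gram per
    (start offset, gram size) pair that fits inside the token."""
    return sum(max(0, length - size + 1) for size in range(min_gram, max_gram + 1))
```

A one-megabyte Base64 blob that survives tokenization as a single term would expand into nearly eight million grams under these settings, so the CPU spikes and timeouts are unsurprising.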
Never fear, we thought: Elasticsearch’s `html_strip` character filter would allow us to ignore the nasty `img` tags entirely. Unfortunately, we continued to receive timeout errors on batches of documents. Closer inspection of those documents revealed bad markup. For example, sometimes documents were missing the opening `<` at the beginning of their `img` tags, thereby “breaking” the `html_strip` filter – Elasticsearch was still trying to ngram tokenize Base64-encoded image files.
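For reference, wiring `html_strip` in front of the rest of the analysis chain looks roughly like the settings below (expressed as the Python dict you would pass when creating the index; the analyzer and filter names and gram sizes are illustrative, not the post's actual configuration):

```python
# html_strip runs before tokenization and removes tags wholesale, so a
# Base64 payload inside a *well-formed* <img src="..."> never reaches the
# tokenizer. A tag missing its opening "<" is just text, and slips through.
SETTINGS = {
    "analysis": {
        "analyzer": {
            "html_ngram": {                      # hypothetical analyzer name
                "type": "custom",
                "char_filter": ["html_strip"],   # strip markup first
                "tokenizer": "standard",
                "filter": ["lowercase", "partial_match"],
            }
        },
        "filter": {
            "partial_match": {                   # hypothetical filter name
                "type": "edge_ngram",
                "min_gram": 3,                   # assumed gram sizes
                "max_gram": 10,
            }
        },
    }
}
```

The ordering matters: character filters run before the tokenizer, which is exactly why malformed markup defeats the whole scheme.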
At this point, we decided that ngram tokenizing all of our text was not necessary for the “quick search” feature to succeed. We became more selective about which Elasticsearch fields we applied ngram tokenizing to. Disk usage and, perhaps more importantly, `ReadTimeout`s both decreased in frequency.
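Being selective comes down to the field mappings: point the ngram analyzer only at short, search-as-you-type fields and leave long user-generated HTML on a standard analyzer. A sketch with hypothetical field and analyzer names:

```python
# Only "title" gets the expensive ngram treatment; the long "body" field
# is indexed with the standard analyzer, so giant strings stay cheap.
MAPPING = {
    "properties": {
        "title": {"type": "text", "analyzer": "ngram_analyzer"},
        "body": {"type": "text", "analyzer": "standard"},
    }
}
```

This keeps the partial-matching behavior where quick search actually needs it while bounding the index size and CPU cost of pathological inputs.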