For a major project at my current day job, we’re using the excellent Thinking
Sphinx Ruby on Rails plugin to provide speedy
and useful search functionality.
The Thinking Sphinx developer has an introduction over on his blog, but here’s
the condensed version. Sphinx is a separate daemon that runs on your system,
indexing a range of attributes and fields back to an integer ‘document id’. In
Rails terms, this means it can index one or more of your model attributes to
the model id.
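A minimal sketch of what that looks like in a model (assuming the classic
Thinking Sphinx define_index API and hypothetical Book and Author models):

    class Book < ActiveRecord::Base
      belongs_to :author

      # Thinking Sphinx builds a Sphinx index over these fields and
      # attributes, keyed back to the book's integer id.
      define_index do
        indexes title
        indexes author.name, :as => :author_name
        has author_id, created_at
      end
    end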
What’s the point when you can just do a find() call on the model and let the
database do the searching? Sphinx is blindingly fast (even with very large
datasets), it can sort by relevance, and it can sort across models.
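Searching is then a one-liner (again a sketch with hypothetical models,
assuming Thinking Sphinx’s standard search methods):

    # Search one model; results come back ordered by relevance by default.
    books = Book.search "Alex Cihar"

    # Or search across every indexed model at once.
    results = ThinkingSphinx.search "Alex Cihar"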
There are three characteristics of our current project that recently made me
delve into some of the complexities of Sphinx.
Our main database was carefully constructed to be UTF-8 encoded
The database primarily contains data on books. Generally English language, but plenty of the titles and authors have non-English characters in them
Our office is in Australia, so all our users have English language keyboards
Say we have an author whose name is “Alex Čihař”. Authors keep us in business,
so we’d prefer to keep them happy. Therefore, mangling their name on our
promotional material is probably something to avoid, so we store Alex’s name in
our database, accents and all, and give ourselves a pat on the back.
Then one of our clients calls customer service and asks for a copy of the new
book from Mr Čihař. Our staff member does their best with the keyboard at hand
and searches the system for “Alex Cihar”.
I bet you can see where this is going. No results.
The trick here is to use a thing called character folding. The Unicode
Consortium has a technical report on the
topic if you feel like some bedtime reading. The long and the short of it is that
in our particular environment, we wanted all of the following characters to be
equivalent when searching: C c Ć ć Ĉ ĉ Ċ ċ Č č, and much the same for the
accented versions of all the other latin-ish looking characters.
Lucky for us, Sphinx has a feature called
charset_table for
just this situation. Its purpose in life is to map one Unicode character to
another and make them functionally equivalent in the indexes.
With some processing of a CSV
file provided by the
Unicode Consortium, it was possible to build a reasonable value for the
charset_table option.
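The processing itself is little more than string munging. As a rough sketch,
assuming a hypothetical two-column CSV that pairs each accented codepoint with
the plain character it should fold to, something like this would produce a
usable value:

    require 'csv'

    # Hypothetical input: each row pairs an accented codepoint ("00C0")
    # with the plain character it should fold to ("a").
    mappings = CSV.read("folding.csv").map do |codepoint, base|
      "U+#{codepoint}->#{base}"
    end

    # Characters that keep their usual values come first, then the folding
    # rules, all joined into the single line Sphinx expects.
    charset_table = (["0..9", "a..z", "_", "A..Z->a..z"] + mappings).join(", ")
    puts charset_table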
In Thinking Sphinx, it should go into your config/sphinx.yml file (with no line
breaks):
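(What follows is an abbreviated sketch; the real value is a single line listing
every folded codepoint, far too long to reproduce in full here.)

    development:
      charset_table: "0..9, a..z, _, A..Z->a..z, U+C0->a, U+C1->a, ..., U+1EEC->u"
      # repeat the same value under test: and production: as needed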
This charset_table option is saying that:
The characters 0..9, a..z and _ should be considered their usual values
A should be considered equal to a
M should be considered equal to m
Unicode Codepoint U+00C0 (À) should be considered equal to a.
Unicode Codepoint U+1EEC (Ử) should be considered equal to u.
etc.
Anything not listed in the table (e.g. ß, punctuation, Hebrew characters, Japanese
characters) is considered whitespace and cannot be searched for. If
these extra characters are important to you, don’t forget to add them in.
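For example, listing a codepoint on its own keeps it as a searchable character
in its own right; a sketch adding ß (U+DF) to the table, with the trailing
ellipsis standing in for the rest of the folding rules:

    charset_table: "0..9, a..z, _, A..Z->a..z, U+DF, ..."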
If you’re not using Thinking Sphinx, then the value can go directly into your
Sphinx config file under the appropriate index. Without Thinking Sphinx, you
should also make sure that your index is set to utf-8 mode using the
charset_type setting.
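A rough sketch of the relevant part of a hand-written Sphinx config (index name
hypothetical, charset_table abbreviated as before):

    index books_core
    {
      # ... source, path and the rest of your index settings ...
      charset_type  = utf-8
      charset_table = 0..9, a..z, _, A..Z->a..z, U+C0->a, ..., U+1EEC->u
    }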
Mr Čihař is happy, our customer is happy, and we don’t lose a potential sale.
Winners all around.