During a recent Rails project at IBM we had to deal with a large table consisting of customers. The table is made up of legacy enterprise data, and contains close to a million records.
Among many other fields, the table Customers
includes a column name
defined as VARCHAR
. name
is used to store company names.
One of the requirements for the project was to implement a Google Suggest-like feature. That is, when the user enters a few characters for the customer’s name, they should be prompted, via AJAX, with a list of possible suggestions containing that string (or at least beginning with that string.)
The naive approach would be to use a simple LIKE
predicate, to generate queries such as:
SELECT name
FROM Customers
WHERE name LIKE '%micro%'
ORDER BY name
One of the main problems with this approach is performance. Using LIKE
this way doesn’t allow us to take advantage of indexes that are defined on the column name
.
Unsurprisingly, on a modest server, such a query takes about 25 seconds. A UI that isn’t able to respond for 25 seconds will feel like an eternity for the end user, of course, so a simple query like this obviously cannot be seen as a viable solution.
Enter DB2 Text Search. You may be familiar with other full text search engines such as Ferret, Lucene, or Sphinx. DB2 V9.7.4 ships with an extremely powerful full text search engine that has plenty of useful features, including excellent integration with native XML columns.
Using the free edition of DB2 (DB2 Express-C) which includes Text Search, we were able to implement the autocomplete functionality we were after in a heartbeat, thanks to a query like the following:
SELECT name
FROM Customers
WHERE CONTAINS(name, 'micro') = 1
ORDER BY name
This query was executed in mere fractions of a second for most searches, and behaved exactly as needed. In fact, not only did we match strings that began with the searched token, but also ones that contained it elsewhere (e.g., micro
would match both Microsoft
and Sun Microsystems
.) The results where ordered alphabetically, but could have easily been ordered by relevance via the SCORE
function, also available via DB2 Text Search.
As usual, it’s a matter of using the right tool for the right job, and DB2 Text Search was created exactly for these kind of scenarios.
Let’s briefly look at how you can also go about setting up and playing with it on Linux.
Installing DB2 Text Search
To start with, you’ll need to have a 64bit Linux distro, and then follow these simple steps.
Download DB2 Express-C. Make sure you grab the db2exc_974_LNX_x86_64.tar.gz
file and not the Light edition.
Next, install DB2 by following these steps (the Installing DB2 section in particular.) Make sure you select a custom installation, and that you select everything including DB2 Text Search (which is not checked by default).
After you’ve started DB2 with db2start
via the instance user (e.g., db2inst1
,) launch the text search server by running:
db2ts start for text
To enable text search for an existing database, run:
db2ts enable database for text connect to mydb
Next, you’ll need to create a text search index and populate it. For example:
db2ts "create index customer_name for text on customers(name) connect to mydb" db2ts "update index customer_name for text connect to mydb"
That’s it. From now on you’ll be able to use functions such as CONTAINS
to quickly search for your data.
Such a function can also be easily invoked in Rails:
Customer.where("CONTAINS(name, ?) = 1", name)
Or wrapping it a little in a model:
class Customer < ActiveRecord::Base
# ...
def self.search(name, max_res = 10)
where('CONTAINS(name, ?) = 1', name).
order('name').
limit(max_res)
end
end
Which can then be invoked as follows:
Customer.search("micro")
(Technically we wouldn’t have needed limit
, so long as we passed RESULTLIMIT
to the CONTAINS
function.)
You can read more about DB2 Text Search at the Information Center.
Get more stuff like this
Subscribe to my mailing list to receive similar updates about programming.
Thank you for subscribing. Please check your email to confirm your subscription.
Something went wrong.
Queries using like “name%” actually should use index on name. It’s called prefix search. And normally you would also have a limit clause there, since you really don’t need to retrieve all matches when someone queries for something general like “peter%”.
What you probably wanted to write is a query with “%name%”. It has the same semantics as CONTAINS.
Good spotting, Johno. It was meant to be ‘%name%’ in the article from the very beginning (which was what took 25+ seconds). I fixed it now.
DB2 text search cannot distinguish between C++ and C#
Is this likely to be fixed any time?
thanks