Build a Scalable Search Engine with LISA
Robert J. Bond 3rd
Off-the-shelf Web search software can be a snap to install and
configure, but even the best solutions may have limitations, whether
technical or license-related. If you have special search needs,
you may want to build your own search engine, but what about scalability?
Many homegrown solutions (and some commercial ones) don't scale
well enough to handle tens or even hundreds of thousands of documents.
However, building a scalable solution is straightforward if you
start with a good blueprint.
In this article, I will show how to build a search engine based
on a simple design that I call LISA (lightweight search). The benefit
of LISA is industrial-strength search with unlimited fine-grained
control over indexing and retrieval. For simplicity and performance,
I use MySQL and Perl on Solaris and Linux, but this solution is
general enough to be adapted to other tools and platforms.
The Standard Way
Most search solutions are similar in overall design. Typically:
1. A spider collects documents and extracts the text.
2. An indexer makes appropriate entries in a storage system.
3. A query script fetches results from the storage system.
Spidering and text extraction are relatively easy tasks. This
article emphasizes the trickier problems of storage and retrieval
because a Web search engine's scalability is largely a function
of the storage system.
Scalable Storage
Many homegrown solutions employ a so-called inverted index design.
For each word (e.g., habanero), a script makes an entry in
a flat file consisting of the word and the URLs of all documents
containing the word. A number of clever techniques exist to speed
access to the data (e.g., by storing all the URLs for habanero
in a single text file of the same name), or you may use a hashed
file for greater speed or easier backup.
When it's time to do a query, a script simply fetches all
the URLs for each term, loads the URLs into separate arrays, performs
some Boolean logic (union, intersection, and/or difference) and
presents the results to the user. It sounds great, and this type
of solution works in many cases. However, the program is computing
the final query results in memory, using arrays that grow larger
as more documents are indexed. Eventually, the host system will
run out of RAM (not to mention other system resources). Obviously,
there is a scaling problem.
How can we compute the union, intersection, and difference of
large sets in a scalable way? The answer is to use a SQL database
as the backend. A SQL database can efficiently handle millions of
rows of data, and offers the systems administrator unlimited control
via SQL statements. We just need to set up a database with a couple
of tables, and feed the proper SQL query to the database. But before
we set up our database, we need a good relational model.
The Relational Model
Using a SQL database is no guarantee of good performance unless
our relational model is well designed. Table 1, the product of much
trial and error, shows two tables -- a doc table
and a word table. The doc table contains the URL and title for each
page, along with an auto-generated ID. The word table contains every
word found in a given document, along with the ID of the document
in which the word is found, and the position of the word in the
document (counting forward from the beginning of the document text).
This deceptively simple design allows us to do sophisticated queries
against a potentially massive number of documents. Listing 1 contains
the SQL code to generate the database tables. I've also added
a third table, spider, which will hold the URLs we've
extracted from various documents, so that we may index those as
well.
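As a rough sketch only -- the exact definitions are in Listing 1, and
the word_id and status column names here are inferred from the queries
and description in this article -- the schema looks something like this:
CREATE TABLE doc (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- auto-generated document ID
    url VARCHAR(255) NOT NULL,                   -- page URL
    title VARCHAR(255)                           -- page title
);
CREATE TABLE word (
    word VARCHAR(64) NOT NULL,   -- a single word from the document text
    word_id INT NOT NULL,        -- ID of the document containing the word (doc.id)
    position INT NOT NULL,       -- word offset from the start of the document
    INDEX (word),
    INDEX (word_id)
);
CREATE TABLE spider (
    url VARCHAR(255) NOT NULL,              -- URL waiting to be fetched
    status VARCHAR(16) DEFAULT 'pending'    -- fetch status used by the spider
);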
Spidering
Because the emphasis of this article is storage and retrieval,
I will use a very basic spider (Listing 2). The spider simply reads
a URL from the spider table, updates the item's status, then
requests the item's headers (via an HTTP HEAD request) and
makes sure that the item's type is "text/html" (because
we don't want to index binary files). It requests the item
if it is "text/html", extracts the text and links, and
stores the results in the database.
Put at least one URL in the spider table:
INSERT INTO spider (url) VALUES ('http://www...');
Run the spider script once, and then examine the contents of the spider
table:
SELECT * FROM spider;
Repeat this for the doc and word tables. Run the spider script several
times so the tables will grow. If you are going to run a large search
engine, you will want to tune the spider very carefully. (See the
comment "Insert links into the spider table" in Listing
2.)
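In database terms, one pass of the spider boils down to a handful of
statements along these lines. This is only a sketch: the status values,
placeholder URL, title, and the literal ID and position are illustrative,
and Listing 2 contains the real bookkeeping.
-- pick a URL that has not been fetched yet, then update its status
SELECT url FROM spider WHERE status = 'pending' LIMIT 1;
UPDATE spider SET status = 'done' WHERE url = 'http://www...';
-- after a successful text/html fetch, record the page itself
INSERT INTO doc (url, title) VALUES ('http://www...', 'Placeholder Title');
-- store each word with the document's ID and the word's position
INSERT INTO word (word, word_id, position) VALUES ('tortilla', 1, 42);
-- queue any links extracted from the page for later indexing
INSERT INTO spider (url) VALUES ('http://www...');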
Getting Results
To query the database, we need to construct a query that joins
the two tables and produces the correct result. If your SQL is a
little rusty, you may want to brush up on joins before moving forward.
The retrieval queries comprise three parts:
- The SELECT clause (specifying columns to output)
- The FROM clause (specifying which tables to use, and
how to join them)
- The WHERE clause (specifying our search criteria)
Here is a simple query for all pages containing a single word,
"taquito":
SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id WHERE word='taquito';
The query returns a table (Table 2) with two columns (URL and title).
To find all documents containing either of two words (Boolean
OR), the WHERE clause is slightly modified:
SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id \
WHERE word IN ('tortilla','burrito');
To search for multiple terms in a Boolean AND fashion, you
must do a series of self-joins. A self-join is simply a join
between a table and itself (in this case, the word table). In order
to refer to the table multiple times in a single query, use a table
alias (which is simply a nickname for the table). For simplicity,
the letters a, b, c, etc. are used. Our AND query looks like
this:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='burrito' AND b.word='tortilla'
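Each additional term simply adds one more alias and one more condition.
For example, a three-term query (the third term here is just an
illustration) looks like this:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
INNER JOIN word as c ON doc.id=c.word_id
WHERE a.word='burrito' AND b.word='tortilla' AND c.word='salsa'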
Queries with NOT
To find all the pages with the words "tortilla" and
"burrito", but not "pepper", the following unusual
solution works well. First, create a temporary table containing
all the pages with pepper, then use a "left outer join"
(explained below) to exclude these items.
First, create the temporary table (this statement is MySQL-specific):
CREATE TEMPORARY TABLE tmp
SELECT id FROM doc
INNER JOIN word as a ON doc.id=a.word_id
WHERE a.word='pepper';
And now the actual query:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
LEFT OUTER JOIN tmp on doc.id=tmp.id
WHERE a.word='burrito' AND b.word='tortilla' AND tmp.id IS NULL
What's Happening Here?
Well, to begin we find all the pages that contain our NOT
term (pepper) and store their IDs in the temporary table. Next, in
the SELECT statement that follows, we do a left outer join (i.e., a
"best effort" join, which keeps every row from the left-hand table
in the results even if no corresponding row exists in the right-hand
table).
Finally, we eliminate pages containing pepper with the condition
tmp.id IS NULL. For any document that does appear in the temporary
table (i.e., one that contains pepper), the join finds a match and
tmp.id is non-null, so that document is filtered out. For a document
that does not contain pepper, there is no matching row, the outer
join fills tmp.id with null, and the document stays in the results.
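Incidentally, on MySQL versions that support subqueries (4.1 and
later), roughly the same result can be written without the temporary
table. This is only a sketch, and the temporary-table approach above
may still perform better depending on your version and indexes:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='burrito' AND b.word='tortilla'
AND doc.id NOT IN (SELECT word_id FROM word WHERE word='pepper')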
Proximity and Phrases
It's easy to find exact phrases because we have stored the
position of each word in every document. To find documents containing
the phrase "New Mexico", we simply narrow our query to
those pages in which the position of "new" is equal to
the position of "mexico" minus one:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='new' AND b.word='mexico' AND a.position = b.position - 1
We're simply retrieving all documents containing both new and
mexico where the position of mexico equals the position of new plus
one.
To find "southwestern" near "cooking", we
simply specify that each term should occur within, say, 10 words
of the other:
SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='southwestern' AND b.word='cooking' AND ABS(a.position - b.position) < 11
In this example, the ABS() function returns the absolute value
of the difference between the positions of the two words.
Sort Order
The best way to return relevant results to the user is to be selective
during the spidering phase, indexing only worthwhile sites in the
first place. However, users will expect some kind of intelligent
sort order, especially when viewing large result sets. You can assign
an editorial quality rating to sites prior to spidering, or if currency
is the most relevant factor in your situation, you can capture each
page's "last modified" timestamp (typically provided
by the HTTP "Last-Modified" header) during spidering.
Whatever you choose as your sort field, all you need to do is
capture the appropriate information, store it in a new column
in the doc table, and append an ORDER BY clause to the SQL
query. If your column is called last_mod, append the following
to any query:
ORDER BY last_mod
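As a sketch, assuming last_mod is a DATETIME column (the type is an
assumption) filled in from the Last-Modified header during spidering,
the taquito query from earlier becomes:
-- add the sort column to the doc table
ALTER TABLE doc ADD COLUMN last_mod DATETIME;
-- the single-word query, sorted; add DESC to list the newest pages first
SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id
WHERE word='taquito' ORDER BY last_mod;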
View Cached Document
If you want to reconstruct the entire text as it was indexed, perhaps
to emulate Google's cached document feature, simply pull out
all the words for a given document and sort by position:
SELECT word FROM word WHERE word_id=111 ORDER BY position
Customizing and Extending
The LISA blueprint may be extended in a number of ways. For example,
by default we are indexing all words and their positions in each
document. This results in an average of 100 kilobytes of storage
for each document. If storage space is at a premium, you might ignore
stopwords (e.g., "the", "of", etc.). To use a
minimum of storage space, simply forgo storing position information
altogether (an acceptable option if proximity searching is not required).
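The simplest place to drop stopwords is in the spider, before they are
ever inserted; to prune an existing index, a statement along these lines
(the word list is only a sample) does the job:
-- remove a handful of common stopwords from the index
DELETE FROM word WHERE word IN ('the','of','and','a','to');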
Since LISA uses SQL, it's easy to generate reports on the
number of documents indexed, average number of words per document,
frequency counts for words, etc. You may set up a cron job to search
the database for certain words and then delete all documents containing
such words, or you may want to add search capabilities (e.g., the
ability to limit searches to a certain domain).
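A few illustrative queries along those lines (the domain in the last
one is a placeholder):
-- number of documents indexed
SELECT COUNT(*) FROM doc;
-- average number of words per document
SELECT COUNT(*) / COUNT(DISTINCT word_id) FROM word;
-- the twenty most frequent words in the index
SELECT word, COUNT(*) AS freq FROM word GROUP BY word ORDER BY freq DESC LIMIT 20;
-- limit a search to a single domain
SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id
WHERE word='taquito' AND url LIKE 'http://www.example.com/%';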
Conclusion
If you are frustrated by the limitations of ready-made search
solutions, you can always roll your own search engine, using standard
tools like MySQL and Perl, with the LISA blueprint. A lightweight
yet scalable design using SQL means you will avoid common bottlenecks
associated with less robust homegrown solutions. Best of all, it's
easy to add new functionality or to mimic the functionality of other
search engines.
Robert J. Bond 3rd is the Web manager for The College of Santa
Fe in New Mexico. The author would like to thank David Bond for
his support on the LISA project. Robert can be contacted at: rbond@csf.edu.