Build a Scalable Search Engine with LISA

Robert J. Bond 3rd

Off-the-shelf Web search software can be a snap to install and configure, but even the best solutions may have limitations, whether technical or license-related. If you have special search needs, you may want to build your own search engine, but what about scalability? Many homegrown solutions (and some commercial ones) don't scale well enough to handle tens or even hundreds of thousands of documents. However, building a scalable solution is straightforward if you start with a good blueprint.

In this article, I will show how to build a search engine based on a simple design that I call LISA (lightweight search). The benefit of LISA is industrial-strength search with fine-grained control over indexing and retrieval. For simplicity and performance, I use MySQL and Perl on Solaris and Linux, but this solution is general enough to be adapted to other tools and platforms.

The Standard Way

Most search solutions are similar in overall design. Typically:

1. A spider collects the document and extracts the text.

2. An indexer makes appropriate entries in a storage system.

3. A query script fetches results from the storage system.

Spidering and text extraction are relatively easy tasks. This article emphasizes the trickier problems of storage and retrieval because a Web search engine's scalability is largely a function of the storage system.

Scalable Storage

Many homegrown solutions employ a so-called inverted index design. For each word (e.g., habañero), a script makes an entry in a flat file consisting of the word and the URLs of all documents containing the word. A number of clever techniques exist to speed access to the data (e.g., by storing all the URLs for habañero in a single text file of the same name), or you may use a hashed file for greater speed or easier backup.

When it's time to do a query, a script simply fetches all the URLs for each term, loads the URLs into separate arrays, performs some Boolean logic (union, intersection, and/or difference) and presents the results to the user. It sounds great, and this type of solution works in many cases. However, the program is computing the final query results in memory, using arrays that grow larger as more documents are indexed. Eventually, the host system will run out of RAM (not to mention other system resources). Obviously, there is a scaling problem.

How can we compute the union, intersection, and difference of large sets in a scalable way? The answer is to use a SQL database as the backend. A SQL database can efficiently handle millions of rows of data and gives the systems administrator fine-grained control via SQL statements. We just need to set up a database with a couple of tables and feed the proper SQL query to the database. But before we set up our database, we need a good relational model.

The Relational Model

Using an SQL database is no guarantee of good performance unless our relational model is well designed. Table 1, the product of a good deal of trial and error, actually shows two tables -- a doc table and a word table. The doc table contains the URL and title for each page, along with an auto-generated ID. The word table contains every word found in a given document, along with the ID of the document in which the word is found and the position of the word in the document (counting forward from the beginning of the document text).

This deceptively simple design allows us to do sophisticated queries against a potentially massive number of documents. Listing 1 contains the SQL code to generate the database tables. I've also added a third table, spider, which will hold the URLs we've extracted from various documents, so that we may index those as well.
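Listing 1 is not reproduced here, but a minimal sketch of the schema might look like the following. The database name, column types, the status column on the spider table, and the secondary indexes are illustrative assumptions rather than a copy of Listing 1:

CREATE DATABASE lisa;
USE lisa;

CREATE TABLE doc (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- auto-generated document ID
  url VARCHAR(255) NOT NULL,
  title VARCHAR(255)
);

CREATE TABLE word (
  word_id INT NOT NULL,      -- ID of the document containing the word (doc.id)
  word VARCHAR(64) NOT NULL,
  position INT NOT NULL,     -- offset from the start of the document text
  INDEX (word),
  INDEX (word_id)
);

CREATE TABLE spider (
  url VARCHAR(255) NOT NULL,
  status VARCHAR(16)         -- updated by the spider as it works the queue
);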

Spidering

Because the emphasis of this article is storage and retrieval, I will use a very basic spider (Listing 2). The spider simply reads a URL from the spider table, updates the item's status, then requests the item's headers (via an HTTP HEAD request) and makes sure that the content type is "text/html" (because we don't want to index binary files). If the type checks out, the spider requests the document itself, extracts the text and links, and stores the results in the database.
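To make that last step concrete, here is a sketch of the kind of SQL an indexing pass might issue for a single page. The URL, title, and words are placeholders; LAST_INSERT_ID() is how MySQL hands back the auto-generated doc ID:

INSERT INTO doc (url, title) VALUES ('http://www.example.com/', 'Example Page');
SET @doc_id = LAST_INSERT_ID();

INSERT INTO word (word_id, word, position) VALUES (@doc_id, 'green', 1);
INSERT INTO word (word_id, word, position) VALUES (@doc_id, 'chile', 2);

-- Links extracted from the page go back into the queue:
INSERT INTO spider (url) VALUES ('http://www.example.com/recipes.html');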

Put at least one URL in the spider table:

INSERT INTO spider (url) VALUES ('http://www...');

Run the spider script once, and then examine the contents of the spider table:

SELECT * FROM spider;

Repeat this for the doc and word tables. Run the spider script several times so the tables will grow. If you are going to run a large search engine, you will want to tune the spider very carefully. (See the comment "Insert links into the spider table" in Listing 2.)
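Much of that tuning happens inside the spider script, but you can also police the queue directly in SQL. For example, one crude way to keep a crawl focused on a single site (the domain below is a placeholder) is to prune out-of-scope URLs:

DELETE FROM spider WHERE url NOT LIKE 'http://www.example.edu/%';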

Getting Results

To query the database, we need to construct a query that joins the two tables and produces the correct result. If your SQL is a little rusty, you may want to brush up on joins before moving forward.

The retrieval queries comprise three parts:

  • The SELECT clause (specifying columns to output)
  • The FROM clause (specifying which tables to use, and how to join them)
  • The WHERE clause (specifying our search criteria)

Here is a simple query for all pages containing a single word, "taquito":

SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id WHERE word='taquito';

The query returns a table (Table 2) with two columns (URL and title).

To find all documents containing either of two words (Boolean OR), the WHERE clause is slightly modified:

SELECT DISTINCT url,title FROM doc INNER JOIN word ON doc.id=word_id
  WHERE word IN ('tortilla','burrito');

To search for multiple terms in a Boolean AND fashion, you must do a series of nested self-joins. A self-join is simply a join between a table and itself (in this case, the word table). In order to refer to the table multiple times in a single query, use a table alias (which is simply a nickname for the table). For simplicity, the letters a, b, c, etc. are used. Our AND query looks like this:

SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='burrito' AND b.word='tortilla'

Queries with NOT

To find all the pages with the words "tortilla" and "burrito", but not "pepper", the following unusual solution works well. First, create a temporary table containing all the pages with pepper, then use a "left outer join" (explained below) to exclude these items.

Create the temporary table (this statement is MySQL-specific):

CREATE TEMPORARY TABLE tmp
SELECT id FROM doc
INNER JOIN word as a ON doc.id=a.word_id
WHERE a.word='pepper';

And now the actual query:

SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
LEFT OUTER JOIN tmp ON doc.id=tmp.id
WHERE a.word='burrito' AND b.word='tortilla' AND tmp.id IS NULL

What's Happening Here?

Well, to begin, we find all the pages that contain our NOT term (pepper) and store their IDs in the temporary table. Next, in the main SELECT statement, we do a left outer join (i.e., a "best effort" join, which keeps every row from the left-hand table in the results even if no corresponding row exists in the right-hand table).

Finally, we eliminate pages containing pepper by requiring that tmp.id be NULL in the joined result. The id column stored in the temporary table is, of course, never null, so tmp.id can only be NULL where the outer join found no match -- in other words, where the page does not appear in the temporary table. That is sufficient to exclude any page containing pepper from our results.
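One housekeeping note: the temporary table persists until your session ends, so if you experiment with different NOT terms from the same mysql session, drop the old one before recreating it:

DROP TABLE tmp;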

Proximity and Phrases

It's easy to find exact phrases because we have stored the position of each word in every document. To find documents containing the phrase "New Mexico", we simply narrow our query to those pages in which the position of "new" is equal to the position of "mexico" minus one:

SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='new' AND b.word='mexico' AND a.position = b.position - 1

We're simply retrieving all documents containing both new and mexico where the position of mexico equals the position of new plus one.

To find "southwestern" near "cooking", we simply specify that the two terms must occur within a certain number of words of each other -- say, 10:

SELECT DISTINCT url,title
FROM doc
INNER JOIN word as a ON doc.id=a.word_id
INNER JOIN word as b ON doc.id=b.word_id
WHERE a.word='southwestern' AND b.word='cooking' AND ABS(a.position - b.position) < 11

In this example, the ABS() function returns the absolute value of the difference between the positions of the two words.

Sort Order

The best way to return relevant results to the user is to be selective during the spidering phase, indexing only worthwhile sites in the first place. However, users will expect some kind of intelligent sort order, especially when viewing large result sets. You can assign an editorial quality rating to sites prior to spidering, or if currency is the most relevant factor in your situation, you can capture each page's last-modified timestamp (typically provided by the HTTP "Last-Modified" header) during spidering.

Whatever you choose as your sort field, all you need to do is capture the appropriate information, store it in a new column in the doc table, and append an ORDER BY clause to any SQL query. If your column is called last_mod, append the following to any query:

ORDER BY last_mod
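For example, assuming you add a last_mod column to the doc table (the DATETIME type here is an assumption), the single-word query from earlier becomes the following. I include last_mod in the select list so the sort column travels with the results, and DESC puts the most recently modified pages first:

ALTER TABLE doc ADD COLUMN last_mod DATETIME;

SELECT DISTINCT url,title,last_mod FROM doc INNER JOIN word ON doc.id=word_id
  WHERE word='taquito' ORDER BY last_mod DESC;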

View Cached Document

If you wanted to reconstruct the entire text as it was indexed, perhaps to emulate Google's cached document function, simply pull out all the words for a given document and sort by position:

SELECT word FROM word WHERE word_id=111 ORDER BY position

Customizing and Extending

The LISA blueprint may be extended in a number of ways. For example, by default we are indexing all words and their positions in each document. This results in an average of 100 kilobytes of storage for each document. If storage space is at a premium, you might ignore stopwords (e.g., "the", "of", etc.). To use a minimum of storage space, simply forgo storing position information altogether (an acceptable option if proximity searching is not required).
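If you decide to drop stopwords after the fact, a single DELETE prunes them from an existing index (the stopword list below is only a tiny sample):

DELETE FROM word WHERE word IN ('the','of','and','a','to','in','is');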

Since LISA uses SQL, it's easy to generate reports on the number of documents indexed, average number of words per document, frequency counts for words, etc. You may set up a cron job to search the database for certain words and then delete all documents containing such words, or you may want to add search capabilities (e.g., the ability to limit searches to a certain domain).
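Here, for instance, are a few report queries of that sort, using only the doc and word tables described earlier:

SELECT COUNT(*) FROM doc;                            -- total documents indexed
SELECT COUNT(*)/COUNT(DISTINCT word_id) FROM word;   -- average words per document
SELECT word, COUNT(*) AS freq FROM word
  GROUP BY word ORDER BY freq DESC LIMIT 10;         -- ten most common words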

Conclusion

If you are frustrated by the limitations of ready-made search solutions, you can always roll your own search engine, using standard tools like MySQL and Perl, with the LISA blueprint. A lightweight yet scalable design using SQL means you will avoid common bottlenecks associated with less robust homegrown solutions. Best of all, it's easy to add new functionality or to mimic the functionality of other search engines.

Robert J. Bond 3rd is the Web manager for The College of Santa Fe in New Mexico. The author would like to thank David Bond for his support on the LISA project. Robert can be contacted at: rbond@csf.edu.