Poor Man's Search Engine

Br. David Carlson

What do you do if you want a keyword search engine on your Web site but have no funds for this? Some freeware solutions are available but may also deliver annoying ads. However, if your Web site is not heavily used, you may be able to develop your own "quick and dirty" solution. This article shows how I handled this problem in Linux. The result was an open-source, freeware search engine called QSearch. It consists primarily of a bash shell script and a compiled CGI program written in C++. Although our server uses Red Hat Linux and the Apache Web server, the software may be adaptable to other settings.

What's the Plan?

I began by looking at the data. A search engine on a page must allow the user to look up Web pages by specifying one or more keywords or phrases. Web pages often use tags such as the following at the top of each file. These tags give a description of the page and keywords by which the page can be retrieved:

<META NAME="DESCRIPTION" CONTENT="Lab Problem Report Form">
<META NAME="KEYWORDS" CONTENT="Problem Report,Lab Problem,Problem">
A CGI (common gateway interface) script could be used as a search engine, but if it had to search through all of the HTML files on a Web site to find the above type of META tags each time someone did a search, it would probably run too slowly. We may not need blazing speed, but we do want results before the user gives up on our search engine. Greater speed could be obtained by collecting the data from these META tags and saving it all in a single file, which could then be more quickly searched by a CGI script. The following example shows the format that was used for this text file:

/java.html|Java Information#Java#Java Information#CS 310#CS310#
/jobs.html|Computing Career Links#Career Links#Career#Careers#
/itwd/milestones.html|Milestones for Grant Project#Milestones#ITWD#
/carlsond/cs321/web/javascript.html|Notes on JavaScript#JavaScript#
Each line contains the information about a particular Web page. The part before the pipe symbol (|) is the URL, minus the invariant leading section, which was http://cis.stvincent.edu in my case. Between the "|" and the first "#" symbol is the description of the Web page, and between each neighboring pair of "#" symbols, there is a keyword or phrase. This format allows us to later find a particular keyword or phrase and to distinguish it from the description and URL.

Thus, I hatched a plan to periodically run a bash script to collect the data and place it in a text file. To do a search, users access an HTML form and fill in their keyword(s). Clicking the submit button on the form sends the desired keyword(s) to a second bash script, the CGI search engine. This second script was later replaced by a compiled program to provide better security and speed. For the moment, however, we will look at the scripts because they show what processing needs to be done.

The Shell Scripts

Harvesting the Data

The getmeta script (Listing 1) gathers the data and saves it in a file called "keywordfile". This script can automatically run every night so that as users add Web pages and adjust the META tag information, the file is kept up to date. (Listings for this article are available from: www.sysadminmag.com.)

You must decide which Web files to include in the scope of the search engine. The getmeta script uses the find command to scan all files with names of the form *.html that are in the directory whose name is in the TARGET variable. By default, TARGET contains /www. The script also scans all Web files in the directory trees that begin with any of the subdirectories named in the SUBDIRS variable. On this system, I only wanted to descend into the directory trees for a few particular users or projects. This section should obviously be adjusted to suit each particular Web site.
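
For example, assuming GNU find (with its -maxdepth option), the file list might be gathered along the following lines; this is only a sketch of what Listing 1 does, writing the filenames to the temporary list file described below:

# Collect *.html files directly under TARGET, then descend into each
# subdirectory tree named in SUBDIRS.
find $TARGET -maxdepth 1 -name "*.html" > $TMP
for dir in $SUBDIRS
do
   find $dir -name "*.html" >> $TMP
done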

The getmeta script writes the names of the specified Web files to a temporary text file, whose name is stored in the variable "TMP". A file is used because the amount of data might be fairly large if many Web files are processed. The filenames are then read, one at a time, from this temporary file using a loop that has its input redirected to come from the file. The rough outline of this loop is as follows:

while read filename
do
   Process the filename as desired
done < $TMP
This is a pattern with many useful applications. The script processes each filename by first checking whether users have read access, as there is no sense in scanning a file that users cannot read. The cut command is used to look for the "r" permission in the correct column of a long listing for the file. The column number might need to be adjusted to fit with the long listing format on your system.
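
The permission test might be sketched as follows; the character position assumes the usual ls -l layout, where the world-readable bit appears in column 8:

# Skip files that ordinary users cannot read; the world-readable "r" bit
# sits in column 8 of a long listing on most systems.
PERM=`ls -l "$filename" | cut -c8`
if [ "$PERM" != "r" ]
then
   continue
fi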

Next, the script uses grep to get the lines of each file that contain a tag starting with the string "<meta ". The output is piped to another grep, which keeps only those lines that also contain ="keywords". This output is placed in the KEYS variable. If the data in this variable is nonzero in size, a similar grep is used to extract the description string. The data in KEYS is then piped into a cut that extracts the third field where "=" is used as the delimiter. This skips over the NAME= and CONTENT= part of the line, giving the keywords section that follows the second "=". Then the translate command (tr) is used to replace the commas by # symbols and to delete the "> that ends the meta tag line. The # symbols are used to give us the file format already described above for keywordfile. The description data held in the DESCRIP variable is then refined in a similar way.

Following this, cut is used to skip over the first few characters of the filename. When using the default settings, where all HTML files are under the /www directory, this amounts to skipping over the initial /www and thus keeping the characters in column 5 onward. We do not want the /www as it will not be part of the final URL for this file. The line starting with the echo command is then used to send the modified filename, description, and keywords to another temporary file. Note that the "|" symbol is inserted to separate the filename from the description and that the keywords are surrounded by # symbols.
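
Putting these pieces together, the per-file processing might look roughly like the following sketch. The variable names KEYS, DESCRIP, and START follow the article, while TMP2 stands in for the second temporary file; the exact quoting in Listing 1 may differ:

# Grab the keywords META line; skip the file if there is none.
KEYS=`grep -i '<meta ' "$filename" | grep -i '="keywords"'`
if [ -n "$KEYS" ]
then
   # Description: third "="-delimited field, with the quotes and ">" removed.
   DESCRIP=`grep -i '<meta ' "$filename" | grep -i '="description"' | cut -d= -f3 | tr -d '">'`
   # Keywords: same idea, but also turn the commas into "#" separators.
   KEYWORDS=`echo "$KEYS" | cut -d= -f3 | tr ',' '#' | tr -d '">'`
   # Drop the leading /www from the filename to get the URL path.
   URL=`echo "$filename" | cut -c$START-`
   # One record per page: URL|description#keyword#keyword#...#
   # tr -d '\015' strips carriage returns left by Windows-based editors.
   echo "${URL}|${DESCRIP}#${KEYWORDS}#" | tr -d '\015' >> $TMP2
fi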

The tr -d '\015' may not be needed on some servers. It removes carriage returns that got inserted when users edited their Web files from Windows-based machines. This can happen on servers that run the Samba software (http://www.samba.org) that allows a UNIX or Linux machine to imitate an NT server. By removing the carriage returns, we get a proper UNIX text file. At the very end, the getmeta script copies the newly harvested data over the top of any existing keywordfile and removes its temporary files.

Matchmaker, Matchmaker

The search program began as a bash script (Listing 2). It is a CGI script, so your Web server must be configured to allow CGI scripts to run. This script is the program that receives the data from the Web-based form when the submit button is clicked. See Listing 3 for the search.html file. It contains the form with three text boxes that the user can fill in so as to specify up to three keywords or phrases for which to search. Actually, the data submitted from the form is sent (as one long URL-encoded string) to the uncgi program, which parses the data and places it into environment variables that start with the characters "WWW_". You can see that uncgi is specified in the following line of the search.html file:

<FORM METHOD="POST" ACTION="../cgi-bin/uncgi/search">
The CGI script can then access the data in these variables, here named WWW_Key1, WWW_Key2, and WWW_Key3. The uncgi program, normally installed in the cgi-bin directory, thus breaks apart a URL-encoded data string such as the following:

Key1=Java&Key2=Java+script&Key3=VB+script
This is the type of data string sent from the user's Web browser. It would be awkward to handle this string directly in the script. Instead, the above POST command specifies that the string is sent to the uncgi program, which conveniently places the data into three separate environment variables that the CGI script can then easily access. It is as if we did the following assignments:

WWW_Key1="Java"
WWW_Key2="Java script"
WWW_Key3="VB script"
The uncgi program is available at a number of Web sites (e.g., http://www.prw.net/support/cgi/uncgi.htm).

The first section of the search script sets up some variables and increments a counter to show that the search engine has been accessed one more time. The count itself is kept in a text file that can be examined whenever you want to know how much your search engine has been used. The count could also be displayed on the Web page showing the results of each search, if desired.
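
The bookkeeping amounts to a few lines such as these (a sketch; COUNTFILE here stands for whatever file name the script uses for the counter):

# Bump the usage counter kept in a small text file.
COUNT=`cat $COUNTFILE`
COUNT=`expr $COUNT + 1`
echo $COUNT > $COUNTFILE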

Next, the script figures out which of the WWW variables contain nothing (zero bytes) with the -z test. The values in the variables are copied as needed so that all three variables contain a value, where duplicates are used to fill in for an empty value. For example, if the user enters C++ in a text box and leaves the other two boxes blank, the script copies C++ into the variables corresponding to these two other boxes.
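
The fill-in step might be sketched as follows for the common case; the actual script in Listing 2 handles the other combinations of empty boxes as well:

# If the second or third box was left blank, reuse the first keyword
# so that all three variables hold something to search for.
if [ -z "$WWW_Key2" ]
then
   WWW_Key2="$WWW_Key1"
fi
if [ -z "$WWW_Key3" ]
then
   WWW_Key3="$WWW_Key1"
fi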

The grep -i command is then used to do a case-insensitive search in keywordfile for the keyword in the first WWW variable. The # symbols are used to be sure that any match is for a keyword and not a word that simply appears in a description or URL. The results of this grep are piped into a second grep that looks for the keyword in the second WWW variable. The output of this last command is piped into a third grep that looks for the keyword given by the last WWW variable. Thus, we get just those lines of the keywordfile that contain all three keywords in the keyword section of the line. This data is written to a temporary file.
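
The heart of the search is thus a short pipeline along these lines (a sketch; KEYFILE names the keywordfile and TMP the temporary file of matches):

# Keep only the lines whose keyword section contains all three keywords;
# -i makes the match case insensitive, and the "#" symbols anchor each
# match to a complete keyword or phrase.
grep -i "#${WWW_Key1}#" $KEYFILE | grep -i "#${WWW_Key2}#" | grep -i "#${WWW_Key3}#" > $TMP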

The search script now outputs what needs to be sent to the user's Web browser to display a page about any matches that were found. We first print out the string Content-type: text/html, followed by a blank line. Then we output the information about matches, marked up with appropriate HTML tags. To make things easier, the initial lines of HTML are copied from the file named by the HEAD variable.
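
In a CGI script, that amounts to something like:

# Every CGI response begins with a content-type header and a blank line;
# the canned top of the results page then follows.
echo "Content-type: text/html"
echo ""
cat $HEAD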

The -z test is used to see if there was no keyword for which to search. If so, an error message is displayed on the user's Web page. Next, we reuse our favorite loop pattern, with input redirected to come from the temporary file of matches:

while read item
do
   Process item as need be
done < $TMP
Note that the number of lines (matches) is counted in the MATCHES variable. The filename is extracted from each item (line) by using cut -d "|" -f 1, which obtains the first field that is delimited by the | symbol. The description field is extracted in a similar way, though two uses of cut are needed: one to extract the second field delimited by |, and the other to get the first field delimited by #.

The data for each match is then written out as a list item in an ordered list. The URL portion is written out as a clickable link, with TOPURL preceding the filename so as to give a complete URL. The TOPURL variable contains the value ("http://cis.stvincent.edu" on our server) that must precede all Web filenames on the system. Note that if Web files on your system are scattered under users' home directories, then this will not work. The description is written out immediately after the URL. The Web server automatically sends this output to the user's browser.
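
The body of the loop might therefore look roughly like the sketch below; FILENAME and DESCRIP are illustrative names, MATCHES starts at 0 before the loop, and the surrounding <OL> tags are written outside the loop:

# Split one matching line into its URL and description fields and
# print it as a clickable list item.
MATCHES=`expr $MATCHES + 1`
FILENAME=`echo "$item" | cut -d "|" -f 1`
DESCRIP=`echo "$item" | cut -d "|" -f 2 | cut -d "#" -f 1`
echo "<LI><A HREF=\"${TOPURL}${FILENAME}\">${TOPURL}${FILENAME}</A> ${DESCRIP}"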

Finally, the search script checks the number of matches to see whether it was zero so that a message can be printed for that special case. Some closing HTML is written out, and the script is done.

Security Concerns

Although the above search script worked fine, there were a couple of concerns. CGI scripts are often susceptible to hacking attempts, such as supplying bad input like the following:

C++ | cat /etc/passwd
A hacker could submit this via the form in search.html with the hope that the | symbol would cause the script to execute the command to cat out the password file. This would show the login IDs for all users, allowing the hacker to try a dictionary attack on user passwords. Although this type of attack did not seem to work on our system, it might be possible for someone to find an attack that worked. Information on setting up the Apache Web server to reduce security problems can be found at the Apache Web site (http://www.apache.org). Information about CGI security problems can be found in "Safer CGI Scripting" by Charles Walker and Larry Bennett, Sys Admin, February 2001.

Another concern is that a script is interpreted and runs more slowly than a compiled program. Because our search engine was not being run often and the response time was brief, this was not a big concern. Still, usage might increase in the future, so it could help to have a compiled program as the search engine. Because it is also easier with a compiled program to weed out bad input that might indicate an attack, I decided to switch to a compiled C++ program.

Using a Compiled CGI Program

The search.cpp program performs the same overall task as the search CGI script. However, the GetValue function, found in stringhelp.cpp, does some additional processing to reject bad input. This function gets the value of a WWW environment variable. The function is careful not to overflow the Result array when copying characters from the environment variable. It also only copies alphanumeric characters, the + sign, the - sign, the period character, the space character, and the NULL character that marks the end of a string. At the first sign of any other character (such as |, or other hacker favorites), the function quits and returns the empty string in Result. You can adjust the code in GetValue if you want to allow additional characters, but be careful what you allow. The rest of the C++ program will not be examined here as it does the same processing as the old search script.
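
The filtering might be sketched as follows. The function name and the character whitelist come from the description above, but this is only an approximation of what stringhelp.cpp actually contains:

// Sketch of GetValue-style input filtering: copy the named environment
// variable into Result, allowing only whitelisted characters and never
// overflowing the buffer. Any other character causes the whole value
// to be rejected by returning the empty string.
#include <stdlib.h>
#include <ctype.h>

const int StrMax = 800;   // defined in stringhelp.h in the real code

void GetValue(const char *VarName, char Result[])
{
   Result[0] = '\0';
   const char *Value = getenv(VarName);
   if (Value == NULL)
      return;
   int k;
   for (k = 0; k < StrMax - 1 && Value[k] != '\0'; k++)
   {
      unsigned char ch = Value[k];
      if (isalnum(ch) || ch == '+' || ch == '-' || ch == '.' || ch == ' ')
         Result[k] = ch;
      else
      {
         Result[0] = '\0';   // bad character: return the empty string
         return;
      }
   }
   Result[k] = '\0';
}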

Configuration and Installation

QSearch can be downloaded from my Web site:

http://cis.stvincent.edu/carlsond/software/software.html
A Readme file is included to explain configuration and installation issues. The main steps are covered here.

What do you need in order to use this software? It may be a reasonable choice if you do not have a huge number of Web pages and the search engine will not be heavily used. You need to have the g++ compiler to compile the C++ program. You also need to have the uncgi program installed, and the Web server must be configured to run CGI programs. The software assumes that all Web pages to be searched are located under a common directory, such as the default /www directory. The META tags containing the keywords and descriptions must fit the format mentioned at the start of this article.

Edit the getmeta script to adjust the following four lines for your situation:

TARGET="/www"
START=5
SUBDIRS="/www/itwd /www/carlsond /www/carrc /www/morrisoh /www/hicksb"
KEYFILE="/www/cgi-bin/keywordfile"
Note that TARGET should indicate the directory under which all of your Web files are located. The START number must be one more than the number of characters in this TARGET string. SUBDIRS should hold a string containing any subdirectories of the TARGET directory into which you want to descend to find Web files. Finally, KEYFILE should give the location for the keywordfile that getmeta generates, a location within the directory for CGI programs.

Edit the search.cpp file and adjust the following lines as needed:

#define TOPURL     "http://cis.stvincent.edu"
#define HEAD       "head.html"
#define KEYFILE    "keywordfile"
#define COUNTFILE  "search.count"
The first line should give the initial common part of all URLs on your Web site. The second gives the name of the HTML file that is to be displayed as the top part of the search-results Web page. A sample head.html file is supplied. The third and fourth lines probably do not have to be changed, although it is important that KEYFILE contain the exact name for the text file created by the getmeta script. The only other item that you might want to modify is the following line of the file stringhelp.h:

const int StrMax = 800;
This should be set to a reasonable maximum string length. Remember that each line of META tag data is stored in one of these strings. If your Web pages contain META tag lines with long descriptions or long lists of keywords, it is possible that you will need a longer StrMax. Now start the compiler with:

g++ search.cpp stringhelp.cpp -o search -s
This should produce an executable search program. Then log in as root and change the permissions and ownership of the getmeta script as follows:

chmod 700 getmeta
chown root.root getmeta
Use root's crontab entry to schedule getmeta to be run once a day:

crontab -e
The crontab entry might look like the following if you want to run getmeta at 5:45 a.m. every day. Adjust the path to getmeta as needed:

45 5 * * * /usr/local/bin/getmeta
Move head.html and your compiled search program to the directory for your CGI programs. Change the permissions and ownership as follows:

chown root.root head.html search
chmod 755 search
chmod 644 head.html
In this same CGI directory, create a file named search.count and place the number 0 into it. This file will contain the count of how many times the search engine has been used. You can create this file as follows:

echo 0 > search.count
chown root.root search.count
chmod 666 search.count
You may be able to be more restrictive here. For example, if your Web server runs as user webmaster, change the last two commands to:

chown webmaster.users search.count
chmod 644 search.count
Then ordinary users cannot change the file, although they can read it.

Place the search.html file in any directory of Web pages you wish. You can edit this file to add graphics or other enhancements. Adjust ownership and permissions as follows:

chown root.root search.html
chmod 644 search.html
Add a link to search.html in those Web pages where you wish to provide access to the search engine.

You are now ready to take the search engine for a test drive. Run getmeta once manually as root to produce a keywordfile. This can be done by entering ./getmeta while in the directory that contains this script. Then use a Web browser to look at the search.html file and try out the search engine.

Possible Improvements

One enhancement is to allow spaces after the commas separating keywords in the META tags. This was not done here in order to keep the programming simple. It might also be desirable to allow more general types of searches than just exact matches for keywords, although that would also complicate the programming. The getmeta script could probably be written more compactly in Perl. You could even write your own search daemon if you know sockets programming. However, such a complex programming project would take us away from the original goal of having a simple, easy-to-produce search engine. Refer to:

http://cis.stvincent.edu/carlsond/cs330/unix/unix.html
for my UNIX Web page with links to UNIX and Linux information, including tables and examples on writing shell scripts. Refer to:

http://cis.stvincent.edu/carlsond/swdesign/
for my Web pages on C++ programming, which might be helpful in modifying the search program.

Br. David Carlson is a Benedictine monk as well as chairperson and associate professor in the Computing & Information Science Department at Saint Vincent College. When his primary jobs allow, he can often be found doing systems administration on his department's Linux server. He can be reached at: carlsond@stvincent.edu.