Poor Man's Search Engine
Br. David Carlson
What do you do if you want a keyword search engine on your Web
site but have no funds for this? Some freeware solutions are available
but may also deliver annoying ads. However, if your Web site is
not heavily used, you may be able to develop your own "quick
and dirty" solution. This article shows how I handled this
problem in Linux. The result was an open-source, freeware search
engine called QSearch. It consists primarily of a bash shell script
and a compiled CGI program written in C++. Although our server uses
Red Hat Linux and the Apache Web server, the software may be adaptable
to other settings.
What's the Plan?
I began by looking at the data. A search engine for a Web site must
allow the user to look up Web pages by specifying one or more keywords
or phrases. Web pages often use tags such as the following at the
top of each file. These tags give a description of the page and
keywords by which the page can be retrieved:
<META NAME="DESCRIPTION" CONTENT="Lab Problem Report Form">
<META NAME="KEYWORDS" CONTENT="Problem Report,Lab Problem,Problem">
A CGI (Common Gateway Interface) script could be used as a search
engine, but if it had to search through all of the HTML files on a
Web site to find the above type of META tags each time someone did
a search, it would probably run too slowly. We may not need blazing
speed, but we do want results before the user gives up on our search
engine. Greater speed could be obtained by collecting the data from
these META tags and saving it all in a single file, which could then
be more quickly searched by a CGI script. The following example shows
the format that was used for this text file:
/java.html|Java Information#Java#Java Information#CS 310#CS310#
/jobs.html|Computing Career Links#Career Links#Career#Careers#
/itwd/milestones.html|Milestones for Grant Project#Milestones#ITWD#
/carlsond/cs321/web/javascript.html|Notes on JavaScript#JavaScript#
Each line contains the information about a particular Web page. The
part before the pipe symbol (|) is the URL, minus the invariant
leading section, which was http://cis.stvincent.edu in my case.
Between the "|" and the first "#"
symbol is the description of the Web page, and between each neighboring
pair of "#" symbols, there is a keyword or phrase.
This format allows us to later find a particular keyword or phrase
and to distinguish it from the description and URL.
Thus, I hatched a plan to periodically run a bash script to collect
the data and place it in a text file. To do a search, users can
access an HTML form where they can fill in their keyword(s). Clicking
on the submit button on the form will send the desired keyword(s)
to a second bash script, the CGI search engine. This second script
was later replaced by a compiled program for better security and
speed. For the moment, however, we will look at the scripts because
they show what processing needs to be done.
The Shell Scripts
Harvesting the Data
The getmeta script (Listing 1) gathers the data and saves
it in a file called "keywordfile". This script can automatically
run every night so that as users add Web pages and adjust the META
tag information, the file is kept up to date. (Listings for this
article are available from: www.sysadminmag.com.)
You must decide which Web files to include in the scope of the
search engine. The getmeta script uses the find command
to scan all files with names of the form *.html that are
in the directory whose name is in the TARGET variable. By default,
TARGET contains /www. The script also scans all Web files
in the directory trees that begin with any of the subdirectories
named in the SUBDIRS variable. On this system, I only wanted to
descend into the directory trees for a few particular users or projects.
This section should obviously be adjusted to suit each particular
Web site.
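As a rough sketch, the file-gathering step might look something like the following. The variable names match those described here, but the exact find options in the real getmeta script may differ; in particular, limiting the TARGET scan to the top level with -maxdepth is an assumption based on the description above:
TARGET="/www"
SUBDIRS="/www/itwd /www/carlsond"
TMP="/tmp/getmeta.$$"
# Top-level Web files in TARGET only; do not recurse from there:
find $TARGET -maxdepth 1 -name "*.html" > $TMP
# Then descend fully into each selected subdirectory tree:
for dir in $SUBDIRS
do
    find $dir -name "*.html" >> $TMP
done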
The getmeta script writes the names of the specified Web
files to a temporary text file, whose name is stored in the variable
"TMP". A file is used because the amount of data might
be fairly large if many Web files are processed. The filenames are
then read, one at a time, from this temporary file using a loop
that has its input redirected to come from the file. The rough outline
of this loop is as follows:
while read filename
do
Process the filename as desired
done < $TMP
This is a pattern with many useful applications. The script processes
each filename by first checking whether users have read access, as
there is no sense in scanning a file that users cannot read. The cut
command is used to look for the "r" permission in the correct
column of a long listing for the file. The column number might need
to be adjusted to fit with the long listing format on your system.
Next, the script uses grep to get the lines of each file
that contain a tag starting with the string "<meta ".
The output is piped to another grep, which keeps only those
lines that also contain ="keywords". This output
is placed in the KEYS variable. If the data in this variable
is nonzero in size, a similar grep is used to extract the
description string. The data in KEYS is then piped into a
cut that extracts the third field where "=" is
used as the delimiter. This skips over the NAME= and CONTENT=
part of the line, giving the keywords section that follows the second
"=". Then the translate command (tr) is
used to replace the commas by # symbols and to delete the
"> that ends the meta tag line. The # symbols
are used to give us the file format already described above for
keywordfile. The description data held in the DESCRIP variable
is then refined in a similar way.
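Condensed into a sketch, the per-file processing might look something like this. The column position and field numbers are assumptions drawn from the description above and may need adjusting for your system:
# World-read permission sits in column 8 of a long listing here;
# adjust if your ls format differs:
PERM=`ls -l "$filename" | cut -c8`
if [ "$PERM" = "r" ]
then
    KEYS=`grep -i "<meta " "$filename" | grep -i '="keywords"'`
    if [ -n "$KEYS" ]
    then
        DESCRIP=`grep -i "<meta " "$filename" | grep -i '="description"'`
        # Keep everything after the second "=", turn commas into #,
        # and delete the quote and > characters that close the tag:
        KEYS=`echo "$KEYS" | cut -d '=' -f 3 | tr ',' '#' | tr -d '">'`
        DESCRIP=`echo "$DESCRIP" | cut -d '=' -f 3 | tr -d '">'`
    fi
fi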
Following this, cut is used to skip over the first few
characters of the filename. When using the default settings, where
all HTML files are under the /www directory, this amounts
to skipping over the initial /www and thus keeping the characters
in column 5 onward. We do not want the /www as it will not
be part of the final URL for this file. The line starting with the
echo command is then used to send the modified filename,
description, and keywords to another temporary file. Note that the
"|" symbol is inserted to separate the filename
from the description and that the keywords are surrounded by #
symbols.
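Put together, the output step might be sketched roughly as follows. TMP2 is a hypothetical name for the second temporary file, and START comes from the configuration section discussed later:
# Drop the leading /www (characters before column $START) so that only
# the part that belongs in the final URL is kept:
SHORTNAME=`echo "$filename" | cut -c$START-`
# Write one line in the keywordfile format, stripping any carriage returns:
echo "$SHORTNAME|$DESCRIP#$KEYS#" | tr -d '\015' >> $TMP2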
The tr -d '\015' may not be needed on some servers. It
removes carriage returns that got inserted when users edited their
Web files from Windows-based machines. This can happen on servers
that run the Samba software (http://www.samba.org) that allows
a UNIX or Linux machine to imitate an NT server. By removing the
carriage returns, we get a proper UNIX text file. At the very end,
the getmeta script copies the newly harvested data over the
top of any existing keywordfile and removes its temporary files.
Matchmaker, Matchmaker
The search program began as a bash script (Listing 2). It is a
CGI script, so your Web server must be configured to allow CGI scripts
to run. This script is the program that receives the data from the
Web-based form when the submit button is clicked. See Listing 3
for the search.html file. It contains the form with three
text boxes that the user can fill in so as to specify up to three
keywords or phrases for which to search. Actually, the data submitted
from the form is sent (as one long URL-encoded string) to the uncgi
program, which parses the data and places it into environment variables
that start with the characters "WWW_". You can
see that uncgi is specified in the following line of the
search.html file:
<FORM METHOD="POST" ACTION="../cgi-bin/uncgi/search">
The CGI script can then access the data in these variables, here named
WWW_Key1, WWW_Key2, and WWW_Key3. The uncgi
program, normally installed in the cgi-bin directory, thus breaks
apart a URL-encoded data string such as the following:
Key1=Java&Key2=Java+script&Key3=VB+script
This is the type of data string sent from the user's Web browser.
It would be awkward to handle this string directly in the script.
Instead, the ACTION in the above FORM tag specifies that the string is
sent to the uncgi program, which conveniently places the data
into three separate environment variables that the CGI script can
then easily access. It is as if we did the following assignments:
WWW_Key1="Java"
WWW_Key2="Java script"
WWW_Key3="VB script"
The uncgi program is available at a number of Web sites (e.g.,
http://www.prw.net/support/cgi/uncgi.htm).
The first section of the search script sets up some variables
and increments a counter to show that the search engine has been
accessed one more time. The count itself is kept in a text file
that can be examined whenever you want to know how much your search
engine has been used. The count could also be displayed on the Web
page showing the results of each search, if desired.
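The counter update itself takes only a few lines; a sketch, assuming the count lives in the search.count file described later, might be:
COUNTFILE="search.count"
# Read the current count, add one, and write it back:
COUNT=`cat $COUNTFILE`
COUNT=`expr $COUNT + 1`
echo $COUNT > $COUNTFILE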
Next, the script figures out which of the WWW variables contain
nothing (zero bytes) with the -z test. The values in the
variables are copied as needed so that all three variables contain
a value, where duplicates are used to fill in for an empty value.
For example, if the user enters C++ in a text box and leaves the
other two boxes blank, the script copies C++ into the variables
corresponding to these two other boxes.
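One way to do this filling-in, shown here only as a sketch, is a series of -z tests:
# Fill empty keyword variables with a copy of one that was supplied,
# so that all three greps used below have something to match:
if [ -z "$WWW_Key1" ]
then
    WWW_Key1="$WWW_Key2"
fi
if [ -z "$WWW_Key1" ]
then
    WWW_Key1="$WWW_Key3"
fi
if [ -z "$WWW_Key2" ]
then
    WWW_Key2="$WWW_Key1"
fi
if [ -z "$WWW_Key3" ]
then
    WWW_Key3="$WWW_Key1"
fi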
The grep -i command is then used to do a case-insensitive
search in keywordfile for the keyword in the first WWW variable.
The # symbols are used to be sure that any match is for a
keyword and not a word that simply appears in a description or URL.
The results of this grep are piped into a second grep
that looks for the keyword in the second WWW variable. The output
of this last command is piped into a third grep that looks
for the keyword given by the last WWW variable. Thus, we get just
those lines of the keywordfile that contain all three keywords in
the keyword section of the line. This data is written to a temporary
file.
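A sketch of the matching step follows. It assumes each keyword is stored in the file surrounded by # symbols on both sides, which gives the exact-match behavior described here; KEYFILE and TMP are assumed variable names for keywordfile and the temporary file of matches:
# Each search term must appear between # symbols, so it can only match
# the keyword section of a line, never the URL or description:
grep -i "#$WWW_Key1#" $KEYFILE |
    grep -i "#$WWW_Key2#" |
    grep -i "#$WWW_Key3#" > $TMP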
The search script now outputs what needs to be sent to the user's
Web browser to display a page about any matches that were found.
We first print out the string Content-type: text/html, followed
by a blank line. Then we output the information about matches, marked
up with appropriate HTML tags. To make things easier, the initial
lines of HTML are copied from the file named by the HEAD
variable.
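A sketch of that output step:
# Every CGI response must begin with a Content-type header and a blank line:
echo "Content-type: text/html"
echo ""
# Send the canned top portion of the results page:
cat $HEAD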
The -z test is used to see if there was no keyword for
which to search. If so, an error message is displayed on the user's
Web page. Next, we reuse our favorite loop pattern, with input redirected
to come from the temporary file of matches:
while read item
do
Process item as need be
done < $TMP
Note that the number of lines (matches) is counted in the MATCHES
variable. The filename is extracted from each item (line) by using
cut -d "|" -f 1, which obtains the first field that is delimited
by the | symbol. The description field is extracted in a similar
way, though two uses of cut are needed: one to extract the second
field delimited by |, and the other to get the first field
delimited by #.
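Within the loop, the extraction might be sketched as follows (MATCHES is assumed to start at 0 before the loop):
MATCHES=`expr $MATCHES + 1`
# Field 1, delimited by |, is the filename portion of the URL:
FILENAME=`echo "$item" | cut -d "|" -f 1`
# Field 2 starts with the description, which runs up to the first #:
DESCRIP=`echo "$item" | cut -d "|" -f 2 | cut -d "#" -f 1`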
The data for each match is then written out as a list item in
an ordered list. The URL portion is written out as a clickable link,
with TOPURL preceding the filename so as to give a complete
URL. The TOPURL variable contains the prefix that must precede all
Web filenames on the system ("http://cis.stvincent.edu" on our server).
Note that if Web files on your system are scattered under users'
home directories, then this will not work. The description is written
out immediately after the URL. The Web server automatically sends
this output to the user's browser.
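The list item for each match might then be written with something like the following sketch; the surrounding <OL> and </OL> tags would go before and after the loop:
# One numbered-list entry: a clickable link built from TOPURL plus the
# filename, followed immediately by the description:
echo "<LI><A HREF=\"$TOPURL$FILENAME\">$TOPURL$FILENAME</A> $DESCRIP"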
Finally, the search script checks the number of matches to see
whether it was zero so that a message can be printed for that special
case. Some closing HTML is written out, and the script is done.
Security Concerns
Although the above search script worked fine, there were a couple
of concerns. CGI scripts are often susceptible to hacking attempts,
such as supplying bad input like the following:
C++ | cat /etc/passwd
A hacker could submit this via the form in search.html with
the hope that the | symbol would cause the script to execute
the command to cat out the password file. This would show the
login IDs for all users, allowing the hacker to try a dictionary attack
on user passwords. Although this type of attack did not seem to work
on our system, it might be possible for someone to find an attack
that worked. Information on setting up the Apache Web server to reduce
security problems can be found at the Apache Web site (http://www.apache.org).
Information about CGI security problems can be found in "Safer
CGI Scripting" by Charles Walker and Larry Bennett, Sys Admin,
February 2001.
Another concern is that a script is interpreted and runs more
slowly than a compiled program. Because our search engine was not
being run often and the response time was brief, this was not a
big concern. Still, usage might increase in the future, so it could
help to have a compiled program as the search engine. Because it
is also easier with a compiled program to weed out bad input that
might indicate an attack, I decided to switch to a compiled C++
program.
Using a Compiled CGI Program
The search.cpp program performs the same overall task as
the search CGI script. However, the GetValue function, found
in stringhelp.cpp, does some additional processing to reject
bad input. This function gets the value of a WWW environment variable.
The function is careful not to overflow the Result array
when copying characters from the environment variable. It also only
copies alphanumeric characters, the + sign, the -
sign, the period character, the space character, and the NULL character
that marks the end of a string. At the first sign of any other character
(such as |, or other hacker favorites), the function quits
and returns the empty string in Result. You can adjust the
code in GetValue if you want to allow additional characters,
but be careful what you allow. The rest of the C++ program will
not be examined here as it does the same processing as the old search
script.
Configuration and Installation
QSearch can be downloaded from my Web site:
http://cis.stvincent.edu/carlsond/software/software.html
A Readme file is included to explain configuration and installation
issues. The main steps are covered here.
What do you need in order to use this software? It may be a reasonable
choice if you do not have a huge number of Web pages and the search
engine will not be heavily used. You need to have the g++
compiler to compile the C++ program. You also need to have the uncgi
program installed, and the Web server must be configured to run
CGI programs. The software assumes that all Web pages to be searched
are located under a common directory, such as the default /www
directory. The META tags containing the keywords and descriptions
must fit the format mentioned at the start of this article.
Edit the getmeta script to adjust the following four lines
for your situation:
TARGET="/www"
START=5
SUBDIRS="/www/itwd /www/carlsond /www/carrc /www/morrisoh /www/hicksb"
KEYFILE="/www/cgi-bin/keywordfile"
Note that TARGET should indicate the directory under which
all of your Web files are located. The START number must be
one more than the number of characters in this TARGET string.
SUBDIRS should hold a string containing any subdirectories
of the TARGET directory into which you want to descend to find
Web files. Finally, KEYFILE should give the location for the
keywordfile that getmeta generates, a location within the directory
for CGI programs.
Edit the search.cpp file and adjust the following lines
as needed:
#define TOPURL "http://cis.stvincent.edu"
#define HEAD "head.html"
#define KEYFILE "keywordfile"
#define COUNTFILE "search.count"
The first line should give the initial common part of all URLs on
your Web site. The second gives the name of the HTML file that is
to be displayed as the top part of the search-results Web page. A
sample head.html file is supplied. The third and fourth lines
probably do not have to be changed, although it is important that
KEYFILE contain the exact name for the text file created by the getmeta
script. The only other item that you might want to modify is the following
line of the file stringhelp.h:
const int StrMax = 800;
This should be set to a reasonable maximum string length. Remember
that each line of META tag data is stored in one of these strings.
If your Web pages contain META tag lines with long descriptions
or long lists of keywords, it is possible that you will need a longer
StrMax. Now start the compiler with:
g++ search.cpp stringhelp.cpp -o search -s
This should produce an executable search program. Then log in as root
and change the permissions and ownership of the getmeta script
as follows:
chmod 700 getmeta
chown root.root getmeta
Use root's crontab entry to schedule getmeta to
be run once a day:
crontab -e
The crontab entry might look like the following if you want
to run getmeta at 5:45 a.m. every day. Adjust the path to getmeta
as needed:
45 5 * * * /usr/local/bin/getmeta
Move head.html and your compiled search program to the directory
for your CGI programs. Change the permissions and ownership as follows:
chown root.root head.html search
chmod 755 search
chmod 644 head.html
In this same CGI directory, create a file named search.count
and place the number 0 into it. This file will contain the count of
how many times the search engine has been used. You can create this
file as follows:
echo 0 > search.count
chown root.root search.count
chmod 666 search.count
You may be able to be more restrictive here. For example, if your
Web server runs as user webmaster, change the last two commands to:
chown webmaster.users search.count
chmod 644 search.count
Then ordinary users cannot change the file, although they can read
it.
Place the search.html file in any directory of Web pages
you wish. You can edit this file to add graphics or other enhancements.
Adjust ownership and permissions as follows:
chown root.root search.html
chmod 644 search.html
Add a link to search.html in those Web pages where you wish
to provide access to the search engine.
You are now ready to take the search engine for a test drive.
Run getmeta once manually as root to produce a keywordfile.
This can be done by entering ./getmeta while in the directory
that contains this script. Then use a Web browser to look at the
search.html file and try out the search engine.
Possible Improvements
One enhancement is to allow spaces after the commas separating
keywords in the META tags. This was not done here in order to keep
the programming simple. It might also be desirable to allow more
general types of searches than just exact matches for keywords,
although that would also complicate the programming. The getmeta
script could probably be written more compactly in Perl. You could
even write your own search daemon if you know sockets programming.
However, such a complex programming project would take us away from
the original goal of having a simple, easy-to-produce search engine.
Refer to:
http://cis.stvincent.edu/carlsond/cs330/unix/unix.html
for my UNIX Web page with links to UNIX and Linux information, including
tables and examples on writing shell scripts. Refer to:
http://cis.stvincent.edu/carlsond/swdesign/
for my Web pages on C++ programming, which might be helpful in modifying
the search program.
Br. David Carlson is a Benedictine monk as well as chairperson
and associate professor in the Computing & Information Science
Department at Saint Vincent College. When his primary jobs allow,
he can often be found doing systems administration on his department's
Linux server. He can be reached at: carlsond@stvincent.edu.