Cover V05, I04
Listing 1
Listing 2
Listing 3
Listing 4
Listing 5
Table 1
Table 2
Table 3


Shell Scripting and the WWW: Creating a Searchable Web Database

Tom Tessier

Common Gateway Interface (CGI) scripts are one of the main techniques used to increase interactivity on web pages. Normal HTML documents retrieved from WWW servers are static, in contrast to CGI scripts that allow web output to be generated in real-time. Whether you're emailing forms, creating interactive trivia games, implementing searchable databases, or setting up web animations, CGI is the way to go.

The CGI defines how custom scripts must interact with web server software. This interface is implementable in any language that can read environment variables and direct strings to console output. Because of its powerful environment and string parsing commands, shell scripting is in fact a good choice for the creation of CGI code.

The scripts must be stored in a special directory so that the server knows the files are executable. With NCSA's freeware web server distribution for example, the subdirectory is called cgi-bin (within the httpd directory).

Creating Forms

Listing 1 illustrates how the HTML code is set up for a typical CGI-served web page. By placing method="get" ACTION="URL/cgi-bin/" within the <FORM> tag, one is stating the instructions the server must follow when the Submit button is pressed. In this case, when the form is submitted, the contents of the two inputs (FIRST and LASTNAME) will be transferred to the script via the environment variable QUERY_STRING. See Table 1 for a description of the different methods available for form submission.

The rest of the code in Listing 1 is standard HTML: <BR> is used to cause line breaks, <HTML><BODY> initiate the web session, </BODY></HTML> close it. The default size for the form inputs is 20 characters, so by allocating 50 characters for the LASTNAME entry, you are ensuring that the user will have enough space to read their full last name as it is entered. If a MAXLENGTH=integer tag is included after the SIZE, then only MAXLENGTH characters are allowed. If not specified, an unlimited amount of text may be entered, with the input field scrolling as the display is filled. Modifiers available for the <INPUT> tag are detailed in Table 2.

The contents of the variable QUERY_STRING (passed to theCGI script when the form is submitted) may appear quite cryptic at first. For example, if the FIRST entry was blank and the LASTNAME field contained "Joe Smith," would have the following arguments passed to it: FIRST=&LASTNAME=Joe+Smith%13, where %13 is a carriage return added by the web browser. Notice also that space characters are not allowed (hence, the web browser converts the space between "Joe" and "Smith" to a "+"). The formatting of the QUERY_STRING environment variable is summarized in Table 3.

CGI scripts may also be called directly by use of an HREF tag. For example, <A HREF="URL/cgi-bin/"> Click me</A> would call the script when "Click me" is pressed. All of the arguments after the question mark are transferred to the script via QUERY_STRING.

Environment parsing

Once the CGI script itself has been executed, it must properly read and parse the QUERY_STRING environment variable. The awk shell command proves to be very useful in this regard. awk, which is really just a pattern-directed scanning and processing tool, is used to break QUERY_STRING into a collection of smaller, easier to handle variables. Listing 2 shows how this is done.

First of all, security breaches are weeded out by the use of sed. Specifically, the ";" and "'"characters are converted to ESC. This prevents someone from entering a command in the form input, such as "; touch newfile," which would be converted to a harmless escaped input as in "\ESC touch newfile." The BEGIN { RS="&"; FS="=" } statement within the awk indicates to use "&" as the input record separator, while "=" is used as the input field separator (instead of the default space character). Hence the QUERY_STRING is broken up into records based upon the location of "&" identifiers. For example, a QUERY_STRING of INITIAL=j&LASTNAME=wilson will give two records: INITIAL j and LASTNAME wilson.

The printf statement prints the arguments, which are themselves evaluated as commands because of the eval indicator surrounding the whole expression. Using the QUERY_STRING given above, the printf would create two environment variables as follows: QS_INITIAL='J' and QS_LASTNAME='WILSON'. Notice in Listing 2 that the toupper($2) statement converts the second argument to uppercase. This allows later comparisons to contain just uppercase characters in the match fields, saving space.

The $1 ~ /^[a-zA-Z][a-zA-Z0-9_]*$/ line in Listing 2 ensures that the names of the form inputs have been set properly for the creation of environment variables. Specifically, an environment variable name is only allowed to start with alphabetic characters (hence, the ^[a-zA-Z]) but may end with any alphabetic and numeric characters (the [a-zA-Z0-9_]*$ statement).

Because of the carriage return added to QUERY_STRING by the browser, it is necessary to use sed to delete the %13 at the end of the last argument. This is illustrated by the line QS_LASTNAME=`echo $QS+LASTNAME | sed /%13/d` in Listing 2 (it assumes that the script was called by Listing 1).

Implementing a Database Search Engine

Once the environment variables have been set to the form inputs, standard shell scripting commands may be applied. Keep in mind that all output appearing after each executed echo statement will be displayed on the web page. Listing 3 and Listing 4 demonstrate a simple CGI database search utility that scans a datafile for employee email addresses.

A few things to note about the implementation: wildcard matching via a "*" in the LASTNAME input field is available. For example, if the user enters joe* into the LASTNAME input, everything before the star must be at the beginning of the database last name entries in order for a match to occur. Thus, database matches such as JOEMOE and JOEBLOW will work for joe*, but MOEJOEM will not. Similarly, if the user enters *joe, everything after the star must match the last characters in the database. MOEJOE and BLOWJOE will match, JOEBLOW will not. If the user only enters joe, any entry containing the characters JOE is displayed.

The determination of this wildcard matching is somewhat complex and deserves mentioning. The line tempvar=`echo $QS_LASTNAME | awk '{ printf "%c%s%c", 39, substr($0, 0, 1), 39 }' ` in Listing 4 strips the first character out of the QS_LASTNAME environment variable via the substr($0, 0, 1) command (cut out a substring of length 1 from the start of the string argument $0). If the first character is really a "*", the match is checked with a grep "$qstemp"'$', the dollar sign at the end meaning that a match will occur only if the string is at the end of the database field text.

The tempvar=`echo $QS_LASTNAME | awk '{ printf "%c%s%c", 39, substr($0, length, 1), 39 }' ` in Listing 4 strips out the last character in QS_LASTNAME via substr($0, length, 1), which indicates to cut a substring of length 1 from the end (hence, length) of the $0 string argument. So, if the last character of QS_LASTNAME is a "*", the match will be analyzed with grep '^'"$qstemp", with the caret sign at the beginning indicating a match is only to occur if the string is at the start of the database entry.

The format for the database text file is as follows: each field is separated by a "!" character. There are four fields: initial, last name, email address, and full name. Only the first two fields are used in the search process; the last two are used when a match is found. In the full name field, a "+" is used in place of the space character. Each entry is terminated by a carriage return (no Ctrl-Ms) or a space. A sample datafile is illustrated in Listing 5.

With the techniques presented here, you'll be creating CGI scripts for your systems in no time. Whether you use them to process form input, create searchable databases, or send email, CGI shell scripts are the way to go for easily implemented interactive pages.

About the Author

Tom Tessier is an Engineering Physics student at the University of Alberta, Canada. He can be reached at