Shell Scripting and the WWW: Creating a Searchable Web Database
Tom Tessier
Common Gateway Interface (CGI) scripts are one of the
main techniques
used to increase interactivity on web pages. Normal
HTML documents
retrieved from WWW servers are static, in contrast to
CGI scripts that
allow web output to be generated in real-time. Whether
you're emailing
forms, creating interactive trivia games, implementing
searchable
databases, or setting up web animations, CGI is the
way to go.
The CGI defines how custom scripts must interact with
web server
software. This interface is implementable in any language
that can read
environment variables and direct strings to console
output. Because of
its powerful environment and string parsing commands,
shell scripting is
in fact a good choice for the creation of CGI code.
The scripts must be stored in a special directory so
that the server
knows the files are executable. With NCSA's freeware
web server
distribution for example, the subdirectory is called
cgi-bin (within the
httpd directory).
Creating Forms
Listing 1 illustrates how the HTML code is set up for
a typical
CGI-served web page. By placing method="get"
ACTION="URL/cgi-bin/callme.sh" within the
<FORM> tag, one is stating the
instructions the server must follow when the Submit
button is pressed.
In this case, when the form is submitted, the contents
of the two inputs
(FIRST and LASTNAME) will be transferred to the script
callme.sh via the
environment variable QUERY_STRING. See Table 1 for a
description of the
different methods available for form submission.
The rest of the code in Listing 1 is standard HTML:
<BR> is used to
cause line breaks, <HTML><BODY> initiate
the web session, </BODY></HTML>
close it. The default size for the form inputs is 20
characters, so by
allocating 50 characters for the LASTNAME entry, you
are ensuring that
the user will have enough space to read their full last
name as it is
entered. If a MAXLENGTH=integer tag is included after
the SIZE, then
only MAXLENGTH characters are allowed. If not specified,
an unlimited
amount of text may be entered, with the input field
scrolling as the
display is filled. Modifiers available for the <INPUT>
tag are detailed
in Table 2.
The contents of the variable QUERY_STRING (passed to
theCGI script when
the form is submitted) may appear quite cryptic at first.
For example,
if the FIRST entry was blank and the LASTNAME field
contained "Joe
Smith," callme.sh would have the following arguments
passed to it:
FIRST=&LASTNAME=Joe+Smith%13, where %13 is a carriage
return added by
the web browser. Notice also that space characters are
not allowed
(hence, the web browser converts the space between "Joe"
and "Smith" to
a "+"). The formatting of the QUERY_STRING
environment variable is
summarized in Table 3.
CGI scripts may also be called directly by use of an
HREF tag. For
example, <A HREF="URL/cgi-bin/callme.sh?INPUT1=0">
Click me</A> would
call the script callme.sh when "Click me"
is pressed. All of the
arguments after the question mark are transferred to
the script via
QUERY_STRING.
Environment parsing
Once the CGI script itself has been executed, it must
properly read and
parse the QUERY_STRING environment variable. The awk
shell command
proves to be very useful in this regard. awk, which
is really just a
pattern-directed scanning and processing tool, is used
to break
QUERY_STRING into a collection of smaller, easier to
handle variables.
Listing 2 shows how this is done.
First of all, security breaches are weeded out by the
use of sed.
Specifically, the ";" and "'"characters
are converted to ESC. This
prevents someone from entering a command in the form
input, such as ";
touch newfile," which would be converted to a harmless
escaped input as
in "\ESC touch newfile." The BEGIN { RS="&";
FS="=" } statement within
the awk indicates to use "&" as the input
record separator, while "=" is
used as the input field separator (instead of the default
space
character). Hence the QUERY_STRING is broken up into
records based upon
the location of "&" identifiers. For example,
a QUERY_STRING of
INITIAL=j&LASTNAME=wilson will give two records:
INITIAL j and LASTNAME
wilson.
The printf statement prints the arguments, which are
themselves
evaluated as commands because of the eval indicator
surrounding the
whole expression. Using the QUERY_STRING given above,
the printf would
create two environment variables as follows: QS_INITIAL='J'
and
QS_LASTNAME='WILSON'. Notice in Listing 2 that the toupper($2)
statement
converts the second argument to uppercase. This allows
later comparisons
to contain just uppercase characters in the match fields,
saving space.
The $1 ~ /^[a-zA-Z][a-zA-Z0-9_]*$/ line in Listing 2
ensures that the
names of the form inputs have been set properly for
the creation of
environment variables. Specifically, an environment
variable name is
only allowed to start with alphabetic characters (hence,
the ^[a-zA-Z])
but may end with any alphabetic and numeric characters
(the
[a-zA-Z0-9_]*$ statement).
Because of the carriage return added to QUERY_STRING
by the browser, it
is necessary to use sed to delete the %13 at the end
of the last
argument. This is illustrated by the line QS_LASTNAME=`echo
$QS+LASTNAME
| sed /%13/d` in Listing 2 (it assumes that the script
was called by
Listing 1).
Implementing a Database Search Engine
Once the environment variables have been set to the
form inputs,
standard shell scripting commands may be applied. Keep
in mind that all
output appearing after each executed echo statement
will be displayed on
the web page. Listing 3 and Listing 4
demonstrate a simple
CGI database search
utility that scans a datafile for employee email addresses.
A few things to note about the implementation: wildcard
matching via a
"*" in the LASTNAME input field is available.
For example, if the user
enters joe* into the LASTNAME input, everything before
the star must be
at the beginning of the database last name entries in
order for a match
to occur. Thus, database matches such as JOEMOE and
JOEBLOW will work
for joe*, but MOEJOEM will not. Similarly, if the user
enters *joe,
everything after the star must match the last characters
in the
database. MOEJOE and BLOWJOE will match, JOEBLOW will
not. If the user
only enters joe, any entry containing the characters
JOE is displayed.
The determination of this wildcard matching is somewhat
complex and
deserves mentioning. The line tempvar=`echo $QS_LASTNAME
| awk '{ printf
"%c%s%c", 39, substr($0, 0, 1), 39 }' ` in
Listing 4 strips the first
character out of the QS_LASTNAME environment variable
via the substr($0,
0, 1) command (cut out a substring of length 1 from
the start of the
string argument $0). If the first character is really
a "*", the match
is checked with a grep "$qstemp"'$', the dollar
sign at the end meaning
that a match will occur only if the string is at the
end of the database
field text.
The tempvar=`echo $QS_LASTNAME | awk '{ printf "%c%s%c",
39, substr($0,
length, 1), 39 }' ` in Listing 4 strips out the last
character in
QS_LASTNAME via substr($0, length, 1), which indicates
to cut a
substring of length 1 from the end (hence, length) of
the $0 string
argument. So, if the last character of QS_LASTNAME is
a "*", the match
will be analyzed with grep '^'"$qstemp", with
the caret sign at the
beginning indicating a match is only to occur if the
string is at the
start of the database entry.
The format for the database text file is as follows:
each field is
separated by a "!" character. There are four
fields: initial, last name,
email address, and full name. Only the first two fields
are used in the
search process; the last two are used when a match is
found. In the full
name field, a "+" is used in place of the
space character. Each entry is
terminated by a carriage return (no Ctrl-Ms) or a space.
A sample
datafile is illustrated in Listing 5.
With the techniques presented here, you'll be creating
CGI scripts for
your systems in no time. Whether you use them to process
form input,
create searchable databases, or send email, CGI shell
scripts are the
way to go for easily implemented interactive pages.
About the Author
Tom Tessier is an Engineering Physics student at the
University of Alberta, Canada. He can be reached at
tessier@ee.ualberta.ca.
|