Article

Scripting and HTML

Mike Murphy

Programming for the World Wide Web using the Common Gateway Interface allows great flexibility in choosing a scripting language. As basic requirements, the programming language must be able to print to standard I/O streams and both access and manipulate environment variables. The CGI program simply outputs content to the standard output, starting with a header that describes the content. The rest of the output is handled based on its type. From that point, the Web server hosting the CGI sends the resulting content back to the user's browser. The most common use for CGI is returning HTML content, which can be dynamic based on current conditions on the Web server, from input by the user, or a combination both.

Traditionally, a CGI program must explicitly output its own HTML content using language-specific functions like echo, print, or printf. Today, many server-side technologies allow a Web programming or macro language to be embedded directly inside an HTML file. One such technology gaining rapid momentum is PHP. Java Server Pages is another, more commercial, example. These technologies each possess features that go far beyond what is discussed here, but embedded scripting can also be a nice feature for CGI programming.

With the embedded model, HTML data can be changed quickly without consideration for how it will be output to the browser. Also, portions of HTML from other collaborators can simply be included in the program without additional code changes.

Nearly any UNIX scripting language can be embedded directly into HTML without using the <embed> HTML tag, which requires special plug-ins. However, not every scripting language is ideal for programming the CGI. This is because in the past, parsing form data in some UNIX languages was a chore. For example, assume you want to parse data taken from an HTML form and use that data for some UNIX program that monitors system performance. Perl may be the best language for decoding the form data, but you may want to use an ordinary UNIX shell for organizing the rest of the program. Using the method described in this article, form data is immediately accessible to the embedded CGI script in the form of environment variables (assuming the form uses GET or POST action methods). This means the right scripting language can be chosen on its intrinsic merits, without resorting to outside libraries in order to access the user's form data.

To accomplish all of this with Apache, I will first describe what changes need to be made to the Web server's configuration files. I will then discuss the filter that will be used behind the scenes to allow embedding, and provide a simple example of the same CGI program written in embedded Korn shell, Perl, and Tcl. The example configuration will allow support for all three languages and establish a template for adding other language support to your Web server. Various packages are available both publicly and commercially that support embedded scripting, usually for a particular language or macro set. While I do not presume to replace such tools here, my goal is to show how remarkably flexible the Apache server is, especially with the rich set of modules available with the standard distribution.

There are some limits to what our inline scripting configuration will allow. Compiled languages (like C or C++), while popular for CGI, do not fit this particular paradigm. To make our inline scripting possible, two small requirements must be met: the scripting language must be interpreted at run-time, and the scripting language must support “here documents”.

The first requirement means that the UNIX system on which your Web server runs must understand #!. This notation is used by the UNIX kernel to determine whether an executable program is a script or a compiled binary format. When an execution request is made, the UNIX kernel looks for #! in the first two bytes of the file, and then expects the path of program used to interpret the rest of the script to follow. You have probably seen #!/usr/bin/perl, #!/usr/bin/csh, or #!/bin/sh at the top of UNIX scripts. Any scripting language whose interpreter can be invoked this way meets the first requirement.

The second requirement does not limit the choice of scripting languages much either. So-called “here documents” are really a syntax allowing chunks of text to be redirected as standard input to a line of code. A here document generally has the form:

cat <<EOF
<HTML> <HEAD> ....
EOF

This example redirects everything before the last “EOF” as standard input to the UNIX cat command. This is a common example for CGI programming. Our modification to the standard Apache installation will use here documents to automatically pre-process HTML files containing embedded scripts, providing the correct output while not showing the intermediate product -- a more or less traditional CGI program with here documents wrapping the HTML. This version is executed on the server after being generated by a relatively simple filter written in Perl, which I will discuss in a moment.

All of the Apache modules required to support this activity are included in the standard Apache configuration. A configuration script is run during the installation of Apache to configure the makefile and select, among other things, which Apache modules will be included in the build. Unless you commented these modules out of the configuration file, you should have access to use mod_actions, mod_env, and mod_mime. If you do not have these modules in your build, you will need to recompile Apache with them uncommented in the configuration file to make these examples work. If you performed the standard installation, you should be able to configure Apache to use this embedded scripting method.

Setting Up Embedded Scripting

We will begin by editing the httpd.conf (Listing 1) file in the Apache configuration directory. Technically, you could also add these optional directives to your srm.conf file where you might want to isolate local changes. I chose to place them in my httpd.conf, because it is now a consolidation of access.conf, srm.conf, and the original httpd.conf. As always, keep a pristine copy of the configuration files saved under a different name.

We need to add a MIME type for each scripting language we intend to support. Since only one inline scripting language can be supported within each HTML file, we must determine which scripting language is going to be used by giving the actual file a particular extension. To start, let's add MIME-type extensions for Korn shell, Perl, and Tcl, and associate these file types with a particular extension:

AddType        text/inline-ksh         .ksh
AddType        text/inline-perl        .prl
AddType        text/inline-tcl         .tcl

For our purposes, there is no client-side configuration for these custom MIME types; it will only be visible on the server side. All content returned to the client will be text/html.

Now that the Web server knows something about the content of files with these particular extensions, we need to tell it what to do with them. The Apache module mod_actions allows the execution of a program based on MIME extensions, or request method types. We tell Apache that we want to execute a filter whenever a file with the supported scripting languages is requested from the server. Since we are embedding in HTML, we need to think of documents as ordinary HTML files, not as programs in /cgi-bin. Apache must be told what action to take when a particular document type is requested:

Action          text/inline-ksh         /cgi-bin/filter.pl
Action          text/inline-perl        /cgi-bin/filter.pl
Action          text/inline-tcl         /cgi-bin/filter.pl

Now Apache understands what to do with these files when they are requested -- run the filter program. But how does the filter program know what kind of language is embedded in the HTML? I will discuss the filter's logic in a moment, but first, we need to add another set of directives for the Apache module, called mod_env. This module will allow us to pass environment variables directly to a CGI program from the Web server. The following information is required:

SetEnv KSH_INTERPRETER /usr/bin/ksh
SetEnv KSH_HEREDOC_OPEN "cat <<EOF"
SetEnv KSH_HEREDOC_CLOSE "EOF"

SetEnv PRL_INTERPRETER /usr/bin/perl
SetEnv PRL_HEREDOC_OPEN "print <<EOF;"
SetEnv PRL_HEREDOC_CLOSE "EOF"

SetEnv TCL_INTERPRETER /usr/bin/tcl
SetEnv TCL_HEREDOC_OPEN "set EOF {"
SetEnv TCL_HEREDOC_CLOSE "} ; puts $EOF"

For each language we want to embed, the corresponding environment variables will be used by our filter to output a working version of the embedded script and execute it. The variable names are fairly self-explanatory:

_INTERPRETER variables define the path of the scripts interpreter
_HEREDOC_OPEN variables contain valid instructions to open a here document in that particular scripting language
_HEREDOC_CLOSE variable prints a valid end of input marker for a here document, again, in the particular syntax of the language in question. This, at least, only has to be done once (as opposed to once for each section of HTML as in a traditional CGI program).

Note that the prefix to each of these environment variable names is an uppercase version of the MIME-type extension (e.g., .ksh variables all start with KSH). This is not arbitrary. The filter that pre-processes the embedded scripts needs to check the CGI environment variable $PATH_INFO to get the name of original document requested from the server. If a user requests a document named guestbook.prl from our Web server, the filter will find the .prl extension and use that information to look up the variables that start with PRL_. The Apache module mod_env will pass these variables at run-time based on the SetEnv directives in our configuration file. With these general guidelines, you should be able to add support for any scripting language that you would like to embed in HTML.

We also need a place to keep the CGI programs output by the filter. Add the following directive (of course, local paths may vary with preference):

SetEnv CACHE_PATH /usr/local/apache/cache

This directory should already exist or be created when the new httpd.conf file is used to start Apache. The Web server user should have read, write, and execute bits set on the directory. This does not need to be a CGI directory according to httpd.conf as long as the permissions are sensible, but that is a matter of preference.

The Filter

We are now ready to look at the source code for the filter program itself. (see Listing 2). The filter first outputs the content type header for HTML. Set the per-filehandle special variable to a true value (e.g., $|=1 ), so that the output will not be buffered. Although you might expect the stdout buffer to flush when a new line is printed, that may not be the case in the Web server environment where the output stream is ultimately accessed by several processes. Without setting $| to a true value, you may not see the content type header printed until after the output from the cached script is already printed to the browser! This results in the same output a user would see if they chose “view source” after a CGI program had already been run, or if a text/plain content type had been sent to the browser instead.

Next, the subroutine SetScriptLang populates variables for the use of the filter. Some of these variables are associated only with the MIME type of the document requested, according to the directives for mod_env placed in the httpd.conf file. This subroutine prepends an uppercase version of the MIME type to some environment variable names needed for filtering. For example, if they request example.prl, SetScriptLang will look for the environment variables PRL_HEREDOC_OPEN, PRL_INTERPRETER, and PRL_HEREDOC_CLOSE. If they requested example.ksh, it would look for KSH_HEREDOC_OPEN, etc. This allows each scripting language to be configured with its own appropriate syntax for wrapping the HTML tags with here documents.

This subroutine also looks for the location of the script cache directory in the variable CACHE_DIR, which is where the filter creates and executes CGI programs after it pre-processes them. This directory should only be accessible by the user id of the Web server (see User directive in httpd.conf). The next line in the program checks to see whether the cached script is up to date with the requested document. If the requested document has not changed more recently that the cached script, the cached version is executed to save time. If not, a new version of the script is created by the sub UpdateCache, and then the output script will be executed and remain in the cache. The line of code that checks the modification times on the two files requires the use File::stat include. Otherwise, we would have to replace the fairly legible ((stat($Script)->mtime) with something like (stat($script))[9], which uses a slice to take the ninth element returned from stat -- the modification time.

If UpdateCache is run either because the cached script does not exist or because it is out of date, it first attempts to open the embedded document. It then attempts to open the script cache file for writing. If the file exists, it will be replaced with a zero byte file when it is opened for writing. If the file used to write the script to the cache directory cannot be opened, the subroutine calls die with a fairly explanatory message. This output to STDERR will also be captured in Apache's ./logs/error_log file, which is useful for troubleshooting. If the file is opened successfully, the subroutine immediatly performs an exclusive lock on the file with the flock command. In a Web server environment, one request can be writing the script while another request is attempting to run it prematurely. By locking the file, we can ensure that other processes trying to use the file will have to wait until the lock is released at the end of the subroutine.

Because the user document may or may not begin with HTML tags, the program assumes that it does and outputs the first here document header. The next step is set the $state scalar variable to a value of html, which means “assume we are inside of an HTML section, not a code section”. The first regular expression, s///g, strips HTML comments out of $_, which in this context represents the current line in the requested document. The program then continues to scan through the file in the while loop, looking for the first code tag. If found, it is replaced in the current line with the text that will close our HTML section. It then sets state to “script” and does something very similar until it sees the ending code tag. Code tags are defined as <? followed by one or more whitespace characters (\s+), and ?>, which is preceded by one or more whitespace characters. The final regular expression s/(<\?\s+)|(\s+\?>)//g simply strips any out-of-order tags from the current line. For example, if the user-requested document had tags like <? <? # just a comment ?>, it would output <? # just a comment ?> (ignoring other expressions for clarity). At the of the loop, the currently modified line is output to the cache script and the process continues until all of the originally requested document has been scanned. Then the HTML here document is closed if necessary, the cached script is unlocked, and the files that were opened are closed.

The next subroutine to be called is ExportFormData. This essentially uses the strength of Perl to do the work of decoding a URL that contains data from a form, whereas the scripting language embedded in a particular HTML file may not be as amenable to the task. This may make using your favorite language for CGI programming (if it's not Perl) much easier. When your embedded script runs, it can access the form data elements directly though an environment variable. For example, if your form data is sent with a URL of ?name=John+Doe, your program will find the environment variable “name” already holding the value “John Doe”. Perl has the very handy %ENV hash, which can be used to modify the current environment, and the system() command is used to export this data to the process environment that your embedded script will inherit. This works for both GET and POST standard form action methods.

Once all of this is finished, we simply add execute permission for the Web server user and run the cached version of the CGI program. The program then prints to STDOUT, which is returned to the user's browser as valid HTML output (assuming, of course, that the CGI program works correctly). You may wish to add debugging flags to your interpreter lines in httpd.conf, or run the cached version of the script directly to monitor the output.

Embedded Script Examples

Using this method of embedding scripts, each of the following examples should produce a “Hello, World!” message that grows in size when displayed in a browser. These documents would reside in the htdocs directory (or anyplace configured as a document directory in httpd.conf). (See Listings 3-5.)

While these examples are clearly trivial, they all output valid content to the browser as expected. It is important to remember that the filenames must contain the appropriate extensions to identify which language is embedded within them. Had these examples been called as the action clause of a form tag in HTML, they would all be able to access the form data through environment variables -- of course, with each language using its own supported method to access them.

In summary, this is merely one way to embed virtually any scripting language inside of HTML files, using the Apache Web server. There are several issues such as FastCGI support that have not been addressed within this article.

About the Author

Mike Murphy works as a consultant for Ciber, Inc. in Rochester, New York. He lives with his wife, Heather, and 18-month-old daughter, Madison. Mike can be reached at: murphy_mikej@ciber.com. A swift reply is not assured, as Madison does not always like to share her computer.