Building a Linux Web Server

Jonathan Feldman

Secure in reading email at our workstations and happy with our ability to squirt Internet mail through our IPX-to-IP gateway, our organization didn't put up a Web server for quite some time. But time marched on, the Olympic Games drew near, and the powers-that-be wanted to put descriptions of our Olympic training facilities on the Web - Savannah hosted the 1996 Olympic Yachting events, you know.

At first, we seriously considered Windows NT. It seemed easy to manage, robust, and it had the requisite "freeware" http (Hyper Text Transfer Protocol) service available (included with the NT resource kit and also available on the Internet). However, at the time, it seemed that Perl support for NT was not all there. Perl tends to be the language of choice for writing Common Gateway Interface (CGI) scripts, so this was of major concern. We also were not thrilled with NT's inclination to ask for a reboot every time you change a networking parameter.

We were already running production-level Linux, and we didn't have any production NT. Still, it was a toss-up between Linux and NT until we saw our budget for implementing this project: $0. Naturally, Linux once again provided our solution.

We implemented our Linux-based Web server starting with a pilot server, then moved the operation to a "cloned" Linux box when we were satisfied with the results. We have now been Webmeisters for more than 6 months and are very happy with how it has worked both in pilot and in production. Not only does this system serve the Internet at large, but it also serves "home base" documents for our public library system's "public Internet" terminals.

Background

So what does the term "Web server" mean? From a Unix/Linux standpoint, a Web server is a daemon that:

Listens at the Well-Known Service (WKS) socket for http (number 80) or a user-defined socket until a request comes in

Reads an http request from the socket

Forks and fulfills the request, either by spewing forth a binary or text stream on a negotiated socket (Figure 1), or by running an external program with supplied parameters.
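The three steps above can be sketched in a few lines of Python. This is only an illustration, not how the NCSA daemon is actually written. The names DOCS, build_response, and serve are hypothetical, the documents live in an in-memory table instead of on disk, port 8080 is an arbitrary unprivileged stand-in for port 80, and os.fork restricts the sketch to Unix-like systems. The "negotiated socket" corresponds here to the connected socket that accept() returns.

```python
import os
import signal
import socket

# Hypothetical in-memory "document tree"; a real daemon reads files from disk.
DOCS = {"/index.html": b"<html><body>Hello from Savannah</body></html>"}


def build_response(request_line, docs=DOCS):
    """Turn a request line such as 'GET /index.html HTTP/1.0' into a reply."""
    parts = request_line.split()
    if len(parts) < 2 or parts[0] != "GET":
        return b"HTTP/1.0 400 Bad Request\r\n\r\n"
    path = "/index.html" if parts[1] == "/" else parts[1]
    body = docs.get(path)
    if body is None:
        return b"HTTP/1.0 404 Not Found\r\n\r\n"
    header = "HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body)
    return header.encode("ascii") + body


def serve(port=8080):
    """Listen, accept, then fork one child per request (Unix only)."""
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # let the kernel reap children
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))       # port 80 needs root; 8080 is an assumption
    listener.listen(5)
    while True:
        conn, _addr = listener.accept()     # the per-client connected socket
        if os.fork() == 0:                  # child: fulfill the request, then exit
            listener.close()
            request = conn.recv(4096).decode("latin-1")
            first_line = request.split("\r\n", 1)[0]
            conn.sendall(build_response(first_line))
            conn.close()
            os._exit(0)
        conn.close()                        # parent: go back to listening
```

Calling serve() blocks forever, which is the point of a daemon. The request handling is kept in build_response so that it can be exercised without opening a socket at all.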

So, "Web" servers are really http servers. They listen for http requests, then dish out the requested hypertext on the negotiated socket to the connected client. The hypertext usually includes HTML (Hyper Text Markup Language), GIF pictures, and so on. The server is not limited in what it can serve on a one-to-one basis. You are only limited by what the client can handle. You could serve compressed cpio dumps if you wanted to - but the browser client would need some sort of helper application to read/list/extract those dumps. Otherwise, the browser would not know how to handle the data, and the user's only choice would be to download the dump.

Note that the http world doesn't use Unix-style "magic" numbers for data typing. Instead, it relies on the perhaps more explicit DOS-world convention of using a dotted file extension to denote data type. For example, HTML files are usually file.html or file.htm, and predictably, GIF style pictures are file.gif.
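To make the extension convention concrete, here is a minimal sketch of a lookup that decides the data type from the file name alone. The table contents, the text/plain fallback, and the folding of extensions to lowercase are assumptions made for the example; they are not taken from any particular server's configuration.

```python
import os

# A tiny, illustrative table; real servers ship a much longer one.
MIME_TYPES = {
    ".html": "text/html",
    ".gif": "image/gif",
    ".txt": "text/plain",
}
DEFAULT_TYPE = "text/plain"   # assumed fallback for unknown extensions


def content_type(filename):
    """Pick a type from the dotted extension alone, never from file contents."""
    _root, ext = os.path.splitext(filename)
    return MIME_TYPES.get(ext.lower(), DEFAULT_TYPE)


for name in ("file.html", "file.gif", "dump.cpio"):
    print(name, "->", content_type(name))
```

Note that the file is never opened. An extension missing from the table simply falls through to the default type, whatever the file really contains.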

There is one wrinkle in this mostly simple service: most, if not all, http servers or daemons support the Common Gateway Interface. Basically, CGI, in conjunction with the http stream (usually via HTML forms), will allow the end-user to run a program on your box (scary, huh?) in a captive, noninteractive mode.

Specifically, CGI parameters are supplied from the browser, via http, to your daemon. The parameters are then either placed in an environment variable or provided as standard input to the CGI script in question, which is run as a child process of the httpd (again, see Figure 1). The program name is also provided by the browser (which it can get from an HTML form). The http daemon ensures that the program exists in its CGI directory and then runs it. Standard output of the CGI program is piped back to the browser client, again via your http server and the negotiated socket. Usually, the output of a CGI program is HTML, though this is not mandatory.
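A rough sketch of the receiving end may help. The environment variable names REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH come from the CGI convention itself. The function names, the "name" form field, and the simulated request at the bottom are hypothetical and exist only for illustration.

```python
import io
from html import escape
from urllib.parse import parse_qs


def read_params(environ, stdin):
    """Collect form parameters the way a CGI program receives them."""
    if environ.get("REQUEST_METHOD", "GET") == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)                # POST: parameters on standard input
    else:
        raw = environ.get("QUERY_STRING", "")   # GET: parameters in the environment
    return {key: values[0] for key, values in parse_qs(raw).items()}


def render(params):
    """Build the reply: a header line, a blank line, then the HTML body."""
    name = escape(params.get("name", "stranger"))   # never echo raw user input
    body = "<html><body>Hello, %s</body></html>\n" % name
    return "Content-Type: text/html\r\n\r\n" + body


# Simulated request, standing in for what the daemon would set up.
get_env = {"REQUEST_METHOD": "GET", "QUERY_STRING": "name=Leo"}
print(render(read_params(get_env, io.StringIO(""))))
```

In a real script, the last two lines would instead pass os.environ and sys.stdin, and whatever the program prints is what the daemon relays to the browser.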

Clearly, you want your CGI programs to behave somewhat securely. Running a shell against a user-provided string, for example, is a bad idea. Allowing users to arbitrarily populate your CGI directory with random programs is probably not a good idea either. But no matter how many scare stories you may hear (and you will), you will probably find yourself CGI-ing at one point or another. If you end up programming CGI scripts, you probably will want to check out David Ray's very good CGI security article in WEBsmith (#2, March 1996).
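As one illustration of the "no shell against user input" rule, the sketch below validates the string first and then hands it to a child process as a single argument, with no shell in between. The helper program here is a stand-in (a second Python process that echoes its argument), and the whitelist pattern is an assumption; a real script would choose both to suit its own job.

```python
import re
import subprocess
import sys

# Assumed whitelist: short names made of letters, digits, '_', '.' and '-'.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]{1,32}")

# Hypothetical helper program: a child Python process that echoes its argument.
HELPER = [sys.executable, "-c", "import sys; print(sys.argv[1])"]


def run_helper(user_input):
    """Run the helper on user input without ever involving a shell."""
    if not SAFE_NAME.fullmatch(user_input):
        raise ValueError("rejected suspicious input: %r" % user_input)
    # An argument list plus shell=False means characters such as ';' or '|'
    # are never interpreted; the input reaches the child as one plain string.
    result = subprocess.run(HELPER + [user_input], shell=False,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


print(run_helper("leo"))
```

The two defenses are independent on purpose: the whitelist rejects odd input early, and the argument list ensures that anything that slips through still cannot become a shell command.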

Implementation

Our first task was to choose an http daemon to use, and I admit, at first our choice was somewhat random. At the time, we could have picked CERN's daemon or NCSA's. We chose NCSA's simply because we had no trouble connecting to their server and found a Linux binary distribution of the NCSA daemon there. In those days, we figured that there was enough adventure and excitement in configuring the daemon without having to also port and compile it. We got the Slackware-style httpd distribution from:

ftp://ftp.cc.gatech.edu/pub/Linux/distributions/Slackware/contrib/httpd.tgz

The Slackware Linux distribution includes the pkgtool utility. This simplifies management of .tgz (tarred and gzipped) archives by treating them as packages - addable and removable as discrete units. Using pkgtool, we added the httpd.tgz package to our system. After installation, we reviewed the logfile in /usr/adm/packages/httpd, and saw that the root path for the httpd was /var/httpd; the configuration files were in the ./conf directory. Other distributions use /usr/local/etc/httpd; the subdirectories are the same: ./conf, ./logs, and ./cgi-bin.

The documentation included with the distribution was sparse. Fortunately, you can find all sorts of terrific and up-to-date documentation online at:

http://hoohoo.ncsa.uiuc.edu

If you're going to be doing a lot of text documentation browsing from your Linux box (and you will), it is a really good idea to get the Lynx text-only browser. It is fast, convenient, and not at all fattening (to your filesystem, that is). Lynx helped us a lot while we were testing our own server and its pages, too! You can get a description and a chance to download it for free from:

http://www.ukans.edu

There are two ways to configure how the http server listens for requests: standalone (running all of the time) or slave, under the control of inetd. The documentation definitely favors the standalone method, and we never like to argue with the folks who write the software, so we left it running standalone. So, no modification of the inetd.conf file was necessary. (Figure 1 is an example of a standalone configuration.)

The httpd ships with its configuration files renamed from filename.conf to filename.conf-dist.

This ensures that you can't simply unpack the distribution and shoot yourself in the foot without at least taking the trouble to rename the files. There are three config files in ./conf of immediate interest: httpd.conf (the main configuration file), srm.conf (server resource management), and access.conf (security). The syntax in all these files is:

ITEM VALUE [VALUE]

There are a lot of ITEMs to configure, and you will definitely want to spend a little bit of time browsing the abovementioned hypertext documentation.
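For readers who like to see a syntax spelled out, the following sketch reads lines of the ITEM VALUE [VALUE] form into a list. It assumes that lines starting with # are comments and does not attempt to handle anything beyond the one-line form. The sample text is a made-up excerpt, not our actual configuration.

```python
def parse_conf(text):
    """Parse 'ITEM VALUE [VALUE]' lines into (item, [values]) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # assumption: '#' marks a comment
            continue
        item, *values = line.split()
        entries.append((item, values))
    return entries


SAMPLE = """\
# hypothetical excerpt, not the author's actual file
ServerAdmin webmaster@example.org
AddType text/html htm
"""

for item, values in parse_conf(SAMPLE):
    print(item, values)
```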

For our purposes, we were able to leave most items alone. We configured site-specific items, such as the ServerAdmin item in the httpd.conf. (See Figure 2 for pertinent excerpts.) Again, you may want to check out http://hoohoo.ncsa.uiuc.edu to gain further understanding of the many specialized configuration options.

For example, there are options that allow your server to offer several different sets of "home" (i.e., index.html) pages depending on which network interface and "hostname" the request comes from. In other words, you could serve a different initial default page for your Ethernet interface (http://jon.dog.com) versus your Token Ring or even "dummy" interface (http://leo.meow.com). See the VirtualHost option if you're interested in doing this. You can also allow users to create their own pages, and so forth. But if your purpose, like ours, is to simply serve one set of documents to the world at large, NCSA has done most of the configuration for you!
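The idea behind serving different home pages per interface can be sketched as a lookup keyed on the local address a connection arrived on. This is not the VirtualHost implementation, only the underlying principle. The addresses are documentation-only examples, and the directory names under /var/httpd are assumptions.

```python
# Hypothetical mapping from the local address a request arrived on to the
# directory holding that host's pages.
DOCROOTS = {
    "192.0.2.10": "/var/httpd/htdocs/jon",   # e.g. the Ethernet interface
    "192.0.2.20": "/var/httpd/htdocs/leo",   # e.g. a second or "dummy" interface
}
DEFAULT_DOCROOT = "/var/httpd/htdocs"


def docroot_for(local_address):
    """Choose a document tree from the address the client connected to."""
    return DOCROOTS.get(local_address, DEFAULT_DOCROOT)


def home_page_for(conn):
    """Given a connected socket, return the path of the home page to serve."""
    local_ip = conn.getsockname()[0]   # the address of our end of the connection
    return docroot_for(local_ip) + "/index.html"


print(docroot_for("192.0.2.20"))
```

Because each hostname resolves to a different interface address, the server can tell the two apart by asking the connected socket which local address it is bound to.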

Since we had now configured the httpd to be a standalone daemon, we had to invoke it manually. We tested it by typing:

# /usr/sbin/httpd

Oops! It complained that it didn't know about group ID -1, which it runs as for security reasons. We fixed that by editing /etc/group to include a group with ID -1:

nogroup::-1:
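A quick way to check for such an entry before starting the daemon is to scan the group file yourself. The sketch below relies only on the standard /etc/group layout of name, password, numeric ID, and member list, separated by colons. The function name and the sample lines are made up for the example.

```python
def find_group(lines, gid):
    """Return the name of the group with the given numeric ID, or None."""
    for line in lines:
        fields = line.strip().split(":")
        # /etc/group fields: name, password, numeric ID, member list
        if len(fields) >= 3 and fields[2] == str(gid):
            return fields[0]
    return None


sample = ["root::0:root", "nogroup::-1:"]
print(find_group(sample, -1))
```

Against a live system, you would pass an open file instead of the sample list, for instance find_group(open("/etc/group"), -1).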

We ran the httpd again, and this time it didn't complain. We used the Lynx browser to browse:

http://localhost

and found that, yes indeed, it worked! Since we had no index.html file in the htdocs directory, all we saw was a directory listing of an empty directory - not very exciting. But, after we populated the htdocs directory with a simple index.html file, we were on our way to some serious Web-slinging!
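The behavior we saw (a listing when index.html is absent, the page itself once it exists) can be imitated with a short sketch. It is a simplification under stated assumptions: the function name and the HTML of the listing are invented here, and a temporary scratch directory stands in for htdocs.

```python
import os
import tempfile
from html import escape


def default_document(directory):
    """Return index.html if present, otherwise a bare listing of the directory."""
    index = os.path.join(directory, "index.html")
    if os.path.isfile(index):
        with open(index, encoding="latin-1") as handle:
            return handle.read()
    names = sorted(os.listdir(directory))
    items = "".join("<li>%s</li>" % escape(name) for name in names)
    return "<html><body><h1>Index of /</h1><ul>%s</ul></body></html>" % items


# Demonstration in a scratch directory standing in for htdocs.
with tempfile.TemporaryDirectory() as htdocs:
    print(default_document(htdocs))          # a listing with no entries
    with open(os.path.join(htdocs, "index.html"), "w") as handle:
        handle.write("<html><body>Welcome</body></html>")
    print(default_document(htdocs))          # now the home page itself
```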

Meanwhile, our public librarians had been busy writing up local reference pages in HTML for our public access terminals. The librarians ftp'd the pages over - and we promptly hit a snag. Their DOS-based systems had only allowed them to use a .HTM extension. When the httpd did not see a .HTML extension, it served the page as plain text. Ugly! Renaming all of those files and editing the links was not attractive, which left us two choices: change the default data type to HTML, or map the .HTM file extension to the HTML MIME type. We chose to be explicit rather than broad, and added the mapping in the srm.conf file (Figure 1), with the line:

AddType text/html htm
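The effect of that one line can be modeled as adding an entry to an extension table. The sketch below is an approximation, not the daemon's code: the function names are hypothetical, extensions are folded to lowercase as a simplifying assumption, and the parser tolerates several extensions on one line even though only the single-extension form appears above.

```python
def apply_addtype(types, directive):
    """Apply one 'AddType type/subtype ext' line to an extension table."""
    keyword, mime_type, *extensions = directive.split()
    if keyword != "AddType":
        raise ValueError("not an AddType line: %r" % directive)
    for ext in extensions:
        types["." + ext.lstrip(".").lower()] = mime_type
    return types


def content_type(filename, types, default="text/plain"):
    """Look up the type by extension, falling back to the default type."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    return types.get(ext, default)


types = {".html": "text/html", ".gif": "image/gif"}
print(content_type("HOURS.HTM", types))      # text/plain: extension not known
apply_addtype(types, "AddType text/html htm")
print(content_type("HOURS.HTM", types))      # text/html after the new mapping
```

Changing the default type instead would have turned every unknown file into HTML, which is why the narrower mapping was the more explicit choice.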

Now that things were working, we ensured that the daemon would autostart on bootup, and we edited /etc/rc.d/rc.inet2 to include the httpd:

IN_SERV="lpd httpd"

We ran for a while, being careful to advertise the server as www, a DNS alias to the outside world that we had set up, instead of using the machine's canonical name. After our test period, we rescued a PC from the salvage box, put 8 MB of RAM into it with a 200 MB hard drive, and cloned the first machine to it over the network. (See "Using Linux as a Router," Sys Admin, January 1996 for details on cloning Linux over a network.) We then made the new machine our permanent production server. All we did was change the DNS alias to point to the new box, and the outside world never noticed the difference.

Security Rears Its ELFin Head

And at this point, our story should have ended, but a CERT (Computer Emergency Response Team) advisory, CA-95:04, alerted us to a serious security problem with the version 1.3 httpd that we were running. So, it was time to upgrade to version 1.5a (the latest and greatest at the time of this writing). We looked to the same site that had provided us with the Slackware httpd package in the first place - it was still at 1.3! Fortunately, NCSA itself provides Linux binaries at:

ftp://ftp.ncsa.uiuc.edu

Not so fortunately, NCSA also links its binaries using the ELF, not a.out, format. This meant that we couldn't use them on our systems until we upgraded our kernel and shared libraries. For the short term, this wasn't acceptable. It was time to bite the bullet and compile the daemon.
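Here, for once, a Unix-style magic number is exactly the right tool: an ELF file always begins with the byte 0x7F followed by the letters ELF. A minimal check, with a hypothetical function name, might look like this:

```python
ELF_MAGIC = b"\x7fELF"   # the first four bytes of every ELF file


def is_elf(path):
    """Report whether a file starts with the ELF magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == ELF_MAGIC
```

Running is_elf on a downloaded httpd binary before installing it tells you whether it is an ELF file. The sketch only answers "ELF or not"; it does not try to recognize the a.out variants.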

We downloaded the sources from the above NCSA ftp site. The compile was a very tame adventure - all we had to do was to unpack the tar file, change into the resulting directory, and type:

make linux

The sources compiled cleanly on our system, which was a relief. If you don't feel like compiling the sources yourself, you can get our resulting a.out binary distribution from:

ftp://ftp.co.chatham.ga.us/pub

We replaced the old httpd binary with the new, then monitored the ./logs/error_log carefully for a while. So far, everything looks peachy.
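Watching a log by hand gets old quickly. One way to automate it, sketched below under the assumption that the log is a plain text file that only ever grows, is to remember how far you have read and fetch only what was appended since. The function name is made up, and nothing here depends on the format of the log lines.

```python
def read_new_lines(path, offset=0):
    """Return lines appended to a log since `offset`, plus the new offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Only hand back complete lines; a partial last line waits for next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("latin-1").splitlines()
    return lines, offset + end
```

Called in a loop with a pause between calls, and pointed at the error_log under the server's ./logs directory, this returns only the fresh entries each time around.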

One of the neater things about this project was that we, a busy MIS department, turned the content of the Web pages over to our very technically competent Public Library system. After all, they are the public information specialists! It has been a very rewarding symbiotic relationship. We get to work with operating systems and other techno-geek stuff, and they have done an incredible job organizing a set of local pages that reflects local interest and guides the beginning user to the Web in a very easy and friendly way!

About the Author

Jonathan Feldman works with Unix and NetWare at the Chatham County Government in Savannah, Georgia. He likes to keep things simple so that even he can understand them. His son Leo has just convinced him that "Go, Dog, Go!" is a way of life, and that with copious quantities of Mommy and chocolate milk, anything is possible. He is reachable via email at jonathan@co.chatham.ga.us.