Why is My Web Site Down?
"I'm trying to access our Web site, but it's looks like it's down! What's the problem?"
This is never a fun question to have to answer, no matter how experienced you are at Web site administration. And for those for whom Web server administration is a new task piled on top of a lot of others, it can be particularly stressful. In this article, I'll discuss some troubleshooting tips to help diagnose basic kinds of Web server failures. I'll begin with a simple discussion of the architecture of a typical Web server connection, from the end user to the Web server, and points in between. I'll then look at some methods for pinpointing where typical failures might occur, and break down the problem into five general categories:
- The client or the Internet is the problem.
- An explicit error message is displayed.
- There is a connectivity problem between the server and Internet or vice versa.
- The connection times out or fails.
- Everything looks fine now - the user must have been imagining something.
Finally, I'll discuss the usefulness of automated monitoring, both for gathering historical data to diagnose and prevent problems, and for warning you of existing problems before you get those irate calls and emails.
Your Basic Web Server Connection
The architecture shown in Figure 1 is a typical architecture of a remote Internet client (such as a dialed-in Web browser) connecting to a Web server. The client machine connects to the Internet via an Internet Service Provider (ISP). The ISP connects to the Internet, either directly, or through one or more higher level service providers. In this article, I'll use the term "ISP" to denote all of the service providers between the Internet and the local network. The "Internet" includes the high-capacity backbone links that move data from one region to another. The Web server also connects to an ISP, which in turn connects to the Internet. The Web server's site generally includes some sort of Domain Name Server (DNS), which translates machine names into IP addresses locally.
The Web server runs a minimum of an HTTP daemon, which serves Web page requests and runs server-side processes for generating dynamic Web pages. Typically, the server-side process is invoked via a Common Gateway Interface (CGI), or the Web server's built-in API (Application Programming Interface), and may access other back-end processes, such as databases or search engines. There may also be other daemons running, such as for FTP, News, or Gopher services.
My discussion starts at the user's side and works toward the server's side. The analysis of tools and solutions becomes fuller the closer we get to the server, because that's where an administrator has the most control. An administrator can also indirectly affect the performance of the Web server's ISP, either by contacting the ISP and requesting that the problem be resolved, or by switching to a more reliable ISP. Problems that occur from the Internet out to the client are beyond the administrator's control. It is, however, good to know where the problem lies if possible, both to help the end user and to have hard data on problems areas if an explanation of the incident becomes necessary.
Using redundant network connections to different ISPs (known as multi-homing) can minimize the risk of links failing or congesting. Given enough resources, running the Web site from Web servers at a number of distributed ISPs will give you the best chance for real 24 hour/7 day a week uptime. If you're running an intranet site, some of the remote problems of the Internet go away. Unfortunately, you'll probably still have to figure out exactly where the problem lies, so you'll have some hard evidence when you deal with the network administrator.
Client or Not?
When confronted with a question about the availability of their Web server, the first thing most administrators do is pop open their own browser and access the site themselves. This is a great first step because it can determine whether the Web server is really responding or not. If it is, then you know you're not faced with a major Web server failure, although the problem could still be on your site. When troubleshooting user reported problems, it's important to remember that there's a complete subsystem of client-side software (often confusing to configure) that can make the user think that your site is down. End users don't generally think about the Internet architecture when determining why they have a problem (and they shouldn't have to). They simply try to go to your site, and if it doesn't respond in an expected and timely manner, they assume the site is down. If you're lucky, they'll let you know; otherwise, they'll just go somewhere else.
First, you'll need to figure out whether this is a client problem. If you can talk to the end user, ask them whether they can reach a few other sites (pick your favorites, but make sure that they are reliable and easy to type). If they can't reach them, then the problem is theirs. Their software configuration, their network connection, or their ISP's connection could be at fault. It's handy at this time to refer either to your own monitoring of external Web sites, or to look to one of the Internet monitoring sites (see http://www.yahoo.com/Computers_and_Internet/ \
Internet/Statistics_and_Demographics/Traffic/ for a list of these sites), to see if there is a problem with the Internet itself. Having at least one account with an ISP other than your own can be extremely useful, especially if the end user can't try other sites for you directly.
The Error Message Says...
When the problem isn't occurring with other sites, it's time to look more closely at your own Web site. Certain problems with the Web site show up as error messages in the browser.
If the browser is returning a 404 error but the URL seems correct, or if some URLs on your site work and others don't, then you may have broken links. Check to see if anyone has modified the site lately - changing, moving, or deleting files. Use link-checking software to search for problems. Scan your Web server logs for errors to see if this error has occurred before. If the document really exists in the correct location, there may be a permissions problem of some kind. Sometimes permissions errors surface as a 401 (forbidden) error. Check out both the Web server configuration and the file permissions. Keep in mind which account the Web server is running as when checking file permissions.
If the failure is that the server could not be found, then ask the user to access your server by IP address (for example, 18.104.22.168) instead of domain name (www.yourserver.com). If the IP address works, then there's a problem with the DNS. The DNS could be misconfigured, not getting updated, or it could be missing secondary DNS lookups. You can use nslookup or dig to figure out these problems (see Nemeth et al.).
When some URLs work (such as http://, ftp://, or telnet://) and others don't, then one of the service daemons may be down. Check to make sure that daemon processes are running (by using ps for example) and responding (by connecting with an appropriate piece of client software).
Consistent "Server Busy" errors mean the simultaneous connections to the Web server are maxed out. Questions to ask when diagnosing the cause of this error include:
- Is this a busy time of day?
- Has site traffic increased significantly lately?
- Are there many accesses via slow connections? (connections that take a significant amount of time to close)
If this problem occurs on a regular basis, it may be time to consider upgrading your hardware or buying an additional Web server to handle the load.
A common error that manifests itself in a variety of ways is the failure of a server-side process or script (CGI, executable, servlet, etc.). The most obvious are 500 or 501 errors, but these processes can also display a failure by issuing a terse error code or message, or a polite explanation of the nature of the problem. Tracking down process or script errors is usually a case of software debugging. Check permissions on the script or executable file - if it can't be accessed or executed by the Web server (again, remember which account the Web server is running as), then an error will be returned. For scripts (such as Perl or shell scripts), run a syntax checker on the script to ensure that the code compiles. "Insignificant" changes to a script made directly on the Web server can result in syntax errors (although script changes should always be tested). A new version of an interpreter or libraries can also cause scripts to break.
Try running the script from the command line. Depending on how the script is set up, this can be tricky, especially if the script takes a number of parameters or relies on a host of environment variables (consult an HTTP reference for a list of these variables). Ensuring that the scripts are easy to test from the command line can be a major time saver when a problem occurs. Scanning the Web server's logs for errors can also be helpful, because you may be able to pinpoint the time when the problem started occurring.
Often, when a server-side process returns an error, it's useful to see the raw text that the process returns, because in some cases, the process might be outputting improper HTTP headers, malformed HTML, error messages, or debug text, causing the browser to return an error without displaying the output. To see what an HTTP request is returning, you can use telnet to connect directly to the Web server's TCP/IP port. Here's an example of using telnet to retrieve the output of a script:
telnet www.yourserver.com 80
[if the connection is made, then telnet will wait for input.]
GET /cgi/helloworld.pl HTTP/1.0[return] [return]
After you type the second return, the Web server should respond with some text, including headers (fields that look like "Content-type: text/html") and the rest of the document. You should also see the erroneous output instead of the headers. Note that this simple version of GET doesn't work in all cases, such as when the script is being tricky and responding differently to various browser types. With a relatively simple knowledge of the HTTP protocol however, you can simulate a browser fairly closely.
The Connection Times Out
When the user says something like "I clicked on the link, and then waited and waited, but nothing happened," it's time to examine the server's connections and performance.
Try to retrieve the URL that the user was attempting to reach. If that works, try contacting a machine outside of your site, either by accessing a URL on that machine or by pinging it. A failure of this test indicates that either you or your ISP has a problem - your connection to the Internet is either slow or broken. A traceroute can help uncover where the problem lies. If you know the names or IP addresses of the various routers between your Web server and the Internet, pinging each can help determine where the failure is occurring. Restarting or cycling power on network equipment to make sure that it's properly initialized will sometimes clear up problems. You might also want to take a look outside your facility to see if there are any phone trucks or construction equipment around. Sometimes finding connection problems like this is as simple as talking to the repair person.
If everything seems to be working from the "inside," you may have a problem in the configuration of your firewall or routers. Try using a remote ping or traceroute to diagnose these problems. You can find the URLs for a few traceroute and ping servers in the references at the end of this article. This is also a good time to dial in to your remote ISP account, if you have one, and try a ping or traceroute from there. The purpose is to see if your machine is visible from the outside. These tests can also verify whether the connection is slow and where the delay might be. If your firewall doesn't allow ICMP (ping) packets through, using a remote URL getter can also do the trick. Common causes of this type of error are inadvertent mistakes when changing the firewall or router configurations, or configurations that aren't saved and come up with old (incorrect) settings after power outages.
If you couldn't retrieve the URL the user was having trouble with (the connection hung or timed out), then you should examine the Web server machine more closely. See if you can ping the Web server machine. It's always possible that you've had a hardware meltdown of some kind. More likely, however, the Web server process might be hung, crashed, or simply not running. Misbehaving applications and scripts that crash the Web server or are extremely resource-intensive can cause requests to time out. If the Web server has been rebooted lately, the problem may be as simple as having forgotten to automatically start up a required process, such as the HTTP daemon, at startup. A (re)start of the process often clears up this problem. However, you'll probably want to duplicate the user's actions to try and figure out the cause of the problem.
Another cause may be a process running on the machine which has nothing to do with the Web server, but is eating up enough resources (CPU, memory, etc.) to significantly affect the performance of the Web server. If the offending process is still running at the same gluttonous rate, it will be pretty easy to identify (use your favorite tool, such as top, ps, or sar). By the time you check the machine, however, the process most likely will have either run to completion or gone to sleep.
Full disks can cause a number of operating system errors that manifest themselves in strange and misleading ways. Checking available disk space is simple (df -k) and eliminates looking for more obscure problems. Full disks can be easily caused by fast growing log files, or by a bunch of huge multimedia files that were just added to the site.
Checking basic system performance statistics, such as CPU usage, memory usage, page swapping, I/O performance, and network performance (see Nemeth et.al.) can help determine if the machine has enough horsepower to handle its load. If you find that the server is indeed overloaded in the course of normal operation, then it's probably time to either beef up the hardware or split up the site onto a number of machines.
I Can't Seem to Reproduce the Problem
What if you've gone through all of the previous troubleshooting steps and nothing seems to be wrong? In fact, the user even tried the URL again, and now it works! The problem fixed itself - you're home free. Not! Sporadic errors do occur on the Internet, but usually there's a good reason for these errors, and they can often be prevented. The reason these problems can seem flaky is that the vast majority of them go unreported by end users. To see the unreported ones, some sort of automated monitoring system is extremely useful. Basically, a monitoring system can do the same thing that an end user does (access Web pages, name lookup) or that a diagnostic tool does (ping, ps, or df), except that it performs these actions periodically and records the results. By reviewing the results of this monitoring, you can form a more complete picture of how your Web server is performing over time and the system state before and during a failure.
Here's an example to illustrate the usefulness of having a history of your site's performance. A Web administrator who had been running automated monitoring software said he thought there was something wrong with the monitoring setup. Requests for Web pages from his server were shown as timing out at the top of every hour, and he hadn't heard any complaints from his users. When we tried out his site manually with a browser, sure enough, at the top of the hour, the server response was incredibly slow (as in minutes to load a Web page). It turned out that there was a scheduled job being run on the hour, whose resource requirements slowed the server to a crawl. Without the historical data, the users who might have eventually reported this problem would have had to provide quite accurate time information to find this pattern.
Automated monitoring systems exist in a large number of forms: commercial products, shareware, freeware scripts, and homegrown solutions. They all share the common abilities of periodically measuring some aspects of the system and recording this data. By keeping historical records, trends that will lead to problems can be nipped in the bud. Knowing that a disk is nearing capacity allows some quick file cleaning or additional storage to be added, before the system comes to a screeching halt.
An Ounce of Prevention
A number of things can be done to keep your Web server running smoothly. There are significant differences in the error rates between sites that are run well and sites that aren't. The Internet isn't as random a frontier as it's sometimes portrayed to be. Here are some tips to get your Web server running smoothly and cleanly.
- Don't put anything unnecessary on the Web server. For many administrators, this is already a hard and fast rule, but it bears repeating. There are enough pieces that can go wrong without introducing more. Also, don't install software on Friday evening or just before you leave on a week-long vacation.
- Make all site updates on a test server first. To combat link errors and broken scripts, make all your site updates to a test Web server, and then run a suite of tests on it (including a link check). Only after these tests are passed should you update your main Web server. It's tempting (especially for those not used to maintaining running systems) just to make a "simple" change directly to the HTML or script files on the main Web server. If this seems to be a problem for you, run a link checker against your Web site every day.
- Keep track of the quality of your Internet connection. If your ISP consistently experiences problems, either get them to fix the problems or switch ISPs. Automated monitoring can help detect ISP problems, and there are services available that test and rate ISP reliability.
- Know about problems before your users do. If you are using an automated monitoring tool attached to an alerting mechanism such as a pager, you'll be contacted when errors occur, allowing you to diagnose and fix a problem before your users are affected. Make sure that the alerts are configured so that the errors reported are real; nothing defeats an automated alerting system faster than a string of bogus "emergencies." Best of all, when certain kinds of problems are detected (such as crashed Web servers or full disks), automated corrective actions can be taken, which is greatly preferable to hopping in the car (or logging in remotely) at 3 a.m. just to type ./restart at a command line. After a problem has been fixed, automated monitoring can confirm that everything is operational.
This article has provided a starting point for diagnosing Web server problems, and helping to prevent those nasty phone calls. To a large extent, diagnosing problems on a Web server is the same as determining the source of other problems on a system. The biggest difference involves complexities introduced by the communications links to your Web server. You can, however, use the same logical progression diagnosis techniques for both sets of problems. I'd be very interested in hearing about other tips and techniques that you use for keeping your Web server running smoothly. Drop me an email at email@example.com.
Spainhour, S. and V. Quercia. 1996. Webmaster In a Nutshell. O'Reilly and Associates, Sebastopol, CA.
Nemeth, E., G. Snyder, S. Seebass, and T.R. Hein. 1995. UNIX System Administration Handbook. Prentice Hall.
Retrieve a URL, do a traceroute, or ping a host from Freshwater Software's site:
About the Author
Pete Welter (firstname.lastname@example.org) is a senior software engineer at Freshwater Software, Inc. (http://www.freshtech.com) in Boulder, CO. He spends his days talking to Web administrators about monitoring their Web sites and developing commercial Web monitoring tools.