The Art
of Spidering
Reinhard Erich Voglmaier
Every Webmaster will encounter a robot sooner or later. He or
she will typically find proof of the activities of "strange
browsers" in logfiles. So, what is a robot? A robot (also known
as a spider) is a procedure that visits Web sites with the objective
of gathering information. The robot does not limit itself to getting
the information from just one Web page, but also tries to get the
links mentioned in this page. If it kept going in this way by following
all the links, it would eventually spider the whole Internet. This
means the robot needs limits, defined in its configuration file, that
tell it where to stop; I'll discuss these later in the article.
Sometimes the activities of robots are welcome inasmuch as they
provide important information about the content of the spidered
site to possible users. Sometimes, however, the visits are not wanted,
especially when they begin to consume a lot of bandwidth, penalizing
the traffic for which the Web site was originally intended. For
this reason, there are so-called robot rules that "well-behaving"
robots obey. This good behavior is called "netiquette"
and must be programmed into the robot or spider.
In this article, I will explain how to write a spider that completely
mirrors a Web site. Note that I use the words spider and robot
interchangeably, as they are used in the literature. The first
script described in this article copies simple Web pages from a
remote site to a local site. I will also integrate a parser that
extracts the hyperlinks contained in the copied Web pages and show
how to complete this approach with a stack object that handles the
download of the documents needed. I will also look at what software
exists, because we do not want to reinvent the wheel. These programs
are all available from the Perl sites (http://www.perl.org
and http://www.perl.com).
All the examples in this article are written in Perl, which is standard
on UNIX operating systems but also available for VMS, Win32 architectures,
and others. The auxiliary libraries are available from CPAN (http://www.cpan.org/).
To locate them, I highly recommend using the search engine located
on the server of the University of Winnipeg:
http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html
Copying Single Pages
To copy a single Web page, we must set up a couple of things.
We first arrange for a connection to the Web server on which the
page we want to copy lives. This may also involve contacting a proxy
server, authenticating ourselves against the proxy server, and setting
up some parameters for the connection. Fortunately, there is a library
(LWP) available on CPAN that does what we need. The script in Listing
1 copies a page from a remote Web server to the local file system.
(Listings for this article can be found on the Sys Admin Web site
at: http://www.sysadminmag.com.)
The first lines of Listing 1 set up the remote page and the location
where the page should go. Line 7 sets up the most important data
structure, called the UserAgent. It holds all information about
the connection. For now, we'll use the simplest form. The request
is very easy to write; we want to "GET the Remote Page",
so the request reads:
HTTP::Request->new('GET', $RemotePage)
Once we've defined a request object, we can fetch the page with
the method request(). This method simply copies the RemotePage
to the LocalPage.
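As an illustration, a minimal script in the spirit of Listing 1 might look like the following sketch (the URL and the local file name are placeholders, not the values used in the listing):

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

my $RemotePage = 'http://www.example.com/index.html';   # page to copy (placeholder)
my $LocalPage  = '/tmp/index.html';                      # where to store it (placeholder)

my $ua  = LWP::UserAgent->new();                         # the UserAgent object
my $req = HTTP::Request->new('GET', $RemotePage);        # the GET request

# request() with a file name as second argument writes the content to that file
$ua->request($req, $LocalPage);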
Copying Single Pages Realistically
The previous situation was easy but not very realistic, because
there was no firewall. We got the page quickly, and no user credentials
were required. Listing 2 shows how to extend the first approach.
In lines 15-18, we set up the name of the proxy server our spider
should use and the domains that do not need proxying. We define
the name of our browser to let the spidered site know what's
happening. Finally, we set up the time-out after which we will abort
the script.
When we get the page, we check the exit status of the request;
for this, we use the HTTP::Status library. We don't copy the page
to the local file system, but keep it in memory for later use.
Along with the status code (I use the code() method), we also get
a human-readable explanation of success or failure using the
message() method. These are not the only options
the UserAgent understands, however. For a complete list, look at
the man pages delivered with the UserAgent library.
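As a rough sketch of these additions (the proxy name, domain, browser name, and time-out below are invented placeholders, not the values of Listing 2):

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Status;

my $RemotePage = 'http://www.example.com/';                       # placeholder URL
my $ua = LWP::UserAgent->new();
$ua->proxy(['http', 'ftp'], 'http://proxy.ourdomain.com:8080/');  # proxy to use
$ua->no_proxy('ourdomain.com');                                   # domains that need no proxy
$ua->agent('OurSpider/1.0');                                      # browser name we report
$ua->timeout(30);                                                 # abort after 30 seconds

my $res = $ua->request(HTTP::Request->new('GET', $RemotePage));
if ($res->is_success) {
    my $HTMLPage = $res->content;                  # keep the page in memory
} else {
    print 'Failed: ', $res->code, ' ', $res->message, "\n";
}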
Parsing the Web Pages
In many cases, what we've already done is enough. On my own
Web sites, I have a number of pages that contain information mirrored
from other Web sites. But because we want to get not only a single
Web page, but a whole site, the spider must follow the links contained
in the downloaded pages. We need an HTML parser, and we have at
least two options. The first is to use Perl's built-in regular
expression support. Because HTML has a well-defined
syntax to describe references (essentially, links to other documents,
images, or similar objects included in Web pages), this is not very
difficult. The second option is to use the Link Extractor module
(CPAN, as I mentioned before).
I will show the method of coding the parse process by hand. Remember,
the goal is to mirror the Web site on our local file system, so
the user clicking on a link on our local server also expects
the documents referenced in this page (images, for example) to be
on the same server. Thus, the parser must transform the link to
work on the local system as well. In Perl, this looks like:
$HTMLPage =~ s/RegularExpression/TransformURL()/eig;
The switch "e" means that we want to substitute the
regular expression with the return value of the function. The "i"
switch ignores case; "g" expresses that we want to
have the command executed not just for the first regular expression,
but for all of them found in the document.
In this example, we are interested in Links and Images, so we
will scan for expressions like:
<a href="./to_another_document.html" > Click here </a>
<img src="./pictures/One.jpg" .... ...... ..... >
The easiest form of the regular expression to match these links is:
s#<(a href=)"([^"]+)"#TransformURL($1,$2)#eig
s#<(img src=)"([^"]+)"#TransformURL($1,$2)#eig
(If you need a short introduction on regular expressions, I recommend
one of the Perl books or sites such as http://www.perl.com/pub/p/Mastering_Regular_Expressions.)
These two examples will not find all links. For example, if there
is a space between "href" and "=",
it will not be found, so you will need to put in something like
this:
\s*
which means zero or more spaces. You will need more patterns to match
backgrounds, images, sounds, and so on. I recommend beginning with this
simple structure and then adding more patterns so that even links with
sloppy syntax are caught. In the listings available from www.sysadminmag.com,
you will find what I'm using on my site.
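To give an idea of how such a relaxed pattern might look, here is a small, self-contained sketch; TransformURL() below is only a stub standing in for the real function provided in the listings:

use strict;

my $HTMLPage = '<a href = "./to_another_document.html">Click here</a>';

# \s* allows optional whitespace around the tag name and the "=",
# and href and src are handled by one alternation
$HTMLPage =~ s#<\s*(a\s+href|img\s+src)\s*=\s*"([^"]+)"#TransformURL($1, $2)#eig;

sub TransformURL {
    my ($Tag, $URL) = @_;
    # the real function would remember $URL for downloading and rewrite
    # it to point to the local copy; this stub leaves it unchanged
    return "<$Tag=\"$URL\"";
}

print $HTMLPage, "\n";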
Memory Structures
With the previously explained parsing mechanism, we can now get
all the pages, images, or other objects referenced in the downloaded
pages. If the spidered Web site does not reference external sites,
we get the whole Web site. But even if this condition is true, there
are still other problems. What if several pages are referenced more
than once or if two pages reference each other? We would then simply
end up in a loop.
This means that we need memory structures to keep track of which
pages to visit and which pages have already been downloaded. There
are four arrays holding the data for the download decisions:
- IncludedURLs
- ExcludedURLs
- VisitedURLs
- ToVisitURLs
The first two arrays define the name space in which the spider works.
IncludedURLs contains the list of URLs we want to get, and ExcludedURLs
lists the pages we don't want to be considered. The other two arrays
contain housekeeping information: the URLs that have already been
downloaded (VisitedURLs) and the URLs our spider still needs to
download (ToVisitURLs). The last array is a stack filled by the TransformURL
function. When filling the array, the TransformURL function consults
the other three arrays, as well. See Listing 3.
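A skeleton of this bookkeeping might look like the following sketch; the array names are the ones above, while RememberURL() is an invented helper that only illustrates the kind of check TransformURL has to perform:

use strict;

my @IncludedURLs = ('http://www.example.com/');   # name space we want to stay in
my @ExcludedURLs = ();                            # URLs we never want to fetch
my @VisitedURLs  = ();                            # pages already downloaded
my @ToVisitURLs  = ('http://www.example.com/');   # stack of pages still to fetch

# called for every URL the parser finds in a downloaded page
sub RememberURL {
    my ($URL) = @_;
    return if grep { $_ eq $URL } @VisitedURLs;                  # already done
    return if grep { $_ eq $URL } @ToVisitURLs;                  # already queued
    return if grep { index($URL, $_) == 0 } @ExcludedURLs;       # explicitly excluded
    return unless grep { index($URL, $_) == 0 } @IncludedURLs;   # outside the name space
    push @ToVisitURLs, $URL;                                     # schedule for download
}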
The Request Loop and Helpers
We can now open an HTML page, examine the pages and images to
which it is pointing, and retrieve these pages and images. The dynamic
part of the spider and the request loop are still missing. Simply
written, it looks like this:
while ( <condition> ) { getPage(); }
The condition indicates whether we need to get pages from the site.
Remember that we put all the pages yet to be visited in an array.
The array is initialized with the first page we want to spider. Every
referenced page is put into the array and, once visited, erased from
it again. You can write a procedure that reports whether the array is
empty, which is also handy for controlling other conditions. The
procedure could, for example, dump the memory structures if the user
wants to stop the spider and let it resume its work later.
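Slightly less schematically, the loop could be written as follows; getPage() stands for the download-and-parse routine built in the previous sections, the array names follow those above, and PagesToVisit() is an invented name for the condition procedure just described:

# keep fetching as long as the stack of pages to visit is not empty
while (PagesToVisit()) {
    my $URL = pop @ToVisitURLs;     # take the next page from the stack
    getPage($URL);                  # download it, parse it, and refill the stack
    push @VisitedURLs, $URL;        # remember that we have been here
}

# true while there is still work to do; a natural place to add other
# stop conditions or to dump the memory structures for a later restart
sub PagesToVisit {
    return scalar @ToVisitURLs;
}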
We also need to continuously transform URLs between their relative
and absolute forms. If you put the pages on a local Web server,
you must additionally map the physical paths under which the downloaded
pages are stored to the logical view of these pages as it appears
in the links. For example, if you have a link such as:
href="webmaster/Java/Intro.html"
this may be stored on your file system as:
/disk1/htdocs/webmaster/Java/Intro.html
For this purpose, I have provided the helper functions in the listings.
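In simplified form, such a helper could look like the following sketch; the document root /disk1/htdocs is of course only an example, and LinkToPath() is an invented name rather than one of the functions in the listings:

use strict;

my $DocumentRoot = '/disk1/htdocs';      # local document root (example value)

# map the logical link to the physical path where the page is stored
sub LinkToPath {
    my ($Link) = @_;
    $Link =~ s#^\./##;                   # drop a leading "./"
    return "$DocumentRoot/$Link";
}

print LinkToPath('webmaster/Java/Intro.html'), "\n";
# prints /disk1/htdocs/webmaster/Java/Intro.html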
Just Existing Software
It is always worthwhile to look for ready-to-use software. The
first tool I recommend is LWP::RobotUA. It offers the same interface
as the UserAgent library used before, but additionally provides methods
to obey the robot rules, letting you see whether robots are welcome
at all (it consults the site's robots.txt file). See:
http://info.webcrawler.com/mak/projects/robots/robots.html
for more information on robot rules.
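Using LWP::RobotUA looks much like using the plain UserAgent; here is a minimal sketch (the robot name and contact address are placeholders):

use strict;
use LWP::RobotUA;
use HTTP::Request;

# the constructor wants a name for the robot and a contact address
my $ua = LWP::RobotUA->new('OurSpider/1.0', 'webmaster@ourdomain.com');
$ua->delay(1);                           # wait at least one minute between requests

# robots.txt is fetched and obeyed before the real request is sent
my $res = $ua->request(HTTP::Request->new('GET', 'http://www.example.com/'));
print $res->code, ' ', $res->message, "\n";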
Another application is w3mir, powerful Web mirroring software
available at:
http://langfeldt.net/w3mir/
It can use a configuration file and may be useful for your needs.
Furthermore, it is well documented. The following example, taken from
the w3mir documentation, retrieves the contents of the Sega site through
a proxy, pausing for 30 seconds between each document it copies:
w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/
Another option is Webmirror, available at:
http://www.math.fu-berlin.de/~leitner/perl/WebMirror-1.0.tar.gz
This program has a good man page and is also easy to use. It can read
a configuration file in which you specify the more complicated options.
Although such software already exists, we often want to achieve
"special effects", or we may have special needs, such as using our
own script as the front end of a search engine, that make writing
our own spider the better choice.
Limits
I have described a powerful tool capable of copying an entire
Web site to your local system given just the initial URL and some
rules about where to stop. Clearly, there is other information you
may need, such as the configuration of a proxy server. Also keep
in mind that real life is never so easy. Let's look at some
examples.
There are a lot of dynamic Web sites around -- I don't
mean sites with a lot of images moving and blinking; I mean sites
that are constructed depending on the user input, such as forms.
Forms offer many possibilities to send user choices to the Web server.
To deal with these cases, you'll need to build more intelligence
into your spider. The LWP libraries offer the ability to simulate
the click on the Submit button of a form, but you would need to
simulate every choice the user has in order to get the complete picture.
Consider the timetable of a railway company or of "Lufthansa", for
example, to get an idea of what this could mean. How far you go depends
heavily on what you expect from your spider; it may sometimes be
convenient to exclude these pages. To exclude them, just do nothing;
since there is no plain link behind the form, your spider will not
follow it.
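If you do decide to follow a particular form, a simulated submit can be sketched roughly as follows (the form URL and the field names are invented for the example):

use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new();

# pretend the user filled in the two (invented) fields and pressed Submit
my $res = $ua->request(POST 'http://www.example.com/cgi-bin/timetable',
                       [ from => 'Munich', to => 'Rome' ]);
print $res->content if $res->is_success;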
Cases where a lot of user input is required will cause complications.
In the worst case, your spider will continue to send incomplete
data, and the Web server will continue to answer with a page, which
is exactly what the spider expects, so the two can go back and forth
indefinitely. Preventing this requires more intelligent stop clauses.
A discussion of these features is beyond the scope of this article.
Another inconvenience is the overly generous server. Have you ever
seen a server that, instead of sending one Web page, sends two or
more? Such situations will also confuse the scripts described above,
which are not designed to handle them.
It is not enough to have the start URL and the stop conditions.
You should also have an understanding of what type of pages your
spider will encounter. More importantly, you must have a clear
understanding of what your spider is used for and which pages it
had better leave alone.
Conclusion
This article explained how to write a robot to automatically get
Web pages from other sites onto your computer. It showed how to
get just one page or mirror a whole site. This type of robot is
not only useful for downloading or mirroring remote Web sites, but
is also handy as the front end of a search engine or as a link-control
system. Instead of copying the pages to your local site, you simply
check whether the referenced pages or images exist.
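For the link-control case, a HEAD request is usually enough, since we only need the status code and not the document itself. A small sketch (the URLs are illustrative; normally they would come from the parser):

use strict;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new();

# check each collected URL without downloading the document body
for my $URL ('http://www.example.com/', 'http://www.example.com/missing.html') {
    my $res = $ua->request(HTTP::Request->new('HEAD', $URL));
    print $res->is_success ? 'ok      ' : 'BROKEN  ', $URL, "\n";
}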
It is good practice to respect the robot rules and be a fair player
on the Internet. Avoid monopolizing or blocking Web sites with too
many or too frequent requests.
This article described one possible approach and is only an example
of how you might proceed in developing a stable application
for your needs. If there's already an existing application,
you may be able to use or extend it. But in any case, you can learn
from it.
Reinhard Voglmaier studied physics at the University of Munich
in Germany and graduated from Max Planck Institute for Astrophysics
and Extraterrestrial Physics in Munich. After working in the IT
department at the German University of the Army in the field of
computer architecture, he was employed as a Specialist for Automation
at Honeywell and then as a UNIX Systems Specialist for performance
questions in database/network installations at Siemens Nixdorf.
Currently, he is the Internet and Intranet Manager at GlaxoWellcome,
Italy. He can be reached at: rv33100@GlaxoWellcome.co.uk.