The Art
of Spidering
Reinhard Erich Voglmaier
Every Webmaster will encounter a robot sooner or later. He or
she will typically find proof of the activities of "strange
browsers" in logfiles. So, what is a robot? A robot (also known
as a spider) is a procedure that visits Web sites with the objective
of gathering information. The robot does not limit itself to getting
the information from just one Web page, but also tries to get the
links mentioned in this page. If it kept going in this way by following
all the links, it would eventually spider the whole Internet. This
means the robot needs limits, defined in its configuration file, that
tell it where to stop; I'll discuss these later in the article.
Sometimes the activities of robots are welcome inasmuch as they
provide important information about the content of the spidered
site to possible users. Sometimes, however, the visits are not wanted,
especially when they begin to consume a lot of bandwidth, penalizing
the traffic for which the Web site was originally intended. For
this reason, there are so-called robot rules that "well-behaving"
robots obey. This good behavior is called "netiquette"
and must be programmed into the robot or spider.
In this article, I will explain how to write a spider that completely
mirrors a Web site. Note that I use the words spider and robot
interchangeably, as they are used in the literature. The first
script described in this article copies simple Web pages from a
remote site to a local site. I will also integrate a parser that
extracts the hyperlinks contained in the copied Web pages and show
how to complete this approach with a stack object that handles the
download of the documents needed. I will also look at what software
exists, because we do not want to reinvent the wheel. These programs
are all available from the Perl sites (http://www.perl.org
and http://www.perl.com).
All the examples in this article are written in Perl, which is standard
on UNIX operating systems but also available for VMS, Win32 architectures,
and others. The auxiliary libraries are available from CPAN (http://www.cpan.org/).
To locate them, I highly recommend using the search engine located
on the server of the University of Winnipeg:
http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html
Copying Single Pages
To copy a single Web page, we must set up a couple of things.
We first arrange for a connection to the Web server on which the
page we want to copy lives. This may also involve contacting a proxy
server, authenticating ourselves against the proxy server, and setting
up some parameters for the connection. Fortunately, there is a library
(LWP) available on CPAN that does what we need. The script in Listing
1 copies a page from a remote Web server to the local file system.
(Listings for this article can be found on the Sys Admin Web site
at: http://www.sysadminmag.com.)
The first lines of Listing 1 set up the remote page and the location
where the page should go. Line 7 sets up the most important data
structure, called the UserAgent. It holds all information about
the connection. For now, we'll use the simplest form. The request
is very easy to write; we want to "GET the Remote Page",
so the request reads:
HTTP::Request->new('GET', $RemotePage)
Once we've defined a request object, we can fetch the page with
the method request(). This method simply copies the RemotePage
to the LocalPage.
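As an illustration, a minimal script in the spirit of Listing 1 might look like the following sketch (the URL and the local file name are placeholders, not the values used in the listing):

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

my $RemotePage = 'http://www.example.com/index.html';   # page to copy (placeholder)
my $LocalPage  = '/tmp/index.html';                      # where to store it (placeholder)

my $ua  = LWP::UserAgent->new();                         # the UserAgent object
my $req = HTTP::Request->new('GET', $RemotePage);        # the GET request

# request() with a file name as second argument writes the content to that file
$ua->request($req, $LocalPage);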
Copying Single Pages Realistically
The previous situation was easy but not very realistic, because
there was no firewall. We got the page quickly, and no user credentials
were required. Listing 2 shows how to extend the first approach.
In lines 15-18, we set up the name of the proxy server our spider
should use and the domains that do not need proxying. We define
the name of our browser to let the spidered site know what's
happening. Finally, we set up the time-out after which we will abort
the script.
When we get the page, we check the exit status of the request;
for this, we use the HTTP::Status library. We don't copy the page
to the local file system, but keep it in memory for later use.
Along with the status code (I use the code() method), we also get
a human-readable explanation of success or failure using the
message() method. These are not the only options
the UserAgent understands, however. For a complete list, look at
the man pages delivered with the UserAgent library.
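As a rough sketch of these additions (the proxy name, domain, browser name, and time-out below are invented placeholders, not the values of Listing 2):

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Status;

my $RemotePage = 'http://www.example.com/';                       # placeholder URL
my $ua = LWP::UserAgent->new();
$ua->proxy(['http', 'ftp'], 'http://proxy.ourdomain.com:8080/');  # proxy to use
$ua->no_proxy('ourdomain.com');                                   # domains that need no proxy
$ua->agent('OurSpider/1.0');                                      # browser name we report
$ua->timeout(30);                                                 # abort after 30 seconds

my $res = $ua->request(HTTP::Request->new('GET', $RemotePage));
if ($res->is_success) {
    my $HTMLPage = $res->content;                  # keep the page in memory
} else {
    print 'Failed: ', $res->code, ' ', $res->message, "\n";
}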
Parsing the Web Pages
In many cases, what we've already done is enough. On my own
Web sites, I have a number of pages that contain information mirrored
from other Web sites. But because we want to get not only a single
Web page, but a whole site, the spider must follow the links contained
in the downloaded pages. We need an HTML parser, and we have at
least two options. The first is to use Perl's built-in regular
expression support. Because HTML has a well-defined
syntax to describe references (essentially, links to other documents,
images, or similar objects included in Web pages), this is not very
difficult. The second option is to use the Link Extractor module
(CPAN, as I mentioned before).
I will show the method of coding the parse process by hand. Remember,
the goal is to mirror the Web site on our local file system, so
the user clicking on a link on our local server also expects
the documents referenced in this page (images, for example) to be
on the same server. Thus, the parser must transform the link to
work on the local system as well. In Perl, this looks like:
$HTMLPage =~ s/RegularExpression/TransformURL()/eig;
The switch "e" means that we want to substitute the
regular expression with the return value of the function. The "i"
switch ignores case; "g" expresses that we want to
have the command executed not just for the first regular expression,
but for all of them found in the document.
In this example, we are interested in Links and Images, so we
will scan for expressions like:
<a href="./to_another_document.html" > Click here </a>
<img src="./pictures/One.jpg" .... ...... ..... >
The easiest form of the regular expression to match these links is:
s#<(a href=)"([^"]+)"#TransformURL($1,$2)#eig
s#<(img src=)"([^"]+)"#TransformURL($1,$2)#eig
(If you need a short introduction on regular expressions, I recommend
one of the Perl books or sites such as http://www.perl.com/pub/p/Mastering_Regular_Expressions.)
These two examples will not find all links. For example, if there
is a space between "href" and "=",
it will not be found, so you will need to put in something like
this:
\s*
which means zero or more spaces. You will need more patterns to match
backgrounds, images, sounds, and so on. I recommend beginning with this
simple structure and then adding more patterns so that even links with
sloppy syntax are caught. In the listings available from www.sysadminmag.com,
you will find what I'm using on my site.
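To give an idea of how such a relaxed pattern might look, here is a small, self-contained sketch; TransformURL() below is only a stub standing in for the real function provided in the listings:

use strict;

my $HTMLPage = '<a href = "./to_another_document.html">Click here</a>';

# \s* allows optional whitespace around the tag name and the "=",
# and href and src are handled by one alternation
$HTMLPage =~ s#<\s*(a\s+href|img\s+src)\s*=\s*"([^"]+)"#TransformURL($1, $2)#eig;

sub TransformURL {
    my ($Tag, $URL) = @_;
    # the real function would remember $URL for downloading and rewrite
    # it to point to the local copy; this stub leaves it unchanged
    return "<$Tag=\"$URL\"";
}

print $HTMLPage, "\n";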
Memory Structures
With the previously explained parsing mechanism, we can now get
all the pages, images, or other objects referenced in the downloaded
pages. If the spidered Web site does not reference external sites,
we get the whole Web site. But even if this condition is true, there
are still other problems. What if several pages are referenced more
than once or if two pages reference each other? We would then simply
end up in a loop.
This means that we need memory structures to keep track of which
pages to visit and which pages have already been downloaded. There
are four arrays holding the data for the download decisions:
- IncludedURLs
- ExcludedURLs
- VisitedURLs
- ToVisitURLs
The first two arrays define the name space in which the spider works.
IncludedURLs contains the list of URLs we want to get, and ExcludedURLs
lists the pages we don't want to be considered. The other two arrays
contain housekeeping information: the URLs that have already been
downloaded (VisitedURLs) and the URLs our spider still needs to
download (ToVisitURLs). The last array is a stack filled by the TransformURL
function. When filling the array, the TransformURL function consults
the other three arrays, as well. See Listing 3.
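A skeleton of this bookkeeping might look like the following sketch; the array names are the ones above, while RememberURL() is an invented helper that only illustrates the kind of check TransformURL has to perform:

use strict;

my @IncludedURLs = ('http://www.example.com/');   # name space we want to stay in
my @ExcludedURLs = ();                            # URLs we never want to fetch
my @VisitedURLs  = ();                            # pages already downloaded
my @ToVisitURLs  = ('http://www.example.com/');   # stack of pages still to fetch

# called for every URL the parser finds in a downloaded page
sub RememberURL {
    my ($URL) = @_;
    return if grep { $_ eq $URL } @VisitedURLs;                  # already done
    return if grep { $_ eq $URL } @ToVisitURLs;                  # already queued
    return if grep { index($URL, $_) == 0 } @ExcludedURLs;       # explicitly excluded
    return unless grep { index($URL, $_) == 0 } @IncludedURLs;   # outside the name space
    push @ToVisitURLs, $URL;                                     # schedule for download
}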
The Request Loop and Helpers
We can now open an HTML page, examine the pages and images to
which it is pointing, and retrieve these pages and images. The dynamic
part of the spider and the request loop are still missing. Simply
written, it looks like this:
while ( <condition> ) { getPage(); }
The condition indicates whether we need to get pages from the site.
Remember that we put all the pages yet to be visited in an array.
The array is initialized with the first page we want to spider. Every
referenced page is put into the array and, once visited, erased from
it again. You can write a procedure that reports whether the array is
empty, which is also handy for controlling other conditions. The
procedure could, for example, dump the memory structures if the user
wants to stop the spider and let it resume its work later.
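Slightly less schematically, the loop could be written as follows; getPage() stands for the download-and-parse routine built in the previous sections, the array names follow those above, and PagesToVisit() is an invented name for the condition procedure just described:

# keep fetching as long as the stack of pages to visit is not empty
while (PagesToVisit()) {
    my $URL = pop @ToVisitURLs;     # take the next page from the stack
    getPage($URL);                  # download it, parse it, and refill the stack
    push @VisitedURLs, $URL;        # remember that we have been here
}

# true while there is still work to do; a natural place to add other
# stop conditions or to dump the memory structures for a later restart
sub PagesToVisit {
    return scalar @ToVisitURLs;
}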
We also need to continuously transform URLs between their relative
and absolute forms. If you put the pages on a local Web server,
you must additionally map the physical paths under which the downloaded
pages are stored to the logical view of these pages as it appears
in the links. For example, if you have a link such as:
href="webmaster/Java/Intro.html"
this may be stored on your file system as:
/disk1/htdocs/webmaster/Java/Intro.html
For this purpose, I have provided the helper functions in the listings.
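In simplified form, such a helper could look like the following sketch; the document root /disk1/htdocs is of course only an example, and LinkToPath() is an invented name rather than one of the functions in the listings:

use strict;

my $DocumentRoot = '/disk1/htdocs';      # local document root (example value)

# map the logical link to the physical path where the page is stored
sub LinkToPath {
    my ($Link) = @_;
    $Link =~ s#^\./##;                   # drop a leading "./"
    return "$DocumentRoot/$Link";
}

print LinkToPath('webmaster/Java/Intro.html'), "\n";
# prints /disk1/htdocs/webmaster/Java/Intro.html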
Just Existing Software
It is always worthwhile to look for ready-to-use software. The
first tool I recommend is LWP::RobotUA. It offers the same interface
as the UserAgent library used before, but additionally provides methods
to obey the robot rules, letting you see whether robots are welcome
at all (it consults the site's robots.txt file). See:
http://info.webcrawler.com/mak/projects/robots/robots.html
for more information on robot rules.
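Using LWP::RobotUA looks much like using the plain UserAgent; here is a minimal sketch (the robot name and contact address are placeholders):

use strict;
use LWP::RobotUA;
use HTTP::Request;

# the constructor wants a name for the robot and a contact address
my $ua = LWP::RobotUA->new('OurSpider/1.0', 'webmaster@ourdomain.com');
$ua->delay(1);                           # wait at least one minute between requests

# robots.txt is fetched and obeyed before the real request is sent
my $res = $ua->request(HTTP::Request->new('GET', 'http://www.example.com/'));
print $res->code, ' ', $res->message, "\n";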
Another application is w3mir, powerful Web mirroring software
available at:
http://langfeldt.net/w3mir/
It can use a configuration file and may be useful for your needs.
Furthermore, it is well documented. The following example, taken from
the w3mir documentation, retrieves the contents of the Sega site through
a proxy, pausing for 30 seconds between each document it copies:
w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/
Another option is Webmirror, available at:
http://www.math.fu-berlin.de/~leitner/perl/WebMirror-1.0.tar.gz
This program has a good man page and is also easy to use. It can read
a configuration file in which you specify the more complicated options.
Although such software already exists, we often want to achieve
"special effects", or we may have special needs, such as using our
own script as the front end of a search engine, that make writing
our own spider the better choice.
Limits
I have described a powerful tool capable of copying an entire
Web site to your local system given just the initial URL and some
rules about where to stop. Clearly, there is other information you
may need, such as the configuration of a proxy server. Also keep
in mind that real life is never so easy. Let's look at some
examples.
There are a lot of dynamic Web sites around -- I don't
mean sites with a lot of images moving and blinking; I mean sites
that are constructed depending on the user input, such as forms.
Forms offer many possibilities to send user choices to the Web server.
To deal with these cases, you'll need to build more intelligence
into your spider. The LWP libraries offer the ability to simulate
the click on the Submit button of a form, but you would need to
simulate every choice the user has in order to get the complete picture.
Consider the timetable of a railway company or of "Lufthansa", for
example, to get an idea of what this could mean. How far you go depends
heavily on what you expect from your spider; it may sometimes be
convenient to exclude these pages. To exclude them, just do nothing;
since there is no plain link behind the form, your spider will not
follow it.
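If you do decide to follow a particular form, a simulated submit can be sketched roughly as follows (the form URL and the field names are invented for the example):

use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new();

# pretend the user filled in the two (invented) fields and pressed Submit
my $res = $ua->request(POST 'http://www.example.com/cgi-bin/timetable',
                       [ from => 'Munich', to => 'Rome' ]);
print $res->content if $res->is_success;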
Cases where a lot of user input is required will cause complications.
In the worst case, your spider will continue to send incomplete
data, and the Web server will continue to answer with a page, which
is exactly what the spider expects, so the two can go back and forth
indefinitely. Preventing this requires more intelligent stop clauses.
A discussion of these features is beyond the scope of this article.
Another inconvenience is the overly generous server. Have you ever
seen a server that, instead of sending one Web page, sends two or
more? Such situations will also confuse the scripts described above,
which are not designed to handle them.
It is not enough to have the start URL and the stop conditions.
You should also have an understanding of what type of pages your
spider will encounter. More importantly, you must have a clear
understanding of what your spider is used for and which pages it
had better leave alone.
Conclusion
This article explained how to write a robot to automatically get
Web pages from other sites onto your computer. It showed how to
get just one page or mirror a whole site. This type of robot is
not only useful for downloading or mirroring remote Web sites, but
is also handy as the front end of a search engine or as a link-control
system. Instead of copying the pages to your local site, you simply
check whether the referenced pages or images exist.
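For the link-control case, a HEAD request is usually enough, since we only need the status code and not the document itself. A small sketch (the URLs are illustrative; normally they would come from the parser):

use strict;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new();

# check each collected URL without downloading the document body
for my $URL ('http://www.example.com/', 'http://www.example.com/missing.html') {
    my $res = $ua->request(HTTP::Request->new('HEAD', $URL));
    print $res->is_success ? 'ok      ' : 'BROKEN  ', $URL, "\n";
}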
It is good practice to respect the robot rules and be a fair player
on the Internet. Avoid monopolizing or blocking Web sites with too
many or too frequent requests.
This article described one possible approach and is only an example
of how you might proceed in developing a stable application
for your needs. If there's already an existing application,
you may be able to use or extend it. But in any case, you can learn
from it.
Reinhard Voglmaier studied physics at the University of Munich
in Germany and graduated from Max Planck Institute for Astrophysics
and Extraterrestrial Physics in Munich. After working in the IT
department at the German University of the Army in the field of
computer architecture, he was employed as a Specialist for Automation
at Honeywell and then as a UNIX Systems Specialist for performance
questions in database/network installations at Siemens Nixdorf.
Currently, he is the Internet and Intranet Manager at GlaxoWellcome,
Italy. He can be reached at: rv33100@GlaxoWellcome.co.uk.