Devouring Spam with Spamivore
Charles V. Leonard
How many of us are bombarded each day with undesirable junk email,
commonly known as spam? How do you purge your email account of these
unsolicited messages? This article presents an application called Spamivore,
which attacks spam at its source by disabling the harvesting agents that
steal email addresses.
Spamivore consists of a 365-line Perl CGI (Common Gateway Interface)
script and a 75-line supplemental HTML template used to execute the script.
The development platform was Sun Solaris 2.x (System V, Release 4.0),
but the script has been ported to Windows 95/98/NT. The Web server
used during development was Apache 1.3.9, and the browsers used for
testing were Netscape 4.0 and Internet Explorer 5.0.
Spambots vs. Robots
There are many ways spam fills your mailbox. A common method involves
spammers stealing email addresses off your Web site by using a spambot.
A spambot is a specific form of a broader class of Internet applications
known as robots. Robots are software applications running unattended
on the Internet collecting and cataloging information about Web
sites. Most robots are legitimately helpful applications. A common
robot, for example, is a Web crawler. Search engines such as Yahoo!,
Altavista, Excite, Lycos, and Google use Web crawlers to update
their databases. Web crawlers put your Web site on the Internet
map.
Spam, Spam, Spam
Unlike Web crawlers, spambots are malicious entities; the email
addresses spambots collect are used to send mostly unwanted material.
How do spammers obtain email addresses? According to Michael A.
Banks in "Beating the E-Mail Spammers", doing any of the
following will potentially place you on a list:
- Post on an online service's or an Internet bulletin board
- Post in a Usenet newsgroup
- Spend time in chat rooms on an online service
- Have a listing in an online service's member directory
Perhaps a few spammers collect email addresses manually by going
from site to site, but most automate the task with spambots. Spamivore
defends Web sites against these uninvited, intruding spambots. (See
the sidebar for more on spambots.)
Spamivore Beginnings
I became more than slightly annoyed by all the spam in the world
(especially in my POP3 account), so I decided to research what could
be done to stop it. One application referenced at almost every anti-spam
site I visited was Webpoison (or Wpoison). I read about Webpoison and
thought the concept was great.
I decided to create my own spambot trap when I learned I would
need to modify the httpd.conf file if I wanted to disguise
the CGI references as regular links. Because I manage two sites
located on Web hosting services, where I do not have the administration
rights to make these modifications, I was left to contemplate an
alternative plan.
I came up with the idea of using Server Side Includes (SSI) as
an alternative way to cloak the spambot trap, which would not require
httpd administration rights. I chose the name Spamivore after viewing
a local news segment about Carnivore, the FBI's own potentially
intrusive application. (Ironically, Carnivore, like spambots, has
an interest in email.)
Spamivore Design Overview
The basic design for Spamivore is as follows:
1. Place a hidden link in the main HTML page, the page that links
to critical data (email addresses) I do not want spambots to access.
2. The hidden link will not be seen by anyone except uninvited
robots.
3. Following the hidden link presents a page containing a Server
Side Include.
4. The SSI references a CGI script that randomly generates a page
full of email addresses, creates another SSI Web page, and creates
a corresponding link to this Web page for the spambot to follow.
5. The newly created SSI Web page contains another reference to
the CGI script. This CGI reference is never actually seen by the
spambot because the http daemon has already replaced it with the
content generated by the CGI before it reaches the spambot.
6. The CGI script contains a mechanism to maintain the physical
.shtml (SSI) documents that are created.
7. After the spambot follows the third dynamic link, the CGI script
deletes from the system all the .shtml documents created
by the spambot (through our CGI script), except for the current
one it is viewing and the next one the spambot wants to link to.
8. At this point, there is the possibility that the spambot is
trapped and unable to go backwards to intrude on other areas of
the Web site. If the spambot's history caching is not very
sophisticated (i.e., if it requires the Web page to still exist on
the Web site), the spambot will not be able to go back to the previous
page because that page no longer exists (it was deleted in step 7).
9. To keep good robots from falling into our trap, add an entry
to robots.txt that informs them to stay away from the areas
where traps are planted.
10. As an added safety measure for normal users who accidentally
fall into our trap, add a second link below the .shtml link
that allows the user to go back to the main homepage.
Code Development
The Main Home Page
First, create the main HTML page for the Web site to be protected
against spambots. Since I use SSI in this page, the file ends
with an .shtml extension (the extension typically used for SSI pages).
The example site is called "Contact Busters of America".
As its name implies, Contact Busters is a fictitious group dedicated
to the abolition of intrusive interruptions in life such as junk
email, junk mail, telephone solicitation, and other spammish enterprises.
In Listing 1 (the index.shtml for our fictitious Web site),
I use SSI to display today's date with the statement <!--#echo
var="DATE_LOCAL" -->. If SSI is enabled, the server replaces
this statement with today's date before delivering the page
to the client Web browser.
The anchor <a href="sa/saelist1.shtml"> is the link
that begins the spambot's journey through the spam trap. Note
the <FONT>...</FONT> tags around the word
"Contacts". These tags hide the link from visual
users: since the background in this area is black, changing the
link font to black makes the link invisible to normal users,
but not to spambots. The COLOR="#000000" attribute changes
the normal link color to black regardless of whether the user has
visited this page before (at least for Netscape 4.0 and
Microsoft Internet Explorer 5.0).
Although a spambot normally attacks the homepage of a Web site,
it is not necessarily the first or only page the spambot visits.
Any page containing email addresses should have a hidden link as
the first link of the page. Spambots, like most robots, will simply
follow the first link they come to unless otherwise restricted.
(I will discuss this later in the robots.txt section.)
The Initial Spambot Trap Page
As I previously discussed, the Contact Busters main page has a
hidden link to sa/saelist1.shtml. The page sa/saelist1.shtml
introduces the spambot to Spamivore. The HTML code for saelist1.shtml
is described in Listing 2. The statement <!--#exec cgi="/cgi-bin/Contacts.cgi"
--> instructs the Web server to execute the script Contacts.cgi
and replace the statement with any output generated by Contacts.cgi.
Contacts.cgi
When the initial saelist1.shtml is visited, Contacts.cgi
executes and displays a page full of email addresses. This is typical
of spambot traps, and Spamivore is not any different. This, however,
is not all Contacts.cgi does. Because I want to hide that
Contacts.cgi is a CGI script, I create physical-looking links
to any new page that is dynamically generated; this includes creating
links with the actual file extension .shtml.
To do this, physical .shtml Web pages must be created,
stored, and maintained on the Web site. When looking at the initial
spambot trap page, saelist1.shtml, it is easy to overlook
that the Contacts.cgi script does not dynamically create the
entire content of the page you are viewing. Instead, it uses
a host page that already physically exists on the server and generates
content for only a portion of that existing page.
Maintaining physical Web pages created dynamically by a CGI script
presents a whole new problem. After making the choice to create
these pages, when do you delete them? This was one of the primary
issues when developing the script.
Contacts.cgi Code Review
The Main Function
Contacts.cgi, Listing 3, first informs the Web server what
type of content it will generate. The statement print "Content-type:
text/html\n\n"; declares that the output is HTML text.
The next two lines are debug statements that allow
examination of the argument passed to the script and the IP address
of the remote client. The function clDebug() prints whatever
is passed to it whenever the global debug variable $bDebug
is set.
All debug information is enclosed in HTML comments so the client
application ignores them. Typically, most graphical browsers will
not display comments, unless the "Page Source" option
is selected. Because spambots have access to these comments, turn
them off when Spamivore is installed on your site.
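As a hypothetical sketch (the real function is in Listing 3, which is not reproduced here), a clDebug-style helper behaves roughly like this:

# Sketch of a clDebug-style helper: it prints its argument inside an HTML
# comment, and only when the global $bDebug flag is set.
$bDebug = 1;                              # set to 0 for a production install

sub clDebug {
    my ($sMsg) = @_;
    print "<!-- DEBUG: $sMsg -->\n" if $bDebug;
}

&clDebug("remote client: $ENV{'REMOTE_ADDR'}");   # example call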
The next section of code, beginning with if (@ARGV), checks
whether an argument has been passed to the script. A defined argument
sets the Run Count, which is later used to control document purging.
Document purging removes .shtml files previously generated
and no longer in use by the spambot.
After determining the Run Count, call function clGenMailAddrs().
As the name implies, it generates the random email addresses returned
to the spambot. After generating a page full of email addresses,
call the clCrNextTrapPage() function.
The clCrNextTrapPage() function creates the next .shtml
file I want the spambot to link to. After the new .shtml
is created and saved to a public area on the Web site, create a
link to the new .shtml page by calling the next two print
statements (the paragraph tag of line 29, and the anchor tag of
line 30).
Finally, call the clAddDelFile() function. This function
adds the new file I have just created to the Remove List. The list
is a file stored on the Web site to keep track of the random .shtml
files created so they can be removed at a later time. The function
clAddDelFile() also determines when documents are removed.
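Listing 3 contains the full main routine. As a condensed, hypothetical sketch of just the flow described above (the four functions are covered in the sections that follow, and the argument handling is an approximation), the top level looks roughly like this:

#!/usr/bin/perl
# Condensed sketch of the main flow of Contacts.cgi as described in the text.
# The cl* functions are discussed in the following sections.

print "Content-type: text/html\n\n";      # output will be HTML text

&clDebug("args: @ARGV");                  # debug output, wrapped in comments
&clDebug("client: $ENV{'REMOTE_ADDR'}");

$nRunCount = 0;
if (@ARGV) {                              # called from a generated trap page
    $nRunCount = $ARGV[0] + 1;            # bump the Run Count used for purging
}

&clGenMailAddrs();                        # emit a page of random addresses
&clCrNextTrapPage();                      # write the next .shtml trap page

print "<P>";                              # link the spambot to the new page
print "<a href=\"$sHtmlName\">Rest for the Weary!</a>\n";

&clAddDelFile();                          # track the page; purge when safe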
Function clGenMailAddrs
This function is responsible for generating the random banquet
of email addresses for the spambot. We have a finite set of nine
arbitrary email account name prefixes and a set of eight fictitious
email provider domain names. The nine email account name prefixes
are stored in array @elist1, and the eight fictitious domain
names are stored in array @elist2. Both are global arrays
defined at the beginning of the program.
The function first seeds the random number generator. Because
no argument is passed to random number generator function srand,
Perl automatically uses the current system time as the seed.
Next, enter the Until loop that generates 50 randomly created
email addresses. Fifty addresses were chosen because they fit in
a browser display, and a second link (below the email listings)
can be viewed by a typical user without scrolling.
This second link allows a human viewer to escape from our spambot
trap. Fifty is also a reasonable number of email addresses to put
on a page without tipping off the spambot that the page is a set-up,
though I have no hard statistics to support this claim.
The "Until" loop is the basis of our random generator.
Each iteration of the loop will creates a new random email address
and corresponding "mailto" HTML anchor. The first line
within the loop is a counter to keep track of how many phony email
addresses are being created. The next line creates a random number
with a limit of five digits. The variable $iNumExt is the
second part of the name used to create the email account. I will
also use $iNumExt later in function clCrNextTrapPage
as the second part of the .shtml file name.
The statement $iName=((rand) * 10) % 9 creates the index
value for the first part of our email account name. The statement
$iHost=((rand) * 10) % 8 creates the index value to the name
used for the phony domain name of our email account. The print
statement that follows the prior three random value assignments
creates the actual mailto: account name.
The variable $iName is used to index the array @elist1
(the first part of our email account name), $iHost is used
to index the array @elist2 (the phony domain name of our
account), and $iNumExt is used as the second part of our
email account's name. For example, if $iNumExt were 1215,
$iName were 2, and $iHost were 7, the print
statement would generate the following line of HTML code:
<a href="mailto:sandymims1215@ibm0.com">sandymims1215@ibm0.com</a>
Note that the print statements have backslashes preceding each reference
to the "@" symbol. In Perl, the "@"
symbol is a special character used to reference and define array variables.
The backslash escapes the symbol so that it is treated literally
in string values.
Listings 4 and 5 show the code used to define the global
email arrays and the clGenMailAddrs function.
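Since those listings are not reprinted here, the following sketch approximates the generator as described above. The array contents are placeholders only; the real script defines its own nine prefixes and eight fictitious domains.

# Sketch of clGenMailAddrs as described in the text.  The array values are
# placeholders, not the author's actual lists.
@elist1 = map { "name$_" } 0 .. 8;          # nine account-name prefixes
@elist2 = map { "host$_.com" } 0 .. 7;      # eight fictitious domains

sub clGenMailAddrs {
    srand;                                  # seed from the current time
    my $nCount = 0;
    until ($nCount >= 50) {                 # fifty addresses fill one page
        $nCount++;
        $iNumExt = int(rand(100000));       # up to five digits; reused later
                                            # for the .shtml file name
        my $iName = ((rand) * 10) % 9;      # index into @elist1
        my $iHost = ((rand) * 10) % 8;      # index into @elist2
        print "<a href=\"mailto:$elist1[$iName]$iNumExt\@$elist2[$iHost]\">",
              "$elist1[$iName]$iNumExt\@$elist2[$iHost]</a><br>\n";
    }
}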
Function clCrNextTrapPage
The clCrNextTrapPage() function creates the next .shtml
page to which the spambot will link. The function creates a page
similar to the first initial .shtml page (saelist1.shtml).
Recall that saelist1.shtml is the page to which the spambot
first links from the hidden link in index.shtml. Listing
6 references the code used to define clCrNextTrapPage.
The first statement, $sHtmlName="$sEmRootName$iNumExt$sEmExt"
creates the name for the new .shtml page. The global variables
($sEmRootName, $iNumExt, and $sEmExt) are defined
at the top of the Contacts.cgi program's GLOBAL VARIABLES
section.
The next line assigns the variable $sHtmlFullName with
the variable $sWrkHtDocs, a slash, and the variable $sHtmlName.
The variable $sHtmlFullName defines the full path name relative
to the CGI directory of the .shtml being created. Variable
$sWrkHtDocs defines the directory where the modifiable HTML
documents are stored and accessed by the uninvited spambot.
The example source code directory ../www/sa is where I
store and view the HTML documents. The variable $sEmRootName
contains the first part of the file name being created. In
the example, saelist1_ will be the first part
of the document name. The variable $sEmRootName is configurable
and can be assigned another value if desired.
The second variable of the $sHtmlName assignment is $iNumExt,
first seen in function clGenMailAddrs. $iNumExt contains
the last random value generated for the second part
of the email account name. I chose to reuse the last email address
number simply to avoid another call to the function "rand".
The last variable in the assignment is $sEmExt. This is a
global variable, set to .shtml, assigned at the top of the
program.
The directory where pages are created must have read, write,
and execute privileges set for the world. Otherwise, the script
will not be able to create the page and the program dies.
If the open statement fails, call the function die
and the program terminates, leaving the spambot to digest a strange
message but ending any further random email generation. If the open
statement succeeds, allow generation of the content for the new
file and save the information to disk.
Another item of interest is the #include virtual
Server Side Include reference. This reference executes the script
Contacts.cgi just as the #exec cgi SSI reference did
in the initial saelist1.shtml page. The difference is that
#include virtual allows arguments to be passed to the executing
script, whereas #exec cgi does not.
Arguments are passed using the standard CGI convention
(e.g., ScriptName?arg1+arg2+arg3, etc.). In this instance,
only one parameter, $nRunCount, is passed to the Spamivore
CGI.
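Listing 6 contains the real function; the following sketch, using the example directory and file-name settings, shows the shape of what it does. The markup written to the new page is illustrative only, and $iNumExt and $nRunCount are assumed to be set elsewhere in the script.

# Sketch of clCrNextTrapPage as described in the text.  The globals mirror
# the example configuration.
$sWrkHtDocs  = "../www/sa";
$sEmRootName = "saelist1_";
$sEmExt      = ".shtml";

sub clCrNextTrapPage {
    $sHtmlName     = "$sEmRootName$iNumExt$sEmExt";
    $sHtmlFullName = "$sWrkHtDocs/$sHtmlName";

    open(TRAP, ">$sHtmlFullName")
        or die "Cannot create $sHtmlFullName: $!";

    # The new page is another SSI host page.  #include virtual (unlike
    # #exec cgi) lets the run count be passed back to the script.  The
    # markup around the include is illustrative only.
    print TRAP <<"END_PAGE";
<HTML><BODY BGCOLOR="#000000" TEXT="#FFFFFF">
<!--#include virtual="/cgi-bin/Contacts.cgi?$nRunCount" -->
</BODY></HTML>
END_PAGE

    close(TRAP);
}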
Creating the Next Link
The following lines:
print "<P>";
print "<a href=\"$sHtmlName\">Rest for the Weary!</a>\n";
create the link to the new page just stored to disk.
The variable assigned in clCrNextTrapPage, $sHtmlName,
is again used to name the link. Since the link is
relative to the directory from which the browser is reading, use
$sHtmlName instead of $sHtmlFullName. This output
is displayed to the spambot as part of the page it is currently
viewing.
Functions clAddDelFile and clRmFiles
Now that the next page the spambot can link to has been created,
the question arises, "When do I delete it?" Because of
the slightly awkward way in which physical pages are generated for
the spambot to devour (rather than relying exclusively on generating
pages "on the fly", as is normally done in CGI programs),
I have introduced a new problem of document management. The clAddDelFile
function attempts to address this issue.
Here are three issues of concern:
- I do not want to let the spambot generate pages that will eventually
fill up the disk.
- I do not want to delete the page the spambot is currently viewing.
- I do not want to delete the next page the spambot will link
to.
The approach I chose was to create a disk file, named rmlist.txt
(the "Remove List"), listing the .shtml pages the
spambot created. Later, when I'm sure the spambot is finished
accessing the files, the program removes them. At this point, I
still haven't addressed when to perform the actual delete.
There were two approaches I could have used to attack this problem:
- Delete the documents using a second process such as a background
process
- Delete the pages while the spambot is still in the trap.
I opted to devise a way to delete the pages while the spambot
is still in the trap. This is where I make use of the "Run
Count." The "Run Count" tells us when it is safe
to start the document purge process.
The first time the spambot enters the trap, the run count, $nRunCount,
is zero: the CGI executes with no arguments, so the run count is
not incremented. The next time the script executes, it does so because
the spambot followed the link to the newly created page, whose SSI
reference passes $nRunCount as an argument. In this instance
of the script, detect the argument and increment the Run Count by 1;
the run count is now 1. The third time the script executes, the Run
Count is 2.
When the run count is two, it's probably safe to purge the
documents the spambot created. By this time, I have already created
one file ready to be deleted. However, we must not delete the document
the spambot is currently viewing or the next one it is about to
link to.
The clAddDelFile and clRmFiles functions implement
the requirements just described. The code for clAddDelFile
(Listing 7) assigns $sHtmlName to the $sDelFile
variable. Next, create the delete path variable $sDelPath,
composed of $sDelHtPath and $sHtmlName.
The variable $sDelHtPath has the same value as the variable
$sWrkHtDocs except that it contains a slash at the end of
its definition. Because multiple platforms are maintained,
$sDelHtPath is needed to handle the directory delimiters
for whatever platform Spamivore is installed on (e.g., UNIX uses
"/", while Windows NT uses "\"). This could
have been automated by using the Perl language's s///
operator.
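For example, a substitution along the following lines could flip the delimiters automatically (a sketch only, assuming the path is first built UNIX-style; the file name shown is hypothetical):

# Sketch: normalize directory delimiters with s/// instead of maintaining a
# separate, platform-specific $sDelHtPath value.
my $sWrkHtDocs = "../www/sa";                   # example value from the text
my $sHtmlName  = "saelist1_1215.shtml";         # hypothetical generated name
my $sDelPath   = "$sWrkHtDocs/$sHtmlName";      # built with "/" first
$sDelPath =~ s{/}{\\}g if $^O =~ /MSWin32/i;    # switch to "\" on Windows NT
print "$sDelPath\n";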
After creating the $sDelPath, check the Run Count. If the
count is greater than 2, call the clRmFiles function to start
the purging process. Otherwise, only add the file to the Remove
List file. The global variable $sRmList is used to define
the file name for the Remove List. Whether or not the clRmFiles()
function is called, always add the file to the Remove List.
The calls to clLockFile and clUnlockFile prevent
contention problems if more than one spambot accesses the script
simultaneously. I want only one spambot at a time performing these
file operations. The functions were written with portability in
mind, hence the references to WIN32. The lock routines are crude,
using a disk file to implement the locking mechanism. Lock files
are not as reliable as a semaphore in memory, but they suffice.
See Listing 7 for the source code of the clAddDelFile
function.
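Because Listing 7 is not reprinted here, here is a rough sketch of the sequence just described. $sRmList, $sDelHtPath, and the lock helpers are assumed to be defined elsewhere in the script, and the exact placement of the lock calls is an approximation:

# Rough sketch of clAddDelFile as described in the text.
sub clAddDelFile {
    my $sDelFile = $sHtmlName;                  # the page just created
    my $sDelPath = "$sDelHtPath$sDelFile";      # full path used when deleting

    &clLockFile();                              # one spambot at a time
    &clRmFiles() if $nRunCount > 2;             # purge once it appears safe
    open(RMLIST, ">>$sRmList")                  # always record the new page
        or die "Cannot open $sRmList: $!";
    print RMLIST "$sDelPath\n";
    close(RMLIST);
    &clUnlockFile();
}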
When clRmFiles is invoked, it sets $nRmCount to zero,
opens the Remove List file, and begins deleting the files. However,
I do not want to delete the last file in the list because this is
the one the spambot is currently viewing. It may help to trace through
the algorithm to see why: by the time clRmFiles is entered,
the last file added to the list (in the prior run of Contacts.cgi)
should be the one the spambot is viewing. Also, notice that the link
file just created has not yet been added to the Remove List. Thus,
not removing the last file allows both the page the spambot is viewing
and the next page that was just created to remain on disk (while
all others are deleted).
The foreach loop implements the ideas of the previous two
paragraphs. The $nRmCount (Remove Count) variable keeps track
of the iterations. On the first iteration, the first file read is
not deleted; it is deleted on the second iteration. On the last
iteration of the loop, the file read in the prior iteration is removed,
but the last file read is left on disk. The unlink function
does the actual file deletion. See Listing 8 for the source code
of the clRmFiles function.
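A skeletal version of that loop (Listing 8 has the real code, including the $bRmAll handling discussed next) might look like this:

# Skeletal sketch of clRmFiles as described in the text.  $sRmList and
# $sNewRmList name the current and temporary Remove List files; the $bRmAll
# hook is omitted here.
sub clRmFiles {
    my $nRmCount = 0;
    my $sPrevFile;

    open(RMLIST, "<$sRmList") or return;        # nothing to purge yet
    my @aFiles = <RMLIST>;
    close(RMLIST);
    chomp(@aFiles);

    foreach my $sFile (@aFiles) {
        $nRmCount++;
        unlink($sPrevFile) if $nRmCount > 1;    # delete the file read in the
                                                # previous iteration
        $sPrevFile = $sFile;                    # defer this file's deletion
    }

    # $sPrevFile now holds the last file listed -- the page the spambot is
    # viewing -- so it stays on disk and is carried over to a fresh list.
    open(NEWLIST, ">$sNewRmList") or die "Cannot open $sNewRmList: $!";
    print NEWLIST "$sPrevFile\n" if defined $sPrevFile;
    close(NEWLIST);
    unlink($sRmList);                           # replace the old Remove List
    rename($sNewRmList, $sRmList);
}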
After the foreach loop completes, check whether the $bRmAll
flag has been set. If it has, delete the last file. I have left
this in as a hook for future use, where I may want to delete all
files when the $nRunCount variable is 0. This is a condition
(assuming only one spambot is running the script at a time) where
it is safe to purge all the files listed in our Remove List, including
the last one.
If the $bRmAll flag is not set, write the last, undeleted
file to a new temporary Remove List. Before leaving the clRmFiles
function, delete the old Remove List and rename the temporary file
as the new Remove List. When clRmFiles exits, return to clAddDelFile,
where the new link file created in clCrNextTrapPage is added
to the new Remove List.
The new Remove List now contains two files -- the file not
deleted and the file for the new link just created. In addition
to updating the new Remove List, I also have a feature that logs
the page created for the spambot along with the client's remote
IP address. If logging is turned on, this information is written
to the file named by the variable $sLogFile (which in the
example is set to contacts.log). The logging feature is useful
for keeping track of which (potential) spambots have visited the trap.
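A minimal sketch of that logging step, using a hypothetical $bLogging switch (the article does not name the actual flag), might be:

# Minimal sketch of the logging step.  $bLogging is a hypothetical on/off
# switch; $sLogFile and $sHtmlName are the globals described in the text.
if ($bLogging) {
    open(LOG, ">>$sLogFile") or die "Cannot open $sLogFile: $!";
    print LOG "$ENV{'REMOTE_ADDR'} $sHtmlName\n";   # client IP and trap page
    close(LOG);
}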
Installing and Running Spamivore
Configuration Settings
The primary configuration variables you are most likely to modify are:
- $sWrkHtDocs -- Specifies the public directory
in which the .shtml pages will be created and deleted.
In the example, this is set to the ../www/sa directory.
- $sDelHtPath -- This will be the same value as $sWrkHtDocs
on UNIX systems with an additional "/" added to the
end of the value. Because I maintain Web sites on more than one
platform, I need to specify a different path structure when deleting
files.
- $sLogDir -- This is used to specify the public
logging directory. This directory houses the Remove
List and the contacts log. In the example, this is set to
../public.
Optional Configuration Variables
- $sRmList -- Used to define the name of the Remove
List. In the example, it is set to $sLogDir/rmlist.txt.
- $sNewRmList -- Used to define the name of the temporary
Remove List file created in clRmFiles.
- $sLogFile -- Used to define the name of the log
file. In the example, this is set to $sLogDir/contacts.log.
- $sRmLockFile -- Used to define the name of the
lock file.
- $nLockSleep -- Used to tell the lock function how
long to sleep (in seconds) before checking if the file resource
is available.
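Taken together, a configuration block using the example values described above would look roughly like this. The lock-file and temporary-list names and the sleep value shown here are assumptions, since the article does not give them:

# Approximate configuration block, using the example values from the text.
$sWrkHtDocs  = "../www/sa";               # where trap .shtml pages are written
$sDelHtPath  = "../www/sa/";              # same directory, with trailing "/"
$sLogDir     = "../public";               # public logging directory
$sRmList     = "$sLogDir/rmlist.txt";     # the Remove List
$sNewRmList  = "$sLogDir/rmlist.new";     # temporary Remove List (name assumed)
$sLogFile    = "$sLogDir/contacts.log";   # spambot visit log
$sRmLockFile = "$sLogDir/rmlist.lock";    # lock file (name assumed)
$nLockSleep  = 1;                         # seconds between lock checks (assumed)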
Installation Instructions
- Install the main Web page (index.shtml) in the target
directory to be protected from spambots. In the example, I put
this in the root directory. Note: In the source delivery, the
file is named samain.shtml. This can be renamed as index.shtml
and installed as described above.
- Include a robots.txt file in the root directory of the
Web site. See the section robots.txt below for more information
about what this file is and how it should be used.
- Create the public logging directory with read, write, and execute
access for the world. The logging directory is where the rmlist.txt
and contacts.log files reside. In the example, this is
the "public" directory.
- Create the directory where the Spamivore .shtml files
will be created and destroyed. Again, this directory must have
world read and write privileges. In the example, this was the
www/sa directory. If the Apache Web server is used, this
directory will most likely be htdocs/sa.
- Install the initial saelist1.shtml in the Spamivore
.shtml directory.
- Install Contacts.cgi in the cgi-bin directory.
To run Spamivore:
- Point your browser at the index.shtml just installed.
- Find the hidden link within index.shtml (the link to
saelist1.shtml).
- Click on the hidden link. You should see the same email addresses
and links the spambot sees.
Robots.txt
The robots.txt file informs robots of which files and directories
to avoid on your site. Place the file in the document root directory.
Include in this file references to any file or even whole directories
containing critical email address listings. In my robots.txt
file, for example, I include the pages that contain email addresses
as well as the directory where the Spamivore SSI resides (in the
directory "sa").
Good robots read the robots.txt file and avoid the areas
referenced there, thus saving themselves from the trap.
Bad robots are less likely to observe robots.txt. However,
documentation at the Wpoison Web site indicates that even spambots
are honoring this rule now, which means there is a chance that simply
putting an entry in robots.txt is enough to keep spambots
from sniffing through your email addresses. Including a spambot
trap as a backup precaution protects your site more fully.
Below is an example robots.txt file:
#
# FILE NAME: robots.txt
#
User-agent: *
Disallow: /sa/
Disallow: /mymaillist.html
For more information on robots.txt, see:
http://info.webcrawler.com/mak/projects/robots/exclusion.html
Conclusion
I have presented one version of a spambot trap to help in the
fight against spam. The script may interest Web masters who maintain
Web sites on remote hosting services where they do not have direct
access to the Web server configuration files. It may also interest
Web masters who do not want to maintain non-standard "cgi-bin"
directories (which would be required to hide Wpoison type CGI scripts).
Although I have demonstrated that Server Side Includes offer
an alternative way to hide CGI execution, my solution presents some
challenging hurdles. Document management and concurrent processing
become major concerns. The Spamivore algorithm works best when only
one spambot at a time is running against the installed Web site,
and it assumes that the spambot will execute the script at least
three times before document management takes place. Both of these
areas can be improved; for example, code could be added that deletes
all previously generated documents the first time the script runs.
Spamivore will not end spam. If spambots never execute the script
because they honor the robots.txt standard, the script will
not make much impact. However, along with other anti-spam mechanisms,
Spamivore is one more weapon to add to the arsenal in the never-ending
war against spam.
References
1. http://w3.one.net/~banks/spam.htm -- "Beating
the E-Mail Spammers" by Michael A. Banks
2. http://www.e-scrub.com/wpoison/ -- Wpoison home
site
3. http://www.turnstep.com/Spambot/lure.html -- "Spambot
Beware" by Greg Sabino Mullane
4. http://info.webcrawler.com/mak/projects/robots/exclusion.html
-- Robots exclusion page
5. http://hoohoo.ncsa.uiuc.edu/docs/tutorials/includes.html
-- NCSA httpd Server Side Includes (SSI) page
Charles Leonard is a Software Engineer for eLabor.com (formerly
jeTECH DATA SYSTEMS), where he has worked for the past six years.
Prior to eLabor.com, he worked five years as a Software Engineer
on the Deep Space Network at Jet Propulsion Laboratory in Pasadena,
CA. As a hobby, he also maintains Web sites for friends and family.
He can be reached at: cvl1989@hotmail.com.