Spam with Spamivore
Charles V. Leonard
How many of us are bombarded each day with undesirable junk email,
commonly known as spam? How do you purge your email account of unsolicited
spam? This article presents an application called Spamivore. Spamivore
attacks spam at its source by disabling the harvesting agents that
steal email addresses.
Spamivore consists of a 365-line Perl CGI (Common Gateway Interface)
script and a 75-line supplemental HTML template used to execute the
script. The development platform was Sun Solaris 2.x (System V,
Release 4.0), but the script has been ported to Windows 95/98/NT.
The Web server used in development was Apache 1.3.9, and the
scripts were tested with Netscape 4.0 and Internet Explorer 5.0.
Spambots vs. Robots
There are many ways spam fills your mailbox. A common method involves
spammers stealing email addresses off your Web site by using a spambot.
A spambot is a specific form of a broader class of Internet applications
known as robots. Robots are software applications running unattended
on the Internet collecting and cataloging information about Web
sites. Most robots are legitimately helpful applications. A common
robot, for example, is a Web crawler. Search engines such as Yahoo!,
Altavista, Excite, Lycos, and Google use Web crawlers to update
their databases. Web crawlers put your Web site on the Internet map.
Spam, Spam, Spam
Unlike Web crawlers, spambots are malicious entities; the email
addresses spambots collect are used to send mostly unwanted material.
How do spammers obtain email addresses? According to Michael A.
Banks in "Beating the E-Mail Spammers", doing any of the
following will potentially place you on a list:
- Post on an online service's or Internet bulletin board
- Post in a Usenet newsgroup
- Spend time in chat rooms on an online service
- Have a listing in an online service's member directory
Perhaps a few spammers collect email addresses manually by going
from site to site, but more spammers automate this task by using
spambots. Spamivore defends Web sites against these uninvited, intruding
spambots. (See the sidebar for more on spambots).
I became more than slightly annoyed about all the spam in the
world (especially in my POP3 account), so I decided to research
what could be done to stop it. One of the applications that kept
being referenced at almost every anti-spam site I visited was Webpoison
(or Wpoison). I read about Webpoison and thought the concept was promising.
I decided to create my own spambot trap when I learned I would
need to modify the httpd.conf file if I wanted to disguise
the CGI references as regular links. Because I manage two sites
located on Web hosting services, where I do not have the necessary
administration rights to make these modifications, I was left to
contemplate an alternative plan.
I came up with the idea of using Server Side Includes (SSI) as
an alternative way to cloak the spambot trap, which would not require
httpd administration rights. I chose the name Spamivore after viewing
a local news segment about Carnivore, the FBI's own potentially
intrusive application. (Ironically, Carnivore, like spambots, has
an interest in email.)
Spamivore Design Overview
The basic design for Spamivore is as follows:
1. Place a hidden link in the main HTML page that leads to critical
data (email addresses) that I do not want spambots to harvest.
2. The hidden link will not be seen by anyone except uninvited spambots.
3. Following the hidden link presents a page containing a Server
Side Include (SSI).
4. The SSI references a CGI script that randomly generates a page
full of email addresses, creates another SSI Web page, and creates
a corresponding link to this Web page for the spambot to follow.
5. The newly created SSI Web page contains another reference to
the CGI script. This CGI reference is never actually seen by the
spambot because the http daemon has already replaced it with the
content generated by the CGI before it reaches the spambot.
6. The CGI script contains a mechanism to maintain the physical
.SHTML (SSI) documents that are created.
7. After the spambot follows the third dynamic link, the CGI script
deletes from the system all the .shtml documents created
by the spambot (through our CGI script), except for the current
one it is viewing and the next one the spambot wants to link to.
8. At this point, there is the possibility that the spambot is
trapped and unable to go backwards to intrude on other areas of
the Web site. If the spambot's history caching is not very
sophisticated (i.e., it depends on the Web page still existing on
the Web site), the spambot will not be able to go back to the
previous page because that page no longer exists (it was deleted
in step 7).
9. To prevent good robots from falling into our trap,
add an entry to robots.txt that informs the robot to stay
away from these areas that have planted traps.
10. As an added safety measure for normal users who accidentally
fall into our trap, add a second link below the .shtml link
that allows the user to go back to the main homepage.
The Main Home Page
First, create the main HTML page for the Web site to be protected
against spambots. Since I use SSI in this page, the file will end
with an .shtml extension (all SSI pages end in this extension).
The example site is called "Contact Busters of America".
As its name implies, Contact Busters is a fictitious group dedicated
to the abolition of intrusive interruptions in life such as junk
email, junk mail, telephone solicitation, and other spammish enterprises.
In Listing 1 (the index.shtml for our fictitious Web site),
I use SSI to display today's date with the statement <!--#echo
var="DATE_LOCAL" -->. If SSI is enabled, the server replaces
this statement with today's date before delivering the page
to the client Web browser.
The anchor <a href="sa/saelist1.shtml"> is the link
that begins the spambot's journey through the spam trap. Note
the <FONT COLOR="#000000">...</FONT> tags around the word
"Contacts". These tags hide the link from visual
users. Since the background is black for this area, changing the
link font to black will make this link invisible to normal users,
but not to spambots. The COLOR="#000000" attribute changes
the normal link color to black regardless of whether the user has
visited this page before (at least for Netscape Version 4.0 and
Microsoft Internet Explorer 5.0).
Although a spambot normally attacks the homepage of a Web site,
it is not necessarily the first or only page the spambot visits.
Any page containing email addresses should have a hidden link as
the first link of the page. Spambots, like most robots, will simply
follow the first link they come to unless otherwise restricted.
(I will discuss this later in the robots.txt section.)
The Initial Spambot Trap Page
As I previously discussed, the Contact Busters main page has a
hidden link to sa/saelist1.shtml. The page sa/saelist1.shtml
introduces the spambot to Spamivore. The HTML code for saelist1.shtml
is described in Listing 2. The statement <!--#exec cgi="/cgi-bin/Contacts.cgi"
--> instructs the Web server to execute the script Contacts.cgi
and replace the statement with any output generated by Contacts.cgi.
When the initial saelist1.shtml is visited, Contacts.cgi
executes and displays a page full of email addresses. This is typical
of spambot traps, and Spamivore is not any different. This, however,
is not all Contacts.cgi does. Because I want to hide that
Contacts.cgi is a CGI script, I create physical-looking links
to any new page that is dynamically generated; this includes creating
links with the actual file extension .shtml.
To do this, physical .shtml Web pages must be created,
stored, and maintained on the Web site. When looking at the initial
spambot trap page, saelist1.shtml, it is easy to overlook
that the Contacts.cgi script does not dynamically create the
entire content of the Web page you are viewing. Instead, it uses
a host physical Web page that already exists on the server and
generates content for only a portion of that existing page.
Maintaining physical Web pages created dynamically by a CGI script
presents a whole new problem. After making the choice to create
these pages, when do you delete them? This was one of the primary
issues when developing the script.
Contacts.cgi Code Review
The Main Function
Contacts.cgi, Listing 3, informs the Web server of what
type of content to generate. The statement print "Content-type:
text/html\n\n"; tells the server to generate regular text or
HTML content. The next two lines are debug statements that allow
examination of the argument passed to the script and the IP address
of the remote client. The function clDebug() prints whatever
is passed to it whenever the global debug variable $bDebug is set.
All debug information is enclosed in HTML comments so the client
application ignores them. Typically, most graphical browsers will
not display comments, unless the "Page Source" option
is selected. Because spambots have access to these comments, turn
them off when Spamivore is installed on your site.
The next section of code, beginning with if (@ARGV), checks
whether an argument has been passed to the script. A defined argument
sets the Run Count. The Run Count is later used in the program to
control document purging. Document purging involves purging of .shtml
files previously generated and no longer in use by the spambot.
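The Run Count handling just described can be sketched in a few lines
of Perl. This is a simplified stand-in for the logic in Contacts.cgi,
not the listing itself; the function name clGetRunCount is my own
invention for illustration.

```perl
# Simplified stand-in for the Run Count logic in Contacts.cgi.
# With no argument (the spambot's first visit), the count stays 0;
# when the previous page's SSI reference passes a count, it is
# incremented by 1.
sub clGetRunCount {
    my (@args) = @_;
    my $nRunCount = 0;
    if (@args) {
        $nRunCount = $args[0] + 1;
    }
    return $nRunCount;
}

print clGetRunCount(), "\n";   # first visit, no argument
print clGetRunCount(0), "\n";  # second visit
print clGetRunCount(1), "\n";  # third visit
```

The count therefore runs 0, 1, 2 over the spambot's first three
visits, which is what the purging logic later relies on.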
After determining the Run Count, call function clGenMailAddrs().
As the name implies, it generates the random email addresses returned
to the spambot. After generating a page full of email addresses,
call the clCrNextTrapPage() function.
The clCrNextTrapPage() function creates the next .shtml
file I want the spambot to link to. After the new .shtml
is created and saved to a public area on the Web site, create a
link to the new .shtml page by calling the next two print
statements (a paragraph tag and an anchor tag).
Finally, call the clAddDelFile() function. This function
adds the new file I have just created to the Remove List. The list
is a file stored on the Web site to keep track of the random .shtml
files created and to be removed at a later time. The function clAddDelFile()
also determines document removal.
The clGenMailAddrs() Function
This function is responsible for generating the random banquet
of email addresses for the spambot. We have a finite set of nine
arbitrary email account name prefixes and a set of eight fictitious
email provider domain names. The nine email account name prefixes
are stored in array @elist1, and the eight fictitious domain
names are stored in array @elist2. Both are global arrays
defined at the beginning of the program.
The function first seeds the random number generator. Because
no argument is passed to the random number generator function srand,
Perl automatically uses the current system time as the seed.
Next, enter the Until loop that generates 50 randomly created
email addresses. Fifty addresses were chosen because they fit in
a browser display, and a second link (below the email listings)
can be viewed by a typical user without scrolling.
This second link allows a human viewer to escape from the spambot
trap. Fifty is also a reasonable number of email addresses to put
on a page without tipping off the spambot that the page is a set-up,
though I have no hard statistics to support this claim.
The "Until" loop is the basis of our random generator.
Each iteration of the loop creates a new random email address
and corresponding "mailto" HTML anchor. The first line
within the loop is a counter to keep track of how many phony email
addresses are being created. The next line creates a random number
with a limit of five digits. The variable $iNumExt is the
second part of the name used to create the email account. I will
also use $iNumExt later in function clCrNextTrapPage
as the second part of the .shtml file name.
The statement $iName=((rand) * 10) % 9 creates the index
value for the first part of our email account name. The statement
$iHost=((rand) * 10) % 8 creates the index value to the name
used for the phony domain name of our email account. The print
statement that follows the prior three random value assignments
creates the actual mailto: account name.
The variable $iName is used to index the array @elist1
(the first part of our email account name), and $iHost is
used to index the array @elist2 (the name of our phony account's
domain name), and $iNumExt is used as the second part of
our email account's name. For example, if $iNumExt were
1215, and $iName were 2, and $iHost were 7, the generated
line of HTML code generated by the print statement would be a
mailto: anchor combining $elist1[2], 1215, and $elist2[7] into a
single address.
Note the print statements have backslashes preceding each reference
to the "@" symbol. In Perl, the "@"
symbol is reserved for referencing and defining array variables.
The backslash is used to escape the symbol, so it is a literal reference
in string values.
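The address-generation loop described above can be sketched as
follows. The account prefixes and domain names here are made-up
stand-ins (the real @elist1 and @elist2 arrays in Contacts.cgi
contain different values), and the loop structure is a simplified
reading of the article's description, not the listing itself.

```perl
# Made-up stand-ins for the global arrays; the real Contacts.cgi
# uses different prefixes and domain names.
my @elist1 = qw(info sales webmaster admin support news staff help mail);
my @elist2 = qw(example1.com example2.net example3.org example4.com
                example5.net example6.org example7.com example8.net);

srand;  # no argument: Perl seeds from the current time

my @anchors;
my $iCount = 0;
until ($iCount >= 50) {
    $iCount++;                            # how many addresses so far
    my $iNumExt = int(rand(100000));      # random number, five digits max
    my $iName   = int((rand) * 10) % 9;   # index into @elist1
    my $iHost   = int((rand) * 10) % 8;   # index into @elist2
    # The \@ keeps Perl from treating "@" as an array sigil.
    my $sAddr = "$elist1[$iName]$iNumExt\@$elist2[$iHost]";
    push @anchors, "<a href=\"mailto:$sAddr\">$sAddr</a>";
}
print "$_\n" for @anchors;
```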
Listings 4 and 5 reference the code used to define the global
email arrays and the clGenMailAddrs function.
The clCrNextTrapPage() Function
The clCrNextTrapPage() function creates the next .shtml
page to which the spambot will link. The function creates a page
similar to the first initial .shtml page (saelist1.shtml).
Recall that saelist1.shtml is the page to which the spambot
first links from the hidden link in index.shtml. Listing
6 references the code used to define clCrNextTrapPage.
The first statement, $sHtmlName="$sEmRootName$iNumExt$sEmExt"
creates the name for the new .shtml page. The global variables
($sEmRootName, $iNumExt, and $sEmExt) are defined
at the top of the Contacts.cgi program's GLOBAL VARIABLES section.
The next line assigns the variable $sHtmlFullName with
the variable $sWrkHtDocs, a slash, and the variable $sHtmlName.
The variable $sHtmlFullName defines the full path name relative
to the CGI directory of the .shtml being created. Variable
$sWrkHtDocs defines the directory where the modifiable HTML
documents are stored and accessed by the uninvited spambot.
The example source code directory ../www/sa is where I
store and view the HTML documents. The variable $sEmRootName
contains the first part of the atomic file name being created. In
the example, saelist1_ will be the name of the first part
of the document. The variable $sEmRootName is configurable
and can be assigned another value if desired.
The second variable of the $sHtmlName assignment is $iNumExt,
first seen in function clGenMailAddrs. $iNumExt contains
the value of the last random value generated for the second part
of the email account name. I chose to use the last email address
number simply to avoid another call to the function "rand".
The last variable in the assignment is $sEmExt. This is a
global variable, set to .shtml, assigned at the top of the program.
The directory where pages are created must have read, write,
and execute privileges set for the world. Otherwise, the script
will not be able to create the page.
If the open statement fails, call the function die
and the program terminates, leaving the spambot to digest a strange
message but ending any further random email generation. If the open
statement succeeds, allow generation of the content for the new
file and save the information to disk.
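A minimal sketch of this create-and-save step is shown below. The
variable values mirror the article's examples, but the exact HTML
written to the page is an assumption; note the open ... or die
pattern that terminates the script if the directory permissions are
wrong. (The working directory stands in for ../www/sa here so the
sketch runs anywhere.)

```perl
# Build the new page name from the configurable globals, then write a
# host .shtml page whose only dynamic part is an "#include virtual"
# reference carrying the Run Count. The body HTML is an assumption.
my $sEmRootName = "saelist1_";
my $sEmExt      = ".shtml";
my $sWrkHtDocs  = ".";    # "../www/sa" in the article's setup
my $iNumExt     = 1215;   # last random number from clGenMailAddrs
my $nRunCount   = 1;

my $sHtmlName     = "$sEmRootName$iNumExt$sEmExt";  # saelist1_1215.shtml
my $sHtmlFullName = "$sWrkHtDocs/$sHtmlName";

open(my $fhHtml, ">", $sHtmlFullName)
    or die "Cannot create $sHtmlFullName: $!";
print $fhHtml "<html><body>\n";
print $fhHtml "<!--#include virtual=\"/cgi-bin/Contacts.cgi?$nRunCount\" -->\n";
print $fhHtml "</body></html>\n";
close($fhHtml);
```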
Another item of interest concerns the #include virtual
server side include reference. This reference executes the script
Contacts.cgi just as the #exec cgi SSI reference did
in the initial saelist1.shtml page. The difference is the
#include virtual allows passing arguments to the executing
script, whereas the #exec cgi reference does not.
Arguments are passed using the standard CGI method for passing arguments
(e.g., ScriptName?arg1+arg2+arg3, etc.). In this instance,
only one parameter, $nRunCount, is passed to the Spamivore script.
Creating the Next Link
The following line:
print "<a href=\"$sHtmlName\">Rest for the Weary!</a>\n";
create the link to the new page just stored to disk.
The variable assigned in clCrNextTrapPage, $sHtmlName,
is again used to create the name of the link. Since the link is
relative to the directory from which the browser is reading, use
$sHtmlName instead of $sHtmlFullName. This portion
of the code is displayed dynamically to the spambot as part of
the current page it is viewing.
Functions clAddDelFile and clRmFiles
Now that the next page is created to which the spambot can link,
the question arises, "When do I delete it?" Because of
the slightly awkward way in which physical pages are generated for
the spambot to devour (rather than relying exclusively on generating
pages "on the fly", as is normally done in CGI programs),
I have introduced a new problem of document management. clAddDelFile
is a function that attempts to address this issue.
Here are three issues of concern:
- I do not want to let the spambot generate pages that will eventually
fill up the disk.
- I do not want to delete the page the spambot is currently viewing.
- I do not want to delete the next page the spambot will link to.
The approach I chose was to create a disk file, named rmlist.txt
(the "Remove List"), listing the .shtml pages the
spambot created. Later, when I'm sure the spambot is finished
accessing the files, the program removes them. At this point, I
still haven't addressed when to perform the actual delete.
There were two approaches I could have used to attack this problem:
- Delete the documents using a second process, such as a background job.
- Delete the pages while the spambot is still in the trap.
I opted to devise a way to delete the pages while the spambot
is still in the trap. This is where I make use of the "Run
Count." The "Run Count" tells us when it is safe
to start the document purge process.
The first time the spambot enters the trap, the run count, $nRunCount,
is zero. The first time the CGI executes, no arguments exist, and
the run count is not incremented. The next time the script executes,
it runs from the newly created page, in which $nRunCount
is passed within the SSI reference. In this instance of the script,
detect the argument and increment the Run Count by 1. Now our run
count is 1. The third time the script executes, the Run Count is 2.
When the run count is two, it's probably safe to purge the
documents the spambot created. By this time, I have already created
one file ready to be deleted. However, we must not delete the document
the spambot is currently viewing or the next one it is about to view.
The clAddDelFile and clRmFiles functions implement
the requirements just described. The code for clAddDelFile
(see below) assigns $sHtmlName to the $sDelFile
variable. Next, create the delete path variable $sDelPath,
composed of $sDelHtPath and $sHtmlName.
The variable $sDelHtPath has the same value as $sWrkHtDocs,
except that it contains a slash at the end of its definition. Because
multiple platforms are maintained, $sDelHtPath is needed to
manipulate the directory delimiters based on the platform Spamivore
is installed on (e.g., UNIX uses "/", while Windows/NT uses "\").
This could have been automated by using Perl's s/// substitution
operator.
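For example, the delimiter handling could be automated along these
lines; the $^O platform check is my assumption about how the script
would detect Windows:

```perl
# Sketch: normalize directory delimiters with s/// instead of keeping
# two hand-maintained path variables. The $^O check is an assumption
# about how the platform would be detected.
sub clPlatformPath {
    my ($sPath) = @_;
    if ($^O =~ /MSWin32/i) {
        $sPath =~ s{/}{\\}g;   # turn UNIX "/" into Windows "\"
    }
    return $sPath;
}

print clPlatformPath("../www/sa/"), "\n";
```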
After creating the $sDelPath, check the Run Count. If the
count is greater than 2, call the clRmFiles function to start
the purging process. Otherwise, only add the file to the Remove
List file. The global variable $sRmList is used to define
the file name for the Remove List. Whether or not the clRmFiles()
function is called, always add the file to the Remove List.
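The bookkeeping just described might look like the following sketch.
The names follow the article ($sRmList, clAddDelFile, clRmFiles),
but clRmFiles is stubbed out here and the clLockFile/clUnlockFile
calls are omitted for brevity:

```perl
# Sketch of the Remove List bookkeeping. Names follow the article;
# clRmFiles is stubbed out, and the locking calls are omitted.
my $sRmList = "rmlist.txt";

sub clRmFiles { }   # purge pass; see Listing 8 in the article

sub clAddDelFile {
    my ($sHtmlName, $nRunCount) = @_;
    # Purge old pages first once the spambot is deep enough in the trap.
    clRmFiles() if $nRunCount > 2;
    # Whether or not we purged, always record the new file.
    open(my $fhList, ">>", $sRmList) or die "Cannot open $sRmList: $!";
    print $fhList "$sHtmlName\n";
    close($fhList);
}

clAddDelFile("saelist1_1215.shtml", 0);
clAddDelFile("saelist1_7704.shtml", 1);
```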
The calls to clLockFile and clUnlockFile prevent
contention problems if more than one spambot simultaneously accesses
the script. I only want one spambot at a time accessing the file
operations that are taking place. The functions were written with
portability in mind, hence the references to WIN32. The lock
routines are crude, using a disk file to implement the locking
mechanism. Lock files are not as reliable as an in-memory semaphore,
but they suffice here. See Listing 7 for the source code for
clAddDelFile.
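A disk-file lock of this kind can be sketched with sysopen and the
O_EXCL flag, which makes the create-if-absent test atomic. The names
$sRmLockFile and $nLockSleep follow the article's configuration
variables; the implementation itself is an assumption, not the
article's listing:

```perl
# A crude disk-file lock: creating the file with O_EXCL succeeds for
# exactly one process; everyone else sleeps and retries.
use Fcntl qw(O_CREAT O_EXCL O_WRONLY);

my $sRmLockFile = "rm.lock";
my $nLockSleep  = 1;   # seconds to wait between retries

sub clLockFile {
    # Spin until the exclusive create succeeds.
    until (sysopen(my $fhLock, $sRmLockFile, O_CREAT | O_EXCL | O_WRONLY)) {
        sleep $nLockSleep;
    }
}

sub clUnlockFile {
    unlink $sRmLockFile;
}

clLockFile();
# ... Remove List file operations would happen here ...
clUnlockFile();
```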
When the clRmFiles function is invoked, it sets
$nRmCount to zero, opens the Remove List file, and begins
deleting the files. However, I do not want to delete the last file
in the list because this is the one the spambot is currently viewing.
It may help you to visualize why I don't want to delete the
last file by tracing through our algorithm. If you do, it will become
clear that by the time I enter the clRmFiles function, the
last file added to this list (in the prior run of Contacts.cgi)
should be the one the spambot is viewing. Also, notice the link
file just created has not yet been added to the Remove List. Thus,
not removing the last file allows the page the spambot is viewing
and the next page that was just created to remain on disk (while
all others are deleted).
The foreach loop implements the ideas of the previous two
paragraphs. By using the $nRmCount (the Remove Count) variable,
I keep track of the iterations. For the first iteration, I do not
delete the first file read. I delete this file in the second iteration.
When in the last iteration of the loop, the previous file read in
the prior iteration is removed, but the last file read is left to
remain on disk. The function unlink does the actual
file deletion. See Listing 8 for the source code for clRmFiles.
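The keep-the-last-file pass can be sketched as below. This is a
trimmed reading of Listing 8, not the listing itself; the return
value is my own addition for illustration:

```perl
# Trimmed reading of the purge pass: each pass through the loop
# deletes the file read on the *previous* iteration, so the last
# entry in the Remove List (the page the spambot is viewing) is
# never deleted.
sub clRmFiles {
    my ($sRmList, $sDelHtPath) = @_;
    open(my $fhList, "<", $sRmList) or die "Cannot open $sRmList: $!";
    my $nRmCount = 0;
    my $sPrev;
    foreach my $sFile (<$fhList>) {
        chomp $sFile;
        unlink "$sDelHtPath$sPrev" if $nRmCount > 0;  # one step behind
        $sPrev = $sFile;
        $nRmCount++;
    }
    close($fhList);
    return $sPrev;   # the surviving file
}
```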
After the foreach loop completes, check whether the $bRmAll
flag has been set. If it has, delete the last file. I have left
this in as a hook for future use, where I may want to delete all
files when the $nRunCount variable is 0. This is a condition (assuming
that only one spambot is running the script at a time) where it
will be safe to purge the files listed in our Remove List (including
the last file).
If the $bRmAll flag is not set, write the last file not
deleted to a new temporary Remove List. Before leaving the clRmFiles
function, delete the old Remove list and rename the temporary file
as the new Remove List. When clRmFiles exits, return to clAddDelFile,
where the new link file created in clCrNextTrapPage is added
to the new Remove List.
The new Remove List now contains two files -- the file not
deleted and the file for the new link just created. In addition
to updating the new Remove List, I also have a feature that will
log the page created for the spambot along with its remote IP address.
If logging is turned on, write this information to the file named
by the variable $sLogFile (which in the example is set to
contacts.log). The logging feature is useful for keeping
track of what (potential) spambots have visited the trap.
Installing and Running Spamivore
The primary configuration variables you are most likely to modify are:
- $sWrkHtDocs -- Used to specify the public directory
from where the .shtml pages will be created and deleted.
In the example, this is set to the ../www/sa directory.
- $sDelHtPath -- This will be the same value as $sWrkHtDocs
on UNIX systems with an additional "/" added to the
end of the value. Because I maintain Web sites on more than one
platform, I need to specify a different path structure when deleting files.
- $sLogDir -- This is used to specify the public
logging directory. This directory will be used to house the Remove
List and the contacts log. In the example, this is set to the
public directory.
Optional Configuration Variables
- $sRmList -- Used to define the name of the Remove
List. In the example, it is set to $sLogDir/rmlist.txt.
- $sNewRmList -- Used to define the name of the temporary
Remove List file created in clRmFiles.
- $sLogFile -- Used to define the name of the log
file. In the example, this is set to $sLogDir/contacts.log.
- $sRmLockFile -- Used to define the name of the lock file.
- $nLockSleep -- Used to tell the lock function how
long to sleep (in seconds) before checking whether the file
resource is free.
To install Spamivore:
- Install the main Web page (index.shtml) in the target
directory to be protected from spambots. In the example, I put
this in the root directory. Note: In the source delivery, the
file is named samain.shtml. This can be renamed as index.shtml
and installed as described above.
- Include a robots.txt file in the root directory of the
Web site. See the section robots.txt below for more information
about what this file is and how it should be used.
- Create the public logging directory with read, write, and execute
access for the world. The logging directory is where the rmlist.txt
and contacts.log files reside. In the example, this is
the "public" directory.
- Create the directory where the Spamivore .shtml files
will be created and destroyed. Again, this directory must have
world read and write privileges. In the example, this was the
www/sa directory. If the Apache Web server is used, this
directory will most likely be htdocs/sa.
- Install the initial saelist1.shtml in the Spamivore
.shtml directory (www/sa in the example).
- Install Contacts.cgi in the cgi-bin directory.
To run Spamivore:
- Point your browser at the index.shtml just installed.
- Find the hidden link within index.shtml (the link to
sa/saelist1.shtml).
- Click on the hidden link. You should see the same email addresses
and links the spambot sees.
robots.txt
The robots.txt file informs robots of what files and directories
to avoid on your site. Place the file in the Root Document directory.
Include in this file references to any file or even whole directories
containing critical email address listings. In my robots.txt
file, for example, I include the pages that contain email addresses
as well as the directory where the Spamivore SSI pages reside
(the sa directory).
Good robots read the robots.txt file and avoid areas with
references to the names specified, thus, saving them from the trap.
Bad robots are less likely to observe robots.txt. However,
documentation at the Wpoison Web site indicates that even spambots
are honoring this rule now. This means a chance exists that just
putting an entry in robots.txt is enough to keep spambots
from sniffing through your email addresses. Including a spambot
trap as a backup precaution protects your site in either case.
Below is an example robots.txt file:
# FILE NAME: robots.txt
User-agent: *
Disallow: /sa/
For more information on robots.txt, see the Robots exclusion page
listed in the references.
Conclusion
I have presented one version of a spambot trap to help in the
fight against spam. The script may interest Web masters who maintain
Web sites on remote hosting services where they do not have direct
access to the Web server configuration files. It may also interest
Web masters who do not want to maintain non-standard "cgi-bin"
directories (which would be required to hide Wpoison type CGI scripts).
Although I have demonstrated that Server Side Includes do offer
an alternative way to hide CGI execution, my solution presents some
challenging hurdles. Document management and concurrent processing
become major concerns. The Spamivore algorithm works best when only
one spambot at a time is running on the installed Web site.
Spamivore assumes that the spambot will execute the script at
least three times before document management takes place. Both these
areas can be improved. For example, add code to the script that
deletes all documents the first time the script runs.
Spamivore will not end spam. If spambots never execute the script
because they honor the robots.txt standard, the script will
not make much impact. However, along with other anti-spam mechanisms,
Spamivore is one more weapon to add to the arsenal in the never-ending
war against spam.
References
1. http://w3.one.net/~banks/spam.htm -- "Beating
the E-Mail Spammers" by Michael A. Banks
2. http://www.e-scrub.com/wpoison/ -- Wpoison home
3. http://www.turnstep.com/Spambot/lure.html -- "Spambot
Beware" by Greg Sabino Mullane.
4. http://info.webcrawler.com/mak/projects/robots/exclusion.html
-- Robots exclusion page.
5. NCSA httpd Server Side Includes (SSI) page.
Charles Leonard is a Software Engineer for eLabor.com (formerly,
jeTECH DATA SYSTEMS), where he has worked for the past six years.
Prior to eLabor.com, he worked five years as a Software Engineer
on the Deep Space Network at Jet Propulsion Laboratory in Pasadena,
CA. As a hobby, he also maintains Web sites for friends and family.
He can be reached at: email@example.com.