Parsing RSS Files with XML::RSS

Derek Vadala
Fall, 2002

RSS, or the RDF Site Summary, is an XML format that uses the Resource Description Framework to provide a flexible, easy-to-use method for syndicating Web site content. The Resource Description Framework (RDF) is an amalgamation of traditional and Web-based data sharing methodologies that uses XML as its delivery agent. RSS is extremely useful because it allows people to share dynamic information like world news, computer security alerts, or a list of restaurant reviews using a common interface and format. So, if you are a content provider, you can use RSS to distribute data to users and business partners. If you are an end-user, you receive content that is simple to reformat and integrate into your applications and Web pages. The real benefit for both parties is that everyone can view the content using the interface (a PDA, desktop Web browser, or email client) that they prefer.

Today, many Web sites use RSS to share dynamic content with customers, partners, and Web surfers. Web sites including Security Focus (http://www.securityfocus.com), Kuro5hin (http://www.kuro5hin.org), and O'Reilly's Meerkat wire service (http://www.oreillynet.com/meerkat) all use RSS as a method of distributing their content.

RSS has had three notable incarnations, including its current version (1.0). The original version (0.9) was developed by Netscape as a mechanism for providing user-tunable content on the My Netscape Network. The subsequent version (0.91) was also created by Netscape and has caused some confusion about the evolution of RSS, because it kept the RSS name without following directly from version 0.9. Also known as the Rich Site Summary, Netscape's 0.91 version implemented new features from Dave Winer's (UserLand.com) scriptingNews format and dropped the RDF schema. This created a fork in RSS development and left users with a couple of follow-up versions (0.92 and 0.93) that weren't fully backward or forward compatible. Perl's XML::RSS module supports versions 0.9, 0.91, and 1.0, and in most cases can also parse 0.92 and 0.93 documents. For more information about the RSS version history, check out Andrew King's excellent summary at:

http://www.webreference.com/authoring/languages/xml/rss/1/

Inside RSS Files

An RSS file is simply a structured text file that describes dynamic Web content. In addition to a root-level RDF schema (which includes the RSS version number), each RSS file also contains channel information, or meta-data about the Web site that provided, or produced, the content. Some sites might offer multiple RSS feeds organized by topic, while some might offer only a single, chronological feed. The channel element always contains the following sub-elements:

Title         The title of the channel/Web site.
Link          The home page for the channel.
Description   A description of the content provided.
Items         A table of contents for the items in the RSS file.

Each item listed in the channel's table of contents also contains an item entry with the following sub-elements:

Title         The title/name of the item.
Link          The item's URL.
Description   A description of the item.

Although both title and link are required for each item, the description is optional. RSS also provides optional image and textinput elements. The image element provides a way to associate an image with the RSS content, usually a Web site or corporate logo. An image element contains the following sub-elements: title, url, and link. In this instance, url specifies the location of the image, and link specifies the image's target link.

Finally, the textinput element provides a way to submit data to the content provider. In many cases, this is used to provide a "searchbox" capability, but is also used to provide services like newsletter subscriptions. A textinput element contains a title, description, name, and link; name specifies the input field's name in the context of an HTML form.
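
To make the layout described above concrete, here is a minimal sketch that embeds a small, hand-written RSS 1.0 document in a Perl script and reads it back with XML::RSS. The channel, the item titles, and every example.com URL are invented for illustration; real feeds usually carry more elements than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# A hand-written RSS 1.0 document. All names and URLs are placeholders.
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.com/news.rdf">
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Headlines from an imaginary site.</description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.example.com/story1.html"/>
        <rdf:li rdf:resource="http://www.example.com/story2.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.example.com/story1.html">
    <title>First story</title>
    <link>http://www.example.com/story1.html</link>
    <description>The description is optional.</description>
  </item>
  <item rdf:about="http://www.example.com/story2.html">
    <title>Second story</title>
    <link>http://www.example.com/story2.html</link>
  </item>
</rdf:RDF>
END_RSS

my $rss = XML::RSS->new;
$rss->parse($xml);

# The channel carries the site-level metadata.
print "Channel: $rss->{channel}{title} ($rss->{channel}{link})\n";

# Each item element becomes one hash in the items array.
for my $item (@{ $rss->{items} }) {
    print "  $item->{title} -> $item->{link}\n";
}
```

Note that the second item has no description, which is legal because only title and link are required.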

For more details about each of these elements and some additional optional elements, refer to the RSS 1.0 specification at:

http://groups.yahoo.com/group/rss-dev/files/specification.html
or the RSS 0.91 specification at:

http://backend.userland.com/rss091

Installing the XML::RSS Module

It's probably already obvious that Perl is well suited to handling RSS documents. Using just a few Perl modules, it's easy to create a simple, yet effective script that can aggregate dynamic content from a variety of Web resources. The most important of these modules is XML::RSS, written by Jonathon Eisenzopf and Rael Dornfest. In addition to using XML::RSS to parse and reformat RSS files, as I'll illustrate in this article, XML::RSS can also create RSS files. So, it's useful for generating dynamic content as well as collecting it.
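
Although the rest of this article concentrates on parsing, a brief sketch of the generating side may help. It uses the module's new, channel, add_item, and as_string methods; the channel details and example.com URLs are invented placeholders, and a real feed would likely set more fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# Build an RSS 1.0 document in memory. All values are placeholders.
my $rss = XML::RSS->new(version => '1.0');

$rss->channel(
    title       => 'Example News',
    link        => 'http://www.example.com/',
    description => 'Headlines from an imaginary site',
);

$rss->add_item(
    title       => 'First story',
    link        => 'http://www.example.com/story1.html',
    description => 'An optional summary of the story',
);

# Serialize the document; $rss->save($filename) writes it to disk instead.
my $xml = $rss->as_string;
print $xml;
```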

XML::RSS relies on Expat, a free XML parser library. Download Expat from http://expat.sourceforge.net and install it on your system, if it's not already installed. Alternatively, you should be able to get a package file from your distributor; in that case, make sure to install both the expat and the expat-devel packages.

The easiest way to install XML::RSS is with the CPAN module. The code that I have included also relies on libwww-perl (LWP). So, if it's not already installed, you should install libwww-perl as well:

# perl -MCPAN -e shell
cpan> install LWP
cpan> install XML::RSS
If the automated method of module installation fails, download, compile, and install XML::RSS from:

http://www.cpan.org/authors/id/E/EI/EISEN/XML-RSS-0.97.tar.gz
and libwww-perl from:

http://www.cpan.org/authors/id/GAAS/libwww-perl-5.64.tar.gz
See the README file included with each module for installation help.
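
Whichever installation route you take, it can be worth confirming that Perl can actually load the modules before moving on. The following is one possible check, not part of the original program; module_version is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a module's version string,
# or undef if the module can't be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::RSS -> XML/RSS.pm
    return undef unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

# The modules that rss2html.pl loads.
for my $module (qw(LWP::UserAgent HTTP::Request XML::RSS)) {
    my $version = module_version($module);
    print defined $version
        ? "$module $version is installed\n"
        : "$module is MISSING\n";
}
```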

rss2html.pl

The rss2html.pl program that I have included here illustrates an easy application of XML::RSS's parsing capability. It retrieves an RSS file from the Web, converts it to a list of HTML links, and stores the results in a file.

 1 #!/usr/bin/perl
 2
 3 # Set for your system.
 4 $path = "/usr/local/www/news";
 5
 6 # Add additional rss_parse lines as needed.
 7 rss_parse("brunching.html", "http://www.brunching.com/brunching.rss");
 8 rss_parse("certcc.html", "http://www.cert.org/channels/certcc.rdf");
 9 rss_parse("pigdog.html", "http://www.pigdog.org/pigdog.rdf");
10
11 sub rss_parse {
12
13         use LWP::UserAgent;
14         use HTTP::Request;
15         use XML::RSS;
16  
17         my ($file, $url, $ua, $req, $res, $rss);
18         ($file, $url) = @_;
19  
20         # Fetch our URL, or exit if there's an error.
21         $ua = LWP::UserAgent->new;
22         $req = HTTP::Request->new(GET => $url);
23         $res = $ua->request($req);
24         return if $res->is_error();
25  
26         # Parse the RSS file (URL)
27         $rss = new XML::RSS;
28         $rss->parse($res->content);
29  
30         open RSS, ">$path/$file" or return;
31         foreach my $item (@{$rss->{'items'}}) {
32
33                 print RSS qq(<br>&#149;&nbsp;<A HREF="$item->{'link'}" target=_blank>$item->{'title'}</A>\n);
34         }
35         close RSS;
36 }
I prefer to keep all of my parsed RSS files in the same directory, which I've specified using the $path variable (line 4). This makes it easy for me to include them in my own Web pages. Next I call the rss_parse subroutine with a filename and URL (lines 7-9). You can add as many entries as you like to retrieve and reformat the RSS feeds you desire.

All of the major work takes place in the rss_parse subroutine. First, rss_parse retrieves the RSS file located at $url using LWP::UserAgent and HTTP::Request (lines 20-24). Of particular note is line 24, which provides some error checking for the HTTP request. If the RSS file can't be retrieved, or an error is encountered (404 Not Found, for example), the subroutine returns. The is_error method of the response object is also useful for developing additional error checking elsewhere in your program. In this particular case, I used it because I don't want to overwrite an existing HTML file that was successfully written during a previous execution of rss2html.pl. Let's say that I'm grabbing RSS feeds from a dozen or more Web sites, and one of them becomes unavailable for a couple of hours. In that case, I'll simply be left with a file that's a couple of hours old, because rss_parse returns on line 24 before the existing file can be overwritten.
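
If you want to know why a fetch failed rather than silently skipping it, the same check can be wrapped in a small helper. This is a sketch only: fetch_feed is a hypothetical name, and the 30-second timeout is an assumed value rather than anything rss2html.pl sets.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: return the body of $url, or undef (after
# logging the reason) if the request failed.
sub fetch_feed {
    my ($url) = @_;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);    # assumed limit; tune for your network

    my $res = $ua->request(HTTP::Request->new(GET => $url));
    if ($res->is_error) {
        # status_line combines the code and message, e.g. "404 Not Found"
        warn "Skipping $url: ", $res->status_line, "\n";
        return undef;
    }
    return $res->content;
}
```

Inside rss_parse, a call such as `my $content = fetch_feed($url); return unless defined $content;` would preserve the original behavior of leaving the previous HTML file untouched, while the warning ends up in cron's mail or your log.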

Next, I create a new RSS object ($rss) and pass the retrieved content to XML::RSS's parse() method (lines 27-28). This populates several hashes in my $rss object, including $rss->{channel}, $rss->{image}, and $rss->{textinput}, along with an array of hashes, $rss->{items}. Each of these corresponds to a top-level element from the RSS specification and contains the corresponding sub-elements.
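
The sketch below shows roughly what those hashes look like after a parse, this time using a small RSS 0.91 document that includes the optional image and textinput elements. The document is invented for illustration, and the checks before printing are there because a given feed may omit either element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# A hand-written RSS 0.91 document. All names and URLs are placeholders.
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Headlines from an imaginary site.</description>
    <language>en-us</language>
    <image>
      <title>Example News</title>
      <url>http://www.example.com/logo.gif</url>
      <link>http://www.example.com/</link>
    </image>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <textinput>
      <title>Search</title>
      <description>Search Example News</description>
      <name>query</name>
      <link>http://www.example.com/search.cgi</link>
    </textinput>
  </channel>
</rss>
END_RSS

my $rss = XML::RSS->new;
$rss->parse($xml);

print "channel:   $rss->{channel}{title}\n";

# image and textinput are optional, so test before using them.
print "image:     $rss->{image}{url}\n"      if $rss->{image}{url};
print "textinput: $rss->{textinput}{name}\n" if $rss->{textinput}{name};

# items is a reference to an array of hashes, one per item element.
print "items:     ", scalar @{ $rss->{items} }, "\n";
```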

After opening my output file (line 30), I use a foreach loop to iterate through all item elements in my $rss object (line 31). Remember that each element of the array $rss->{'items'} is actually a hash representing the item's sub-elements. So, for each $item produced by the foreach loop, I am left with the hash entries $item->{'title'}, $item->{'link'}, and the optional $item->{'description'}. Generating HTML from these RSS items is now as simple as reformatting them.

In rss2html.pl, I have opted to create simple bulleted lists of links (line 33). You can replace the HTML on line 33 with any tags that suit your needs using the title, link, and description hash elements as needed. It's also possible to include information about the channel. Adding the following lines just before the foreach loop will add a channel title and description to each HTML file created:

print RSS qq(<b>$rss->{'channel'}->{'title'}</b><br>);
print RSS qq(<i>$rss->{'channel'}->{'description'}</i><P>);
Remember that $rss->{'channel'} is a hash that contains elements in accordance with the RSS specification. You can look at the POD for XML::RSS (perldoc XML::RSS) for more information on the composition of an $rss object.

I use server side includes to load the HTML files that rss2html.pl creates into table cells, but you could also use them as content for a frame or pop-up window. In fact, you don't need to generate HTML at all. You could optionally insert the data into an SQL database or Berkeley DB file.

Once you've modified the output format to meet your needs, you will have to decide how to best automate rss2html.pl. I prefer to run it as a cron job every 15 minutes. Just create a crontab entry like the following:

5,20,35,50 * * * *  /usr/local/httpd/scripts/rss2html.pl
If you are going to download multiple RSS files from the same site, it's polite to wait a few seconds between each pull. For example, sites like Security Focus (http://www.securityfocus.com) offer various categories, and provide a CGI interface for automatically generating customized RSS files. In a case like this, you should add a sleep command between each call to the rss_parse subroutine. Just add the following line at the end of rss_parse (after line 35):

sleep 3;
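
An alternative, sketched here, is to keep the pause in the calling code so that the script doesn't also sleep after the final feed. The process_feeds name, the feed list, and the stand-in handler are all hypothetical; in rss2html.pl the handler would be a reference to rss_parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: call $handler->($file, $url) for each feed,
# pausing $delay seconds between pulls but not after the last one.
# Returns the number of feeds processed.
sub process_feeds {
    my ($handler, $delay, @feeds) = @_;
    my $count = 0;
    for my $i (0 .. $#feeds) {
        $handler->(@{ $feeds[$i] });
        $count++;
        sleep $delay if $delay && $i < $#feeds;
    }
    return $count;
}

# Placeholder feeds from a single imaginary site.
my @feeds = (
    [ 'vulns.html', 'http://www.example.com/rss.cgi?category=vulns' ],
    [ 'tools.html', 'http://www.example.com/rss.cgi?category=tools' ],
);

# Stand-in handler; pass \&rss_parse here in the real script.
my $handler = sub {
    my ($file, $url) = @_;
    print "would fetch $url into $file\n";
};

process_feeds($handler, 3, @feeds);
```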
You might also want to provide a file name and URL on the command line, instead of in the rss2html.pl code. Replace lines 3-9 with the following code:

@ARGV == 2 or die("Usage: rss2html.pl <filename> <url>\n");
($file, $url) = @ARGV;
rss_parse($file, $url);
Now rss2html.pl takes a filename and URL as its command-line arguments. Don't forget to remove the $path variable from the open statement on line 30 -- since we're specifying the filename on the command line, we can include the path there, too.

It should be quite easy to reuse the rss_parse subroutine in other programs either as is or with some minor changes. If you don't want to bother writing content to a file in the rss_parse subroutine, you could remove the open and close statements and pass rss_parse only a URL. Change line 18 so that it only initializes the $url variable:

($url) = @_;
Instead of printing to the file handle RSS (line 33), concatenate the data into a variable so that you can return it for use in other parts of your program:

$html .= qq(<br>&#149;&nbsp;<A HREF="$item->{'link'}" target=_blank>$item->{'title'}</A>\n);
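
Putting those changes together, a reusable version might look something like the following sketch. It departs from the description above in two ways that are my assumptions rather than part of rss2html.pl: it accepts the already-fetched RSS content instead of a URL, which keeps it independent of the network, and it escapes the title and link, since the parser hands back decoded text that may contain characters like & or <. The names rss_to_html and escape_html are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical helper: escape text for safe inclusion in HTML.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: turn RSS content into a string of HTML links.
# Returns undef if the content can't be parsed.
sub rss_to_html {
    my ($content) = @_;

    my $rss = XML::RSS->new;
    # parse() dies on malformed XML, so trap the failure.
    return undef unless eval { $rss->parse($content); 1 };

    my $html = '';
    for my $item (@{ $rss->{items} }) {
        my $link  = escape_html($item->{link});
        my $title = escape_html($item->{title});
        $html .= qq(<br>&#149;&nbsp;<A HREF="$link" target=_blank>$title</A>\n);
    }
    return $html;
}
```

A caller can then decide what to do with the result, for example printing it directly from a CGI script or skipping the feed when rss_to_html returns undef.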

Resources for RSS Content

Programming with XML::RSS is easy when compared to finding reliable RSS feeds. xmlTree (http://www.xmltree.com) has a decent listing of free RSS feeds and is a good starting point. Many Web sites offer RSS feeds, but you'll have to do a little digging to discover the URL. There are a couple of tricks, too. As a general rule, sites that run SlashCode provide the file domainname.rdf by default. So, for Slashdot, download http://www.slashdot.org/slashdot.rdf. Sites running Scoop export the file backend.rdf; on Kuro5hin, for example, it's http://www.kuro5hin.org/backend.rdf. PHP-Nuke sites use backend.php by default (http://phpnuke.org/backend.php, for example).

Another recent idea that has been making the rounds on various blogs and RSS-related forums is using the HTML LINK element to point to a site's RSS feed from the HTML index page. This idea was originally proposed by Bill Kearney back in 2001, but interest resurfaced in June 2002 when Matt Griffith (http://matt.griffith.com) raised it again. The concept is actually quite simple. Webmasters add a LINK element to their root HTML document that references the site's RSS feed. For example:

<link rel="ALTERNATE" type="application/rss+xml" title="RSS"
href="http://www.userland.com/xml/rss/xml">
Sites using the HTML LINK element to point to their RSS feed make life much easier for users wanting to access their content, because it eliminates the arduous process of searching for the RSS feed. I encourage anyone running a site that provides RSS content to follow this convention. A lot of sites have already implemented the HTML LINK, so it should be easy to find syndicated content by simply visiting sites containing content you are interested in and viewing the HTML source of the front page. For more information about using the HTML LINK element with RSS feeds, take a look at Ben Hammersley's Weblog entry on the topic:

http://www.oreillynet.com/cs/weblog/view/wlg/1475
I must confess that it would be nice to see future support for this in the form of an LWP function.
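
In the meantime, something similar can be approximated by hand. The sketch below scans a page's HTML for the first LINK element that advertises an RSS feed, using HTML::TokeParser from the HTML-Parser distribution. The function name find_rss_link is hypothetical, the sample page is made up around the LINK element shown above, and relative href values are returned as-is rather than resolved against the page's URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the href of the first LINK element that
# advertises an RSS feed, or undef if the page has none.
sub find_rss_link {
    my ($html) = @_;

    my $parser = HTML::TokeParser->new(\$html) or return undef;
    while (my $tag = $parser->get_tag('link')) {
        my $attr = $tag->[1];    # hash of the tag's attributes
        next unless lc($attr->{rel}  || '') eq 'alternate';
        next unless lc($attr->{type} || '') eq 'application/rss+xml';
        return $attr->{href};
    }
    return undef;
}

# A made-up front page containing the LINK element.
my $page = <<'END_HTML';
<html><head><title>Example</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="ALTERNATE" type="application/rss+xml" title="RSS"
href="http://www.userland.com/xml/rss/xml">
</head><body>Hello</body></html>
END_HTML

my $feed = find_rss_link($page);
print defined $feed ? "Feed: $feed\n" : "No feed advertised\n";
```

In practice you would fetch the front page with LWP first and pass the response content to find_rss_link.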

While there are many resources on the Web that provide free RSS content, most of these providers require that you follow a couple of rules if you're going to use their content free of charge. First, it's a generally accepted principle that you should provide a link back to the original site from your own when using content from another site. It's also a good idea not to repeatedly hammer remote sites. Content doesn't change that often. A cron job that updates local data every ten or fifteen minutes is more than adequate for keeping your site up to date. Many sites also have special rules for commercial reuse. So, be certain to check with individual Web sites when using their syndicated content.

Derek Vadala (derek@cynicism.com) is a Linux and Security consultant living in New York City. He is the author of Managing RAID on Linux, published by O'Reilly & Associates.