Does SOAP Suck? (Web Services, Google, and the What-Sucks-O-Meter)

Dan Brian
Month, Year

This article is essentially a follow-up to my article "Parsing Natural Language" (The Perl Journal, Fall 2000, #19), which described ways of using a Link Grammar Parser to determine which object in a sentence "rocks" or "sucks". This article is not about SOAP or Web Services, per se. It's about a program that builds on the grammar parsing to find what the Internet public has to say about certain topics. The program does provide some insight into the rationale for Web services, and some details on using them in a modest setting. I'll elaborate.

Confessions of a Content Hoarder

I've been writing automated content-grabbers for a while, since before LWP existed. Sending GET requests to stock and weather services, search engines, and reference sites with only sockets was something everyone was doing:

$AF_INET = 2; $SOCK_STREAM = 1;       # pre-Socket.pm constants (2 on some older systems)
$sockaddr = 'S n a4 x8';
chop($hostname = `hostname`);         # this machine's name

($name,$aliases,$proto) = getprotobyname('tcp');
($name,$aliases,$port) = getservbyname('http','tcp');
($name,$aliases,$type,$len,$thisaddr) = gethostbyname($hostname);
($name,$aliases,$type,$len,$thataddr) = gethostbyname('www.google.com');

$this = pack($sockaddr, $AF_INET, 0, $thisaddr);
$that = pack($sockaddr, $AF_INET, $port, $thataddr);

unless (socket(S, $AF_INET, $SOCK_STREAM, $proto)) { die $!; }
unless (bind(S, $this)) { die $!; }
unless (connect(S, $that)) { die $!; }
select(S); $| = 1; select(STDOUT);    # unbuffer the socket
print S "GET /\n";
while (<S>) { print; }
Notice there's no use strict. No use, for that matter. Pattern-matching the response strings to dig the pertinent info out of badly formed HTML was something of a pastime for programmers of IRC bots, CGI scripts, and natural language dabblers. If you're wondering why I'm speaking in the past tense, you should give CPAN a visit. Things have changed.

Later, libwww allowed us to do this much more easily, and LWP::Simple further shortened our code:

use LWP::Simple;
$content = get("http://www.google.com");
And, of course, HTML::Parser helped us extract content without using ugly regexes. And now, all you really need for lightweight jobs is WWW::Search, regularly updated to do its own parsing of the returned HTML:

use WWW::Search;
my $search = WWW::Search->new('AltaVista');
my $query = "rules";
$search->http_proxy('http://proxy.office.orem.verio.net:8888');
$search->maximum_to_retrieve(3000);
$search->native_query(WWW::Search::escape_query($query));
while (my $result = $search->next_result()) {
  print $result->url, "\n";
}
As most hoarders know, Google and some other search engines disallow such automated crawling, because not only is it intrusive, but it can also violate terms of use. To accommodate the curious, Google has provided an alternate means of retrieving this type of content, which I will discuss momentarily.

Why are we using HTML again?

At some point, someone started wondering why we were fetching and parsing HTML, rather than just getting the data directly. Actually, transferring pure content over the Internet wasn't anything new, but HTML *did* provide a modicum of format organization, which let a programmer know where to find the desired content. XML ("simple SGML") gave us a display-independent means of structuring data without extraneous presentation markup, making it easy to distribute content outside of HTML. Of course, Netscape's RSS format:

http://my.netscape.com/publish/formats/rss-spec-0.91.html
was instantly popularized as the standard way to publish channels of Web content.

As soon as people began publishing content via XML for news feeds, Web loggers, and so on, they realized that working with data outside of display markup was beneficial not only in publishing, but also in processing. I know of more than a few companies with reseller channels who have wanted to automate their ordering processes. To do so, many have written scripts that automate posting (as in HTTP POST) a Web form to the order-processing site. This got the job done, but it required the resellers to understand the proprietary format of the Web form and to parse the returned HTML. Error handling was difficult, enhancements or requested changes required reworking several layers of programs, and parsing anything but flat data structures was cumbersome, if not impossible. The earth is not flat.

Web Services

So, rather than having everyone just request XML for inclusion in Web sites, we wanted to submit XML for processing by servers. This was the rediscovery of Remote Procedure Calls, now wrapped up in the pretty standards of XML. XML-RPC and SOAP are separate incarnations of this "new" technology, which in many ways has taken us full circle, this time bringing XML along for the ride.

"Web Services" is the industry's fancy term for using XML for RPC, in some cases allowing objects themselves to be fully distributed. I'm not going to go into too much depth on the specifics of distributing objects, because the process is basically the same for all remote procedures. The difference is only in the program interface used. All Web Services allow data to be transferred from one location to another via XML, although the transfer mechanisms differ.

The simplest example of calling a Web service over, say, HTTP (an established protocol) would be to have the Web server map functions to URI namespaces:

http://brians.org/AuctionCatalog/
Now, a GET request sent to this path could result in the response:

<AuctionCatalog>
  <Item>
    <Title>Dan's Shoes</Title>
    <Price>20.50</Price>
  </Item>
  <Item>
    <Title>Placemats</Title>
    <Price>14</Price>
  </Item>
</AuctionCatalog>
Such a service could let others regularly update their databases or Web sites from my auction. If someone were reselling my personals via their own Web site, they could take bids there and automate posting each bid to my site with a PUT or POST request containing XML that describes the bid (minus appropriate commission fees, of course). The server would have to process the incoming XML appropriately. This is a Web Service, albeit a primitive one. I should mention that there are a few who would consider this implementation superior to those that conform to SOAP, whether for its simplicity, its use of existing technology, or its relatively low overhead. I don't.
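
For illustration, here is roughly what the reseller's side of that bid submission might look like with plain LWP. The endpoint is the hypothetical one above, and the bid format is invented for this sketch; a real service would publish its own:

use LWP::UserAgent;
use HTTP::Request;

# A hypothetical bid document, invented for this sketch.
my $bid_xml = <<'XML';
<Bid>
  <Title>Dan's Shoes</Title>
  <Amount>22.75</Amount>
  <Bidder>reseller@example.com</Bidder>
</Bid>
XML

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(POST => 'http://brians.org/AuctionCatalog/');
$req->content_type('text/xml');
$req->content($bid_xml);

my $res = $ua->request($req);
die "bid failed: ", $res->status_line, "\n" unless $res->is_success;
print $res->content;   # whatever XML the server sends back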

Building the What-Sucks-O-Meter

This discussion does have something to do with Web Services, but before making the connection, I am going to describe the What-Sucks-O-Meter. In TPJ #19, I suggested a (not foolproof) way to determine, within a sentence, the object of verbs like "rocks" and "sucks" using Lingua::LinkParser and some creative regexes. This code is contained and commented in the program that accompanies this article, but I won't elaborate on it here.

The plan is this: to create a program that can gather sentences from the Web pertaining to what items are said to "suck" or "rock", and tally these results for graphing. When the object of an opinion is known, we can determine how many references exist for that object -- simply search for the name followed by "rocks" or "sucks". Since our interest is a bit more general, we will need to search using only the terms "rocks" and "sucks", as well as their plural forms "rock" and "suck". The results of those searches will then be used to retrieve the Web pages containing the terms. Next, the sentence in question will be parsed to determine the object of the verbs, if in fact they are verbs and have objects (as opposed to "He studies rocks"). The number of times each term is used in context can then be counted.

Of course, getting a meaningful number of results this way could get expensive very quickly. Additionally, most search engines do not actually allow you to view all the results for a given search. Try searching for a term like "rocks", and you will find that the claimed total number of results is much larger than the number you are allowed to view. I'm told that this is to save disk space, but I can't help wondering whether the numbers are inflated. If they aren't, the engine would at least need to store each URL to avoid duplicate counts, and could show the link (without a content excerpt). Then again, not many people will look for the 10,000th result of a search. In any case, a better option than retrieving and parsing every referenced document is to take an initial batch of results -- say, the first 1000 -- and retrieve and parse just those documents to build a topic list. Once we have the list, we can submit queries for each topic in context:

"PCs suck"
"PC sucks"
"PCs rock"
"PC rocks"

This will give us the total number of results for each phrase, providing more data to work with while saving lots of expensive processing.
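
Generating those context phrases from a topic is only a few lines. Here is a naive sketch; it just adds or strips a trailing "s", which will mangle topics that don't pluralize that way:

# Build the four context queries for one topic (naive plural handling).
my $topic = 'PC';
my @queries;
foreach my $verb ('sucks', 'rocks') {
  (my $plural_verb = $verb) =~ s/s$//;           # "sucks" -> "suck"
  push @queries, qq("$topic $verb");             # "PC sucks"
  push @queries, qq("${topic}s $plural_verb");   # "PCs suck"
}
print "$_\n" for @queries;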

SOAPy Googleness

To retrieve the initial list of URLs, we will use the SOAP interface provided by Google. SOAP queries must be made using the Google API, with a registration key; both are freely available from:

http://www.google.com/apis/
but will limit you to 1000 requests per day. The API package includes Java and .NET examples, as well as a WSDL file that can be used with SOAP::Lite. WSDL stands for Web Services Description Language; it takes the form of an XML document describing a Web Service: its types, functions, and so on. For example, the GoogleSearch.wsdl file includes the following section, which describes the SOAP operation "doGoogleSearch". For Perl and most other languages, this translates to a method called "doGoogleSearch":

<operation name="doGoogleSearch">
  <input message="typens:doGoogleSearch"/>
  <output message="typens:doGoogleSearchResponse"/>
</operation>
The input message defines the arguments the method will take:

<message name="doGoogleSearch">
  <part name="key"            type="xsd:string"/>
  <part name="q"              type="xsd:string"/>
  <part name="start"          type="xsd:int"/>
  <part name="maxResults"     type="xsd:int"/>
  ...
</message>
WSDL is part of what really makes SOAP appealing to me. With it, you can describe all of the functionality that a service provides and distribute the file to developers using (in theory) any language with a SOAP interface. (WSDL is not unlike CORBA IDL, but is much more verbose. Surprised?) From Perl, calling this and the many other functions provided by the Google API is trivial using SOAP::Lite:

use SOAP::Lite;
my $googleSearch = SOAP::Lite->service("file:GoogleSearch.wsdl");
Assuming the WSDL file is in the same directory as this script, this is all the code needed to build a distributed Google API object. $googleSearch now provides the methods described in the WSDL file (and in Google's large API reference document) for searching, retrieving copies of documents cached by Google, and sending spelling correction requests. For example, to retrieve the first 10 results of a search for "sucks":

my $key = '000000000000000000000000';  # key from api.google.com
my $query = "sucks";
my $results = $googleSearch->doGoogleSearch($key, $query, 0, 10,
    "false", "", "false", "", "latin1", "latin1");

foreach my $result (@{$results->{'resultElements'}}) {
  print "$result->{title} ($result->{URL})\n";
}
This outputs:

School <b>Sucks</b> (http://www.schoolsucks.com)
AOL Watch (http://www.aolsucks.com)
Operating System <b>Sucks</b>-Rules-O-Meter (http://srom.zgp.org)
PlanetSocks.com (http://www.survivorsucks.com)
Fencing <b>Sucks</b> (http://www.fencingsucks.com)
Already we can see that these searches will yield useful information for our purpose. In some cases, the content excerpt (in $result->{snippet}) or even the title ($result->{title}) returned by the API will contain enough information for our list. However, the most useful information will be found in the content of the referenced Web page.
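
Where the title or snippet alone gives the game away, a rough scan of those fields can seed the topic list before we fetch anything. This sketch reuses the $results structure from the search above and just strips Google's <b> highlighting before matching:

# Pull candidate topics straight out of the titles and snippets.
my $verb = 'sucks';
foreach my $result (@{ $results->{'resultElements'} }) {
  for my $field ($result->{title}, $result->{snippet}) {
    next unless defined $field;
    (my $text = $field) =~ s/<[^>]+>//g;            # drop <b>...</b> markup
    if ($text =~ /(\w[\w\s']*?)\s+$verb\b/i) {
      print "candidate: $1\n";
    }
  }
}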

Retrieving the Pages

As Google has cached copies of the pages it indexes, we could just use SOAP to get the entire contents:

$page = $googleSearch->doGetCachedPage($key,$url);
However, given our 1000-per-day query limit, we would do better to grab the source from the original site. Luckily, this isn't much more difficult with LWP::Simple, as was shown earlier; once we have the page content, we will need to pull out the relevant sentence and parse it.
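
For a single result URL, the fetch itself is only a few lines. The URL below is one from the earlier output; the real program loops over the saved list:

use LWP::Simple;

# Grab the live page instead of spending an API call on the cached copy.
my $url  = 'http://www.schoolsucks.com';   # from the saved URL list
my $page = get($url);
if (defined $page) {
  print length($page), " bytes retrieved from $url\n";
} else {
  warn "couldn't fetch $url\n";
}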

Extracting Topics

Since search engines weight matches in the URL itself heavily, many of the first URLs returned contain the keyword right in the hostname or path. We can write these subjects out to a filehandle with a crude regex:

$verb = "sucks";
if (/([^\.\/]*)$verb([^\.\/]*)/i) {   # $_ holds one result URL
  print "Subject: $1\n";
  print SUBJECTS "$1\n";              # SUBJECTS: output file for the topic list
}
We can then retrieve the content with an LWP::UserAgent request:

my $content = $ua->get($_)->content;   # $ua is an LWP::UserAgent; $_ is the URL
With the page's contents retrieved, Lingua::EN::Sentence provides a clean interface for separating the sentences within the text, once we have removed the HTML elements. There are various and sundry ways of doing this; I find HTML::FormatText does nicely (the $formatter object below):

$content = $formatter->format(HTML::TreeBuilder->new->parse($content));
my $sentences = get_sentences($content);
Having isolated the sentences, we can iterate through them to find those that contain our "rock" or "suck" keywords. After finding them, we will parse the sentence with Lingua::LinkParser. To simplify the interface for doing this, I have contained the bulk of the code from TPJ #19 in a new class, Lingua::LinkParser::Simple (v1.06), with a single method, extract_subject(). This method takes a sentence and a verb as arguments, and tries to return the noun or noun phrase that is the object of said verb:

my $subject = extract_subject($sentence, $verb);
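Putting those pieces together, a sketch of the whole per-page pass might look like this, using one of the URLs from the earlier results (and assuming extract_subject() can be imported from Lingua::LinkParser::Simple, as the one-line call above implies):

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;
use Lingua::EN::Sentence qw(get_sentences);
use Lingua::LinkParser::Simple qw(extract_subject);

my $verb      = 'sucks';
my $formatter = HTML::FormatText->new;

# Fetch one page from our URL list and strip it down to plain text.
my $content = get('http://www.schoolsucks.com');
die "couldn't fetch page\n" unless defined $content;
my $text = $formatter->format(HTML::TreeBuilder->new->parse($content));

# Split into sentences and keep only those using our keyword.
foreach my $sentence (@{ get_sentences($text) }) {
  next unless $sentence =~ /\b$verb\b/i;
  my $subject = extract_subject($sentence, $verb);
  print "$subject\n" if defined $subject;
}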
In the final program, each of the above steps -- listing URLs, retrieving the content, and parsing out the subjects -- saves its data to a plaintext file. This allows steps to be skipped if their files already exist, or rerun if the files are deleted.
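
The caching itself is nothing fancy: a step runs only when its output file is missing, along these lines (the file name here is illustrative, not necessarily what the real script uses):

# Each step runs only if its output file is missing.
my $verb = 'sucks';
my $subjects_file = "$verb-subjects.txt";   # illustrative file name

unless (-e $subjects_file) {
  open my $out, '>', $subjects_file or die "can't write $subjects_file: $!";
  # ... list URLs, fetch pages, and extract subjects, printing one per line ...
  close $out;
}

open my $in, '<', $subjects_file or die "can't read $subjects_file: $!";
chomp(my @subjects = <$in>);
close $in;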

For the curious, the lists at this point include:

sucks            rocks
java             atlanta
O'Reilly (bill)  technology
monsanto         blue
marriage         shatner
extinction       snowboarding

Using these terms to then query in the context of "sucks" or "rocks" is obviously very prone to error. Consider that a query for "o'reilly rocks" does little to disambiguate the subject. It's likely that most "rocks" opinions expressed about O'Reilly concern not only the political show host, but also the book publisher. Conversely, "o'reilly sucks" will probably return opinions that concern mostly the TV personality, since I only know a few people who carry strong negative opinions about technical book publishers. And I walk in geek circles. The method is far from ideal, and I expect you will overlook that.

Querying for Totals

The final step to gather numbers is to make subsequent queries for each term in the subject lists, putting them in the context of the verbs. The total results are stored as values in DB files, with subject terms as the keys:

use DB_File;
use Fcntl;

tie my %results, 'DB_File', "$verb-results.db", O_CREAT|O_RDWR, 0666, $DB_HASH;
while (my $subject = <SUBJECTS>) {
  chomp $subject;
  my $results = $googleSearch->doGoogleSearch($googleKey, qq("$subject $verb"),
      0, 1, "false", "", "false", "", "latin1", "latin1");
  $results{lc($subject)} = $results->{'estimatedTotalResultsCount'};
}
The end result is several DB files, containing the total result counts for each subject. These can then be sorted and printed.

All Opinions Are Not Created Equal

All good statistical processing incorporates some adjustments to reflect the reliability of the data. I'm not pretending that we are doing such processing here, but I couldn't help wondering about the reliability of these results. How could I bring some "smoothing" to the results? By addressing some personal biases, of course.

The "exclamatory adjustment" used in the What-Sucks-O-Meter causes exclamation points at the end of a sentence to cumulatively count against the reliability of a stated opinion. Also, a successful grammatical parse of the sentence should counts toward the reliability. (Subjects can be extracted even from sentences that don't have "clean" link parses.) I considered using proper spelling, use of capitals, and the lexical familiarity of words to further influence the numbers. These will be implemented in time.

To see the effect of these adjustments, every matching document must be retrieved so that its text can be processed. Since we've already identified this as an overly expensive process, it makes sense to do it for only a few key subjects:

foreach my $subject (@subjects) {
  my $content = get($url);   # $url: a page that mentions $subject
  if ($content) {
    # ... split into sentences and extract the subject, as above ...
    if ($sentence =~ /(!+)\s*$/) {
      $adjustment{$subject} -= length($1);
    }
  }
}
I am still working to implement this feature, as it is difficult to get search results for queries that include exclamation points.

Output

A simple sort on the tied hashes will let us see the list in order. I find the ordering most informative if the sucks results are subtracted from the rocks results and the list is sorted in order of suckiness:

tie my %smatches, 'DB_File', "sucks-results.db", O_CREAT|O_RDWR, 0666, $DB_HASH;
tie my %rmatches, 'DB_File', "rocks-results.db", O_CREAT|O_RDWR, 0666, $DB_HASH;

my %indexed;
foreach my $key (keys %smatches) {
  $indexed{$key} = $rmatches{$key} - $smatches{$key};
}
foreach my $key (sort { $indexed{$a} <=> $indexed{$b} } keys %indexed) {
  next if ($key =~ /^\s*$/);
  print FILE "  <tr>\n";
  print FILE "    <td>$key</td><td>$smatches{$key}</td>",
             "<td>$rmatches{$key}</td><td>$indexed{$key}</td>\n";
  print FILE "  </tr>\n";
}
Here's an abbreviated example of the program's output:

subject       sucks  rocks     sum
life          43000   1210  -41790
school        13900    871  -13029
microsoft     13800    994  -12806
windows       10700    195  -10505
aol            9030     53   -8977
windows 95     8880      5   -8875
work           8780    434   -8346

The regularly updated version can be seen at:

http://whatsucks.brians.org
with the aforementioned adjustments in progress.

Conclusion

One thing I was curious about was the performance of the Google API, because I had heard some terrible things about SOAP benchmarks. On a high-speed connection, I experienced performance equal to any I had seen with LWP. And because the API reports the estimated total number of results directly (rather than making you iterate through pages of results with LWP), I expected it to prove faster for large queries. Unfortunately, the API only allows 10 results to be retrieved at a time. This really detracts from the otherwise high value offered by Google's service.
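
If you do need more than the first 10 actual result elements, the only recourse is to walk the start offset in increments of 10, burning a request each time. A sketch, reusing $googleSearch, $key, and $query from the earlier example:

# Collect the first 100 result URLs by paging, 10 per request.
# Every iteration counts against the 1000-query daily limit.
my @urls;
for (my $start = 0; $start < 100; $start += 10) {
  my $results = $googleSearch->doGoogleSearch($key, $query, $start, 10,
      "false", "", "false", "", "latin1", "latin1");
  my $elements = $results->{'resultElements'};
  last unless $elements && @$elements;   # no more results to page through
  push @urls, map { $_->{URL} } @$elements;
}
print scalar(@urls), " URLs collected\n";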

I should note that Google's SOAP interface uses what amounts to a request-response model, where most methods return pure data structures. I think this is entirely appropriate for the API. It reduces quite a bit of overhead that would otherwise be involved if all returned results were treated as objects, with accessor methods that would pass the object over the wire repeatedly. I've heard some object purists complain about this data-oriented implementation, and I heartily disagree.

So, where does SOAP come out on the list? "Soap", whether of the protocol or the detergent variety, is well rated, with a "sucks" count of 95 and a "rocks" count of 184, for a sum of 89. My experience thus far agrees with this total. Unlike some in the open source camp, I think SOAP has a bright future, and will be instrumental in bringing a standard to the Web's evolving transaction layer.

What about Perl? 329 for, 312 against, for 17 positive votes. My exclusion of "rules" and "stinks" as synonyms for our verbs may account for a result that is different from the Programming-Languages-Sucks-O-Meter.

Anyway, the concepts here can be used in a myriad of applications. Google's API will likely make things easier for many of us, whether dealing with natural language processing, information research, or content inclusion for Web sites.

The full script can be downloaded from:

http://whatsucks.brians.org/whatsucks.pl
Dan Brian is a composer, linguist, gamer, mentalist, and father of two. By day he masquerades at NTT/Verio as a software engineer.