Monitoring Usenet News

Bill Davidsen

Running Usenet news on a large server has never been easy, because the nature of Usenet is to expand to fill all available resources. In the days of dial-up modems when I started running news, the Telebit Trailblazer modem came out with 18-Kbps connections. We thought the bandwidth problems were over, because we could handle several MB a day -- now we handle several MB a second. To run news on the least hardware possible, I have written some tools that show me what the systems are doing.

There are three major Usenet players in the UNIX world: INN, the oldest and probably most widely used; Diablo, which evolved from a simple multi-threaded transit server into scalable software for both readers and transit; and Cyclone, the only commercial product noted here, originally from Highwinds Software. Each of these includes some reporting tools that can generate daily reports, graphs, and other high-level summaries of what the system is doing.

However, it's sometimes nice to have a report on system activity that falls between a daily summary and the second-to-second detail of the raw log files. Additionally, these tools work well for generating single-page snapshots that fit on a slide, allowing you to keep management informed.

INN Report Generation Tools

One thing to watch is how much news you get over the course of a day. This small Perl program, feedwatch.pl (Listing 1), produces a table of feed characteristics versus time of day. It can be used to monitor the data flow of the system at regular intervals -- by default, every minute, on the minute. The available options are displayed by the -? option, as shown in Listing 2. (All listings for this article are available from: www.sysadminmag.com.)

Feedwatch, by default, produces a tabular list of output values (Listing 3) from the "news" log file (by default in the /usr/local/news/log directory), which can be fed into sort, gnuplot, or additional analysis programs. I use this output on a minute-to-minute basis to observe the impact of changes to my incoming feeds, network changes, or the tuning of the machine. The output is from a transit server (sort of a news router), which gets most of a full feed, and, at the time I grabbed this snapshot, was accepting about 2.5 MB/sec.

The example in Listing 3 has the headers enabled (-H) and shown every 50 lines (-L50), with output generated every 3600 sec (-i3600) to show a full day of output. The default is headers every 10 samples and output every minute, on the minute. Without the headers, the output may easily be sorted on any of the fields, or even used to produce plots via gnuplot when the plots produced by the tools included with INN don't have exactly what you need.
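If you want a feel for how this kind of monitor works, the following is a minimal sketch, not the Listing 1 program. It follows the news log tail-fashion and assumes a simplified line layout of timestamp, acceptance flag (+, j, or -), feeding host, message-id, and a byte count (the byte count only appears if your inn.conf is set to log article sizes; otherwise drop that part). Check the layout of your own log before trusting its numbers.

#!/usr/bin/perl
# Sketch only: follow the INN "news" log and print per-interval counts of
# accepted/junked/rejected articles and the accepted data rate.
use strict;
use warnings;

my $interval = 60;                              # seconds per sample
my $log = shift || '/usr/local/news/log/news';

open my $fh, '<', $log or die "cannot open $log: $!\n";
seek $fh, 0, 2;                                 # start at the end of the file

my %count;
my $bytes = 0;
my $next = time - (time % $interval) + $interval;   # align to the interval

while (1) {
    while (<$fh>) {
        # assumed layout: "Mon DD hh:mm:ss.mmm FLAG feedhost <msg-id> [size] ..."
        my ($flag, $size) = (split ' ')[3, 6];
        next unless defined $flag && $flag =~ /^[-+j]$/;
        $count{$flag}++;
        $bytes += $size if $flag eq '+' && defined $size && $size =~ /^\d+$/;
    }
    if (time >= $next) {
        printf "%s  acc %6d  junk %5d  rej %5d  %8.1f KB/sec\n",
            scalar localtime, $count{'+'} || 0, $count{'j'} || 0,
            $count{'-'} || 0, $bytes / 1024 / $interval;
        %count = ();
        $bytes = 0;
        $next += $interval;
    }
    seek $fh, 0, 1;                             # clear EOF so new lines are seen
    sleep 1;
}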

Using the Output Data

It's important to note that this output reflects articles actually transferred over the network, not everything offered by feeding sites, because articles the server has already seen are rejected. Everything you see as offered, junked, or rejected is taking your bandwidth.

Some useful fields are the article offer rate (how many articles your feeds transferred to you), the accept rate, and the data rate for incoming articles. The most commonly checked value is probably the incoming data rate (art KB/sec). This number changes quickly in response to network or feeding-host changes. I refer to it a lot when I'm tuning a server that is having a hard time keeping up with a large feed.

The difference between the offer rate and the accept rate shows how many articles were transferred over the network and then rejected because of unwanted newsgroups or other conditions found by the filter (if you use one). Filtering on this server is minimal, but certain groups are dropped for various reasons. If many articles are being dropped, you can be nice to your network by working with the feeding sites to tune the feed so they simply don't send articles you don't want.

The article size is a useful indicator of the content of the incoming feed. Obviously, if you carry binary groups, you will see a larger average article size than a text-only feed. A sudden change in the article size indicates a feed change -- hopefully, an expected one.

Article size changes can also be a clue to the nature of any problem when the server is not keeping up with the incoming feed, which shows up as a drop in the incoming byte rate or as a backlog on the feeding site. If the average article size is smaller than normal, there may be a network bandwidth problem that lets small articles through while delaying binaries. A drop in article size suggests that large articles are being delayed, so I usually check the disk and network performance first.

If the article size increases, it suggests that the time to check whether an article has previously been seen has grown. Of course, article size also changes when nothing is wrong, so these values are clues rather than a smoking gun for problem identification. This information can be combined with the other tools described here to help identify problems.

The article information (counts and accepts) was added to provide data requested at one time by management. I seldom find those numbers useful for identifying problems, but they may be useful for planning.

The default sample interval is one minute, on the minute. This sample is large enough to smooth out most normal fluctuations, but small enough to still be useful after getting a cup of coffee (though not after going to lunch). By forcing the samples onto the minute, data from multiple servers can be watched in several windows, assuming your clocks are properly synchronized.
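The minute alignment itself is just a matter of sleeping to the next boundary before each sample. A minimal sketch of the idea (this is not the feedwatch.pl code) looks like this:

#!/usr/bin/perl
# Sketch only: wake at the top of every minute so that several servers,
# watched in separate windows, report at the same instants.
use strict;
use warnings;

my $interval = 60;                        # seconds between samples
while (1) {
    sleep($interval - time % $interval);  # sleep to the next boundary
    print scalar localtime, "\n";         # placeholder for the real sampling work
}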

Where is innd Spending Its Time?

The daily report generated by INN's default daily cleanup shows useful data about where the server spends clock time. Unfortunately, it gives only the high, low, and average times for all of the operations of the innd daemon, while the current (as of February 20, 2001) development version gives many more categories of time resolution, perhaps more than are useful without some study of their meaning. The timelog.pl program scans the news.notice log and processes the history and daemon time data to produce a report (Listing 4). This report contains some of the more useful data, given every time the system writes a time record to the log file. Because that time interval can be set in inn.conf, the granularity of the samples can be controlled.

Note that the headers were provided by hand; the program generates only the data lines. This is a work in progress, so any built-in labels would be shooting at a moving target.
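For the curious, the core of such a scan is small. Here is a rough sketch, not the Listing 4 program; it assumes the innd timer record looks like "ME time TOTAL name MS(COUNT) name MS(COUNT) ...", and the timer names it picks out (idle, artwrite, hiswrite, hishave) vary with the INN version, so adjust them to match your own news.notice.

#!/usr/bin/perl
# Sketch only: pull innd timer records out of news.notice and print a few
# derived values (ms per article and percentage of the sample interval).
use strict;
use warnings;

my $log = shift || '/usr/local/news/log/news.notice';
open my $fh, '<', $log or die "cannot open $log: $!\n";

while (<$fh>) {
    next unless /ME time (\d+) (.*)/;
    my ($sample_ms, $rest) = ($1, $2);
    next unless $sample_ms;
    my (%ms, %n);
    while ($rest =~ /(\w+) (\d+)\((\d+)\)/g) {
        ($ms{$1}, $n{$1}) = ($2, $3);
    }
    for my $t (qw(artwrite hiswrite hishave)) {
        next unless defined $ms{$t};
        my $per = $n{$t} ? $ms{$t} / $n{$t} : 0;
        printf "%-8s %7.2f ms/art %6.2f%%  ",
            $t, $per, 100 * $ms{$t} / $sample_ms;
    }
    printf "idle %6.2f%%\n",
        defined $ms{idle} ? 100 * $ms{idle} / $sample_ms : 0;
}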

Using the timelog.pl Output

I usually try to maximize the idle time parameter with servers taking a large feed. As the idle time gets smaller, the probability of backlog and lost articles becomes higher. The server in the example is not bogged down, but I would be happier with more idle time. The trick is to see where the time really is going, as shown by the other data items. If the daily report shows that other sections of the daemon are taking excessive time, other fields may need to be added or substituted in the sample. The overview write on reader machines, and the Perl filter used to protect against unwanted articles, are both areas that can consume more time than desired.

The two common bottlenecks are the article write (the article text) and the history write (the time to write the record that indexes the article). They appear here both as time per article and as the percentage of the sample time used for the operation. Remember, this is clock time, not CPU time.

Note that on this machine both write operations have a fairly large time in ms/article. This reflects less-than-optimal performance of the disk subsystem attached to this server. I am actively tuning several of these, looking for better overall performance through network and disk tuning.

The next two data items reflect the history "have" time, or time to look up an article being offered to see whether the server has already seen it. On a transit server, a lot of feeds come in, resulting in many duplicates as articles are offered from several feeds. The time for lookups is pretty good on this server, rising to a few ms only during the daily expire process.

To reduce disk I/O, parts of the history file information are cached in memory. Obviously, it is desirable to have the information in cache rather than reading it from disk, so the percentage of misses in the cache can indicate a need for tuning. In this case, the misses are higher than I would like, but doubling the cache size didn't help. This is because the machine has some really slow feeds, which offer articles first seen many seconds ago. The good "have" time indicates that even when the information is not in cache, the operating system (Linux in this case) has kept the required information in disk buffers, so physical disk operations are not needed. While I would like to get misses down to less than 1%, there is obviously not much to be gained on this system, because the time spent in lookups is essentially zero.

By using the two monitoring tools for short-term performance and reading the daily reports for overall performance, it's possible to find problems, improve performance, and, most importantly, document that you are using the machine well and not using (or asking for) more machine than you need. DBAs can be helpful here by telling management how much they would spend to handle adding and deleting about a million records a day, totaling about 200 GB.

Twister Report Generation Tools

While INN is open source software, Twister, Cyclone, and Typhoon are typical commercial software. They come with manuals and technical support, but no source, and some of the most useful reports generated as these programs run have output lines over 200 characters wide. Because you will not need to modify code, you will use normal UNIX tools to monitor disk, network, and CPU use, rather than logs detailing the internal operation of the program. You still have to monitor how the users utilize the system, however, and I have written some small tools to help me do that.

Looking at the Local Posts

One thing I occasionally must check is what and how much the users post on a server, or who actually posted a given message on a server (if in fact any of the users did). The source of this information is the "localposts" file, which has a great deal of information on very wide lines. The localpost.pl program (Listing 5) presents the information in a less demanding format, which will fit in a 100-column window or on a typical small printer.

The information I extract is the post time, size of the post, message-id, the poster's identity, and the groups to which it was posted. In case I have to work from a very limited terminal, there is an option (-m) to present the information on multiple lines, each less than 80 columns wide. An example of the narrow output is shown in Listing 6.
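As an illustration of the two output styles (and nothing more), the sketch below formats a post either as one line sized for a 100-column window or, in the style of the -m option, as several lines under 80 columns. The parsing of the localposts line itself depends on the exact field layout of the stats files, so it is left out, and the sample values here are made up; this is not the Listing 5 program.

#!/usr/bin/perl
# Sketch only: the single-line and multi-line (-m style) output formats,
# fed with fields already extracted from a localposts record.
use strict;
use warnings;

sub show_post {
    my ($narrow, $time, $size, $msgid, $poster, $groups) = @_;
    if ($narrow) {
        # -m style: several lines, each under 80 columns
        print "$time  $size bytes  $poster\n";
        print "  id:     $msgid\n";
        print "  groups: $groups\n";
    } else {
        # default: one line per post, sized for a 100-column window
        printf "%-19s %8d %-35.35s %-15.15s %s\n",
            $time, $size, $msgid, $poster, $groups;
    }
}

# example call with made-up values
show_post(1, '2001-02-20 12:34:56', 2345,
    '<example.1@news.example.com>', 'user@example.com',
    'comp.os.linux.misc');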

Looking at the Details

At the close of each reader socket, details of the connection are written on a single line of the stats_log.detailed file, giving the number of overview requests, articles read, groups visited, posts, and the total bytes involved in each type of request. The detail.pl program (Listing 7) reformats this log to make the most useful details available on a much shorter line. Frequently, the detail is suppressed (-S) and only a summary report (Listing 8) is produced. A summary of activity by each user may also be produced, which helps identify very active users. The options are shown in Listing 9.

Because each line includes a user class (feed type) field, there is an option (-G) to select only records from that group. There is an option (-U) to generate a table of user connections open vs. time, but it proved to be no more useful or accurate than similar information on an hourly basis in stats_log.readers.
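The per-user summary is the part most people end up rewriting for themselves, so here is a rough sketch of that idea; it is not the Listing 7 program. The column positions it reads (user, articles read, posts, bytes) are placeholders, since the real stats_log.detailed line is much wider; substitute the correct fields for your logs.

#!/usr/bin/perl
# Sketch only: accumulate per-user totals and list the busiest readers first.
use strict;
use warnings;

my %sum;
while (<>) {
    my @f = split;
    # assumed positions: user, articles read, posts, bytes
    my ($user, $arts, $posts, $bytes) = @f[1, 4, 5, 6];
    next unless defined $bytes && $bytes =~ /^\d+$/;
    $sum{$user}[0] += $arts;
    $sum{$user}[1] += $posts;
    $sum{$user}[2] += $bytes;
}
printf "%-25s %10s %6s %12s\n", 'user', 'articles', 'posts', 'bytes';
for my $u (sort { $sum{$b}[2] <=> $sum{$a}[2] } keys %sum) {
    printf "%-25s %10d %6d %12d\n", $u, @{ $sum{$u} };
}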

Looking at the Reader Behavior

The information in stats_log.readers is on lines about 170 columns wide. It includes start and stop times as fractional-second UNIX times (seconds since January 1, 1970 GMT), along with all the information on the activity of the socket, some details of which are seldom useful. This hourly summary of reader behavior is often helpful in understanding what the server is being asked to do.

The readers.pl program (Listing 10) reformats the most commonly wanted information to a narrower (120 column) format with headers, which fits in a window or on a printer page. Because the output is given as a single line per hour for each feed (connection class), the whole output can often be easily browsed. Listing 3 shows what information is retained in the more compact version.
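The only mildly fiddly part of that reformatting is the time conversion, since the start and stop fields are fractional-second UNIX times. A minimal sketch of just that step (with assumed field positions, not the real Listing 10 layout) is:

#!/usr/bin/perl
# Sketch only: convert the leading UNIX-time fields to readable local time
# and pass the rest of the line through.
use strict;
use warnings;
use POSIX qw(strftime);

while (<>) {
    my @f = split;
    next unless @f >= 2 && $f[0] =~ /^\d/ && $f[1] =~ /^\d/;
    my ($start, $stop) = @f[0, 1];           # assumed positions
    printf "%s - %s  %s\n",
        strftime('%H:%M:%S', localtime $start),
        strftime('%H:%M:%S', localtime $stop),
        join(' ', @f[2 .. $#f]);
}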

Using the Tools

Unless you run a large and assorted bunch of news machines, chances are that you will not use all of the tools described here. However, one set or the other should be useful, and they are written so that they may be easily modified to generate the output most useful to you.

Bill Davidsen was a system programmer at GE's Corporate Research and Development Center, consulting part time through TMR Associates, which became full time when he left GE. He ran Usenet news at GE and is now project leader for Usenet news at a national ISP.