Monitoring Usenet News
Bill Davidsen
Running Usenet news on a large server has never been easy, because
the nature of Usenet is to expand to fill all available resources.
When I started running news in the days of dial-up modems, the
Telebit Trailblazer had just appeared with its 18-Kbps connections.
We thought our bandwidth problems were over, because we could handle
several MB a day -- now we handle several MB a second. To help run
news on the least hardware possible, I have written some tools that
help me see what the systems are doing.
There are three major Usenet players in the UNIX world: INN, the
oldest and probably the most widely used; Diablo, which evolved from
a simple multi-threaded transit server into scalable software for
both readers and transit; and Cyclone, the only commercial product
noted here, originally from Highwinds Software. Each of these includes
reporting tools that can generate daily reports, graphs, and other
high-level summaries of what the system is doing.
However, it's sometimes nice to have a report on system activity
that falls between a daily summary and the second-to-second detail
of the raw log files. Additionally, these tools work well for generating
single-page snapshots that fit on a slide, allowing you to keep
management informed.
INN Report Generation Tools
One thing to watch is how much news you get over the course of
a day. This small Perl program, feedwatch.pl (Listing 1),
produces a table of feed characteristics versus time of day. It
can be used to monitor the data flow of the system at regular intervals
-- by default, every minute, on the minute. The available options
are displayed by the -? option, as shown in Listing 2. (All
listings for this article are available from: www.sysadminmag.com.)
Feedwatch, by default, produces a tabular list of output values
(Listing 3) from the "news" log file (by default in the
/usr/local/news/log directory), which can be fed into sort,
gnuplot, or additional analysis programs. I use this output
on a minute-to-minute basis to observe the impact of changes to
my incoming feeds, network changes, or the tuning of the machine.
The output is from a transit server (sort of a news router), which
gets most of a full feed, and, at the time I grabbed this snapshot,
was accepting about 2.5 MB/sec.
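The core of such a tool is a simple pass over the log. The sketch
below is not Listing 1; it is a minimal illustration that assumes
the usual INN news log layout of a timestamp, a one-character status
code (+ accepted, j junked, - rejected), the feeding host, and a
message-ID -- verify the layout against your own log before trusting
the counts:

    #!/usr/bin/perl -w
    # Sketch only: count articles per minute from an INN "news" log.
    # Assumed line shape:  Mon DD hh:mm:ss.mmm C feeding-host <msg-id> ...
    # where C is + (accepted), j (junked), or - (rejected).
    use strict;
    my %count;
    while (<>) {
        next unless /^\w+\s+\d+\s+(\d\d:\d\d):\S+\s+([-+j])\s/;
        $count{$1}{$2}++;
    }
    for my $min (sort keys %count) {
        my ($acc, $jnk, $rej) = map { $count{$min}{$_} || 0 } ('+', 'j', '-');
        printf "%s  offered %6d  accepted %6d  junked %6d  rejected %6d\n",
            $min, $acc + $jnk + $rej, $acc, $jnk, $rej;
    }

feedwatch.pl itself follows the live log rather than reading it after
the fact, and adds the byte and rate columns discussed below.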
The example in Listing 3 has the headers enabled (-H) and
shown every 50 lines (-L50), with output generated every 3600 sec
(-i3600) to show a full day of output. The default is headers every
10 samples and output every minute, on the minute. Without the headers,
the output may easily be sorted on any of the fields, or even used
to produce plots via gnuplot when the plots produced by the
tools included with INN don't have exactly what you need.
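For example, assuming the art KB/sec figure lands in column 5 of
the headerless output (check the column numbers on your own output),
finding the busiest minutes and producing a quick plot are both
one-liners:

    sort -n -k 5,5 feedwatch.out | tail
    echo 'plot "feedwatch.out" using 0:5 with lines' | gnuplot -persist

The using 0:5 form plots column 5 against the sample number, which
avoids teaching gnuplot the time format in column 1.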
Using the Output Data
It's important to note that this output reflects articles
actually transferred over the network, not everything offered by
feeding sites, because articles already seen are refused without
being transferred. Everything you see as offered, junked, or rejected
is taking your bandwidth.
Some useful fields are the article offer rate (how many articles
were transferred from your feeds), the accept rate, and the data
rate for incoming articles. The most commonly checked value is probably
the incoming data rate (art KB/sec). This number changes quickly
in response to network or feeding-host changes. I refer to it a
lot when I'm tuning a server that is having a hard time keeping
up with a large feed.
The difference between the offer rate and the accept rate shows
how many articles were transferred over the network and then rejected
because of unwanted newsgroups or other conditions found by the
filter (if you use one). Filtering on this server is minimal, but
certain groups are dropped for various reasons. If many articles
are being dropped, you can be kind to your network by working with
the feeding sites to tune the feed so they simply don't send the
articles you don't want.
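As a hypothetical example of the arithmetic: if your feeds deliver
600 articles a minute and you accept 450, then roughly a quarter
of the bytes crossing your wire carry articles you immediately throw
away -- bandwidth that feed tuning can recover.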
The article size is a useful indicator of the content of the incoming
feed. Obviously, if you carry binary groups, you will see a larger
average article size than a text-only feed. A sudden change in the
article size indicates a feed change -- hopefully, an expected
one.
Article size changes can be a clue to the nature of any problems
in the server if it is not keeping up with the incoming feed, which
shows up as a drop in the incoming byte rate or a backlog on the
feeding site. If the average article size is smaller than normal,
there may be a network bandwidth problem, allowing small articles
to pass and delaying binaries. A drop in article size suggests that
large articles are being delayed, and I usually check the disk and
network performance first.
If the article size increases, it suggests that the time to check
whether an article has already been seen has grown. Of course,
because article size changes even when nothing is wrong, these are
clues rather than "the smoking gun" for problem identification. This
information can be combined with the other tools described here to
help identify problems.
The article information (counts and accepts) was added to provide
data requested at one time by management. I seldom find those numbers
useful for identifying problems, but they may be useful for planning.
The default sample interval is one minute, on the minute. This
sample is large enough to smooth many normal fluctuations, but small
enough that something new shows up by the time you get back with
a cup of coffee, rather than only after you return from lunch. By
forcing the samples onto the minute, data for multiple servers can
be watched in several windows, assuming your clocks are properly set.
Where Is innd Spending Its Time?
The daily report generated by INN's default daily cleanup shows
useful data about where clock time is spent by the server. Unfortunately,
it gives only the high, low, and average times for all of the operations
of the innd daemon. The current (as of February 20, 2001)
development version gives many more categories of time resolution,
perhaps more than are useful without some study of their meaning.
The timelog.pl program scans the news.notice log and
processes the history and daemon time data to produce a report (Listing
4). This report contains some of the more useful data given every
time the system writes a time record to the log file. Because the
time interval can be set in inn.conf, the granularity of
the samples can be controlled.
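Here too the parsing is simple. The sketch below is not Listing 4;
it assumes the innd timer lines in news.notice look roughly like
"ME time 300045 idle 123456(789) artwrite 45000(30000) ...", that
is, a total sample time in milliseconds followed by name, milliseconds,
and count for each operation. Check the lines your INN version
actually writes before using it:

    #!/usr/bin/perl -w
    # Sketch only: per-operation times from assumed innd timer lines:
    #   ... innd: ME time TOTAL name ms(count) name ms(count) ...
    use strict;
    while (<>) {
        next unless /\binnd: ME time (\d+)\s+(.*)/;
        my ($total, $rest) = ($1, $2);
        my (%ms, %n);
        while ($rest =~ /(\w+)\s+(\d+)\((\d+)\)/g) {
            ($ms{$1}, $n{$1}) = ($2, $3);
        }
        for my $op (qw(idle hishave hiswrite artwrite)) {
            next unless defined $ms{$op};
            printf "%-9s %8.2f ms/event %6.1f%%   ",
                $op, $n{$op} ? $ms{$op} / $n{$op} : 0,
                $total ? 100 * $ms{$op} / $total : 0;
        }
        print "\n";
    }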
Note that the headers in Listing 4 were added by hand; the program
generates only the data lines. This is a work in progress, so any
labels are shooting at a moving target.
Using the timelog.pl Output
I usually try to maximize the idle time figure on servers taking
a large feed. As the idle time gets smaller, the probability of
backlogs and lost articles becomes higher. The server in the example
is not bogged down, but I would be happier with more idle time.
The trick is to see where the time really is going, as shown by
the other data items. If the daily report shows that other sections
of the daemon are taking excessive time, other fields may need to
be added or substituted in the sample. The overview write on reader
machines, and the Perl filter used to protect against unwanted articles,
are both areas that can consume more time than desired.
The two common bottlenecks are the article write (storing the article
text) and the history write (recording the index entries for the
article). They appear here both as time per article and as the
percentage of the sample time used for the operation. Remember,
this is clock time, not CPU time.
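The arithmetic is worth making explicit, with hypothetical numbers:
if a 300-second (300,000-ms) timer sample logs artwrite as
45000(30000), that is 45,000 ms spent writing 30,000 articles --
1.5 ms per article, and 45000/300000, or 15%, of the clock time
in the sample.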
Note that on this machine both write operations have a fairly
large time in ms/article. This reflects less-than-optimal performance
of the disk subsystem attached to this server. I am actively tuning
several of these, looking for better overall performance through
network and disk tuning.
The next two data items reflect the history "have" time,
or time to look up an article being offered to see whether the server
has already seen it. On a transit server, a lot of feeds come in,
resulting in many duplicates as articles are offered from several
feeds. The time for lookups is pretty good on this server, rising
to a few ms only during the daily expire process.
To reduce disk I/O, parts of the history file information are
cached in memory. Obviously, it is desirable to have the information
in cache rather than read it from disk, so the percentage of cache
misses can indicate a need for tuning. In this case, the misses
are higher than I would like, but doubling the cache size didn't
help. This is because the machine has some really slow feeds, which
offer articles first seen many seconds ago. The good "have"
time indicates that even if the information is not in cache, the
operating system (Linux in this case) has saved the required information
in disk buffers, so physical disk operations are not needed. While
I would like to get misses down to <1%, it's obvious that
there is not much to be gained on this system because the time spent
in lookup is essentially zero.
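For reference, the knob involved on INN 2.3 is, to the best of my
knowledge, the hiscachesize parameter in inn.conf, given in kilobytes;
check the documentation for your version before trusting the name
or units:

    # inn.conf: in-core history cache size, in KB (verify for your version)
    hiscachesize: 512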
By using these two monitoring tools for short-term performance,
and reading the daily reports for overall performance, it's
possible to find problems, improve performance, and, most importantly,
document that you are using the machine well and not using (or
asking for) more machine than you need. DBAs can be helpful here,
by telling management what they would spend to handle adding and
deleting about a million records a day, totaling about 200 GB.
Twister Report Generation Tools
While INN is open source software, Twister, Cyclone, and Typhoon
are typical commercial software. They come with manuals and technical
support, but no source, and some of the most useful reports these
programs generate as they run are on output lines more than 200
characters wide. Because you will not be modifying code, you will
use the normal UNIX tools to monitor disk, network, and CPU use,
rather than logs detailing the internal operation of the program.
You still have to monitor how the users utilize the system, however,
and I have written some small tools to help me do that.
Looking at the Local Posts
One thing I occasionally must check is what, and how much, the
users post on a server, or who actually posted a given message there
(if in fact any of the users did). The source of this information
is the "localposts" file, which holds a great deal of information
on very wide lines. The localpost.pl program (Listing 5)
presents that information in a less demanding format, which fits
in a 100-column window or on a typical small printer.
The information I extract is the post time, size of the post,
message-id, the poster's identity, and the groups to which
it was posted. In case I have to work from a very limited terminal,
there is an option (-m) to present the information on multiple
lines, each less than 80 columns wide. An example of the narrow
output is shown in Listing 6.
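The wide-line format itself is not reproduced here, so treat the
field positions in the following sketch as placeholders; the technique
in localpost.pl is simply to split each line and reprint the
interesting fields in a narrow layout. This is an illustration,
not Listing 5:

    #!/usr/bin/perl -w
    # Sketch only.  The index values below are PLACEHOLDERS -- set them
    # to the actual column positions of the time, size, poster,
    # message-id, and groups fields in your localposts file.
    use strict;
    my ($TIME, $SIZE, $POSTER, $MSGID, $GROUPS) = (0, 1, 2, 3, 4);
    while (<>) {
        my @F = split;
        next unless @F > $GROUPS;
        # Two output lines per post, in the spirit of the -m option.
        printf "%-20.20s %8.8s %-49.49s\n  %-38.38s %s\n",
            $F[$TIME], $F[$SIZE], $F[$POSTER], $F[$MSGID], $F[$GROUPS];
    }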
Looking at the Details
At the close of each reader socket, details of the connection
are written on a single line of the stats_log.detailed file,
giving the number of overview requests, articles read, groups visited,
and posts, and the total bytes involved in each type of request.
The detail.pl program (Listing 7) reformats the log to make
the most useful details available on a much shorter line.
Frequently, the per-connection detail is suppressed (-S) and only
a summary report (Listing 8) is produced. A summary of activity by
each user may also be produced, to help identify very active users.
The options are shown in Listing 9.
Because each line includes a user class (feed type) field, there
is an option (-G) to select only records from that group.
There is an option (-U) to generate a table of user connections
open vs. time, but it proved to be no more useful or accurate than
similar information on an hourly basis in stats_log.readers.
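The per-user summary is a straightforward accumulation pass. This
sketch is not Listing 7, and the two column indices are placeholders
to be matched against your stats_log.detailed layout:

    #!/usr/bin/perl -w
    # Sketch only: total connections and bytes per user.
    # $USER and $BYTES are PLACEHOLDER column indices.
    use strict;
    my ($USER, $BYTES) = (1, 9);
    my (%conns, %bytes);
    while (<>) {
        my @F = split;
        next unless @F > $BYTES;
        $conns{$F[$USER]}++;
        $bytes{$F[$USER]} += $F[$BYTES];
    }
    # Heaviest users first.
    for my $u (sort { $bytes{$b} <=> $bytes{$a} } keys %bytes) {
        printf "%-24s %6d conns %14.0f bytes\n", $u, $conns{$u}, $bytes{$u};
    }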
Looking at the Reader Behavior
The information in stats_log.readers is on lines about
170 columns wide. It includes start and stop times as fractional-second
UNIX time (seconds since January 1, 1970 GMT), plus all the information
on the activity of the socket, some details of which are seldom
useful. This hourly summary of reader behavior is often helpful
in understanding what the server is being asked to do.
The readers.pl program (Listing 10) reformats the most
commonly wanted information to a narrower (120 column) format with
headers, which fits in a window or on a printer page. Because the
output is given as a single line per hour for each feed (connection
class), the whole output can often be easily browsed. Listing 3
shows what information is retained in the more compact version.
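Converting the fractional UNIX times into something readable takes
one line of standard Perl; for example:

    use POSIX qw(strftime);
    my $t = 982678437.25;   # a hypothetical stamp from the log
    print strftime("%Y-%m-%d %H:%M:%S", localtime(int $t)), "\n";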
Using the Tools
Unless you run a large and assorted bunch of news machines, chances
are that you will not use all of the tools described here. However,
one set or the other should be useful, and they are written so that
they may be easily modified to generate the output most useful to
you.
Bill Davidsen was a system programmer at GE's Corporate
Research and Development Center, consulting part time through TMR
Associates, which became full time when he left GE. He ran Usenet
news at GE and is now project leader for Usenet news at a national
ISP.