Monitoring Usenet News
Bill Davidsen
              Running Usenet news on a large server has never been easy because 
              the nature of Usenet has been to expand to fill all resources. In 
the days of dial-up modems when I started running news, the Telebit Trailblazer modem came out with 18-Kbps connections. We thought the bandwidth problems were over because we could handle several MB a day -- now we handle several MB a second. To help run news on the least hardware possible, I have written some tools to help me see what the systems are doing.
              There are three major Usenet players in the UNIX world: INN, the 
              oldest and probably most widely used; Diablo, which evolved from 
a simple multi-threaded transit server to scalable software for readers and transit; and Cyclone, the only commercial product noted
              here, originally from Highwinds Software. Each of these includes 
              some reporting tools that can generate daily reports, graphs, and 
              other high-level summaries of what the system is doing.
              However, it's sometimes nice to have a report on system activity 
              that falls between a daily summary and the second-to-second detail 
of the raw log files. Additionally, these tools work well to generate single-page snapshots that fit on a slide, allowing you to keep management informed.
              INN Report Generation Tools
              One thing to watch is how much news you get over the course of 
              a day. This small Perl program, feedwatch.pl (Listing 1), 
              produces a table of feed characteristics versus time of day. This 
              can be used to monitor the data flow of the system at regular intervals 
-- by default, every minute, on the minute. The available options are displayed by the -? option, as shown in Listing 2. (All listings for this article are available from: www.sysadminmag.com.)
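The listings themselves are on the Web site, but the core of the approach is simple enough to sketch. The fragment below is a minimal sketch of that kind of per-minute tally, not the published feedwatch.pl; the log path, the assumption that the status code ('+', 'j', or '-') is the fourth field, and the assumption that logartsize is turned on in inn.conf so the article size follows the message-id are all mine and may need adjusting for your INN version.

    #!/usr/bin/perl -w
    # Minimal sketch of the kind of tally feedwatch.pl performs -- not the
    # published listing.  It assumes INN art-log ("news") lines of the form
    #   Mon DD HH:MM:SS.mmm <code> <feed> <message-id> <size> ...
    # where <code> is '+' (accepted), 'j' (junked), or '-' (rejected), and the
    # <size> field is present because logartsize is enabled in inn.conf.
    use strict;

    my (%off, %acc, %kb);
    my $log = shift || '/usr/local/news/log/news';
    open my $fh, '<', $log or die "open $log: $!\n";
    while (<$fh>) {
        my ($stamp, $code, $size) = (split ' ')[2, 3, 6];
        next unless defined $code;
        my $minute = substr $stamp, 0, 5;            # bucket by HH:MM
        $off{$minute}++;
        next unless $code eq '+';
        $acc{$minute}++;
        $kb{$minute} += $size / 1024 if defined $size and $size =~ /^\d+$/;
    }
    printf "%-5s %8s %8s %10s %8s\n", 'time', 'offered', 'accepted', 'art KB', 'KB/sec';
    for my $t (sort keys %off) {
        printf "%-5s %8d %8d %10.0f %8.1f\n",
            $t, $off{$t}, $acc{$t} || 0, $kb{$t} || 0, ($kb{$t} || 0) / 60;
    }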
              Feedwatch, by default, produces a tabular list of output values 
              (Listing 3) from the "news" log file (by default in the 
              /usr/local/news/log directory), which can be fed into sort, 
              gnuplot, or additional analysis programs. I use this output 
              on a minute-to-minute basis to observe the impact of changes to 
              my incoming feeds, network changes, or the tuning of the machine. 
              The output is from a transit server (sort of a news router), which 
              gets most of a full feed, and, at the time I grabbed this snapshot, 
              was accepting about 2.5 MB/sec.
              The example in Listing 3 has the headers enabled (-H) and 
              shown every 50 lines (-L50), with output generated every 3600 sec 
              (-i3600) to show a full day of output. The default is headers every 
              10 samples and output every minute, on the minute. Without the headers, 
              the output may easily be sorted on any of the fields, or even used 
              to produce plots via gnuplot when the plots produced by the 
              tools included with INN don't have exactly what you need.
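For example, a Perl one-liner (doing the same job as a numeric sort(1) on one key) will sort saved, header-free output on any column; here feedwatch.out is just a file of saved output, and field 5 is assumed to hold the KB/sec value, so adjust the index to the field you actually want:

    # Sort header-free feedwatch output on one numeric column, highest first.
    # Field 5 (index 4) is assumed to hold the KB/sec value -- adjust as needed.
    perl -e 'print sort { (split " ", $b)[4] <=> (split " ", $a)[4] } <>' feedwatch.out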
              Using the Output Data
              It's important to note that this output reflects articles 
              actually transferred over the network, not articles that were offered 
by feeding sites, because articles already seen are rejected. Everything you see as offered, junked, or rejected is taking your bandwidth.
Some useful fields are the article offer rate (how many articles were transferred from your feeds), the accept rate, and the data rate for incoming articles. The most commonly checked value is probably the incoming data rate (art KB/sec). This number will quickly change in response to network or feeding host changes. I refer to this a
              lot when I'm tuning a server that is having a hard time keeping 
              up with a large feed.
              The difference between the offer rate and the accept rate shows 
              how many articles were transferred over the network and then rejected 
              because of unwanted newsgroups or other conditions found by the 
              filter (if you use one). Filtering on this server is minimal, but 
certain groups are dropped for various reasons. If many articles are being dropped, you can be nice to your network by working with feeding sites to tune the feed and simply not send articles you don't want.
              The article size is a useful indicator of the content of the incoming 
              feed. Obviously, if you carry binary groups, you will see a larger 
              average article size than a text-only feed. A sudden change in the 
              article size indicates a feed change -- hopefully, an expected 
              one.
              Article size changes can be a clue to the nature of any problems 
              in the server if it is not keeping up with the incoming feed, which 
              shows up as a drop in the incoming byte rate or a backlog on the 
              feeding site. If the average article size is smaller than normal, 
              there may be a network bandwidth problem, allowing small articles 
              to pass and delaying binaries. A drop in article size suggests that 
              large articles are being delayed, and I usually check the disk and 
              network performance first.
If the article size increases, it suggests that the time to check whether an article has previously been seen has increased, since per-article overhead delays the flow of many small articles more than the flow of a few large ones. Of course,
              because article size changes without problems, these are clues rather 
              than "the smoking gun" for problem identification. This 
              information can be combined with the other tools described to help 
              identify problems.
              The article information (counts and accepts) was added to provide 
              data requested at one time by management. I seldom find those numbers 
              useful for identifying problems, but they may be useful for planning.
The default sample interval is one minute, on the minute. This sample is large enough to smooth many normal fluctuations, yet small enough that the display is still useful after getting a cup of coffee, though not after going to lunch. By forcing the samples on the minute, data
              for multiple servers can be watched in several windows, assuming 
              your clocks are properly set.
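The "on the minute" behavior is just the usual Perl idiom of sleeping to the next interval boundary; this is the general pattern, not a quote from feedwatch.pl:

    # Sleep to the next minute boundary so samples from several servers,
    # watched in separate windows, line up (assuming clocks are synchronized).
    use strict;
    my $interval = 60;                       # seconds between samples
    while (1) {
        sleep $interval - time() % $interval;
        # ... collect and print one sample here ...
    }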
              Where is innd Spending Its Time?
              The daily report generated by the INN default daily cleanup shows 
              useful data about where clock time is spent by the server. Unfortunately, 
              it only gives the high, low, and average times for all of the operations 
of the innd daemon. The current (as of February 20, 2001) development version gives many more categories of time resolution, perhaps more than are useful without some study of their meaning. 
              The timelog.pl program scans the news.notice log and 
              processes the history and daemon time data to produce a report (Listing 
              4). This report contains some of the more useful data given every 
              time the system writes a time record to the log file. Because the 
              time interval can be set in inn.conf, the granularity of 
              the samples can be controlled.
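For reference, the data comes from the timer lines innd periodically syslogs to news.notice. The fragment below is a minimal sketch of that kind of extraction, not the published timelog.pl; the line format shown and the timer names chosen (idle, artwrite, hiswrite, hishave) are assumptions that vary with the INN version, so check your own log before trusting the field list.

    #!/usr/bin/perl -w
    # Minimal sketch of a timelog-style report -- not the published timelog.pl.
    # Assumes innd timer lines in news.notice looking roughly like
    #   Feb 20 12:00:01 host innd: ME time 300042 idle 271000(12345) artwrite 9000(6000) ...
    # i.e. the sample length in ms followed by "name total(count)" pairs.
    # Which timer names appear depends on the INN version; adjust @want to taste.
    use strict;

    my @want = qw(idle artwrite hiswrite hishave);
    while (<>) {
        next unless /\binnd: ME time (\d+)\s+(.*)/;
        my ($sample, $rest) = ($1, $2);
        my $stamp = substr $_, 0, 15;                # syslog timestamp
        my (%total, %count);
        while ($rest =~ /(\w+) (\d+)\((\d+)\)/g) {
            ($total{$1}, $count{$1}) = ($2, $3);
        }
        my @out = ($stamp);
        for my $t (@want) {
            my ($tot, $cnt) = ($total{$t} || 0, $count{$t} || 0);
            # report ms per operation and percent of the clock-time sample used
            push @out, sprintf "%s %6.2fms %5.1f%%",
                $t, $cnt ? $tot / $cnt : 0, $sample ? 100 * $tot / $sample : 0;
        }
        print join('  ', @out), "\n";
    }

Run it against your news.notice file; the output is one line per timer record, ready for the hand-added headers mentioned below.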
              Note that the headers were provided by hand. This is a work in 
              progress, so any labels are shooting at a moving target. The program 
              only generates the data lines.
              Using the timelog.pl Output
              I usually try to maximize the idle time parameter with servers 
              taking a large feed. As the idle time gets smaller, the probability 
              of backlog and lost articles becomes higher. The server in the example 
              is not bogged down, but I would be happier with more idle time. 
              The trick is to see where the time really is going, as shown by 
              the other data items. If the daily report shows that other sections 
              of the daemon are taking excessive time, other fields may need to 
              be added or substituted in the sample. The overview write on reader 
              machines, and the Perl filter used to protect against unwanted articles, 
              are both areas that can consume more time than desired.
The two common bottlenecks are article write (writing the article text) and history write (recording the index entry that points to the article). They appear here as both time-per-article and percentage of the sample time used for the operation. Remember, this is clock time, not CPU time.
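For example (with made-up numbers), if artwrite shows 9000 ms spread over 6000 articles during a 300-second sample, that is 1.5 ms per article and 3% of the clock time in that sample.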
              Note that on this machine both write operations have a fairly 
              large time in ms/article. This reflects less-than-optimal performance 
              of the disk subsystem attached to this server. I am actively tuning 
              several of these, looking for better overall performance through 
              network and disk tuning.
              The next two data items reflect the history "have" time, 
              or time to look up an article being offered to see whether the server 
              has already seen it. On a transit server, a lot of feeds come in, 
              resulting in many duplicates as articles are offered from several 
              feeds. The time for lookups is pretty good on this server, rising 
              to a few ms only during the daily expire process.
To reduce disk I/O, parts of the history file information are cached in memory. Obviously, it is desirable to have the information in cache rather than reading from disk, so the percentage of misses in the cache can indicate a need for tuning. In this case, the misses
              are higher than I would like, but doubling the cache size didn't 
              help. This is because the machine has some really slow feeds, which 
              offer articles first seen many seconds ago. The good "have" 
              time indicates that even if the information is not in cache, the 
              operating system (Linux in this case) has saved the required information 
              in disk buffers, so physical disk operations are not needed. While 
              I would like to get misses down to <1%, it's obvious that 
              there is not much to be gained on this system because the time spent 
              in lookup is essentially zero.
By using the two monitoring tools for short-term performance, and reading the daily reports for overall performance, it's possible to find problems, improve performance, and, most importantly, document that you are using the machine well and not using (or
              asking for) more machine than you need. DBAs can be helpful by telling 
              management how much they would spend to handle adding and deleting 
              about a million records a day, totaling about 200 GB.
              Twister Report Generation Tools
While INN is open source software, Twister, Cyclone, and Typhoon are typical commercial software. They come with manuals and technical support, but no source, and some of the most useful reports generated as these programs run are output lines over 200 characters wide. Because you will not be modifying code, you will use normal UNIX tools to monitor disk, network, and CPU use, rather than logs detailing the internal operation of the program. You still have to monitor how the users utilize the system, however, and I have written some small tools to help me do that.
              Looking at the Local Posts
              One thing I occasionally must check is what and how much the users 
              post on a server, or who actually posted a message on a server (if 
              in fact any of the users did). The source of this information is 
              the "localposts" file, which has a great deal of information 
on very wide lines. The localpost.pl program (Listing 5) generates the information in a less demanding format, which will fit in a 100-column window or on a typical small printer.
              The information I extract is the post time, size of the post, 
              message-id, the poster's identity, and the groups to which 
              it was posted. In case I have to work from a very limited terminal, 
              there is an option (-m) to present the information on multiple 
              lines, each less than 80 columns wide. An example of the narrow 
              output is shown in Listing 6.
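Because localposts is specific to the commercial servers, I won't guess at its real layout; the field positions in the sketch below are purely hypothetical placeholders, and only the reformatting approach (one wide line, or several narrow ones with -m) resembles what localpost.pl does.

    #!/usr/bin/perl -w
    # Sketch of the reformatting approach only -- not the published localpost.pl.
    # The field positions are HYPOTHETICAL placeholders; check a real localposts
    # line and change the indices to match your server's layout.
    use strict;
    use Getopt::Std;

    my %opt;
    getopts('m', \%opt);                     # -m: fold output onto lines < 80 cols
    while (<>) {
        my @f = split;
        next unless @f;
        # hypothetical positions: post time, size, message-id, poster, newsgroups
        my ($when, $size, $msgid, $poster, $groups) = @f[0 .. 4];
        if ($opt{m}) {
            print "$when  $size bytes  $poster\n",
                  "  $msgid\n",
                  "  $groups\n";
        } else {
            printf "%-16s %8s %-28s %s\n", $when, $size, $poster, $groups;
        }
    }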
              Looking at the Details
At the close of each reader socket, details of the connection are output on a single line of the stats_log.detailed file, giving the number of overview requests, articles read, groups visited, and posts, and the total bytes involved in each type of request. The
              log is reformatted by the detail.pl program (Listing 7) to 
              make these most useful details available on a much shorter line. 
              Frequently, the detail is suppressed (-S) and only a summary 
              (Listing 8) report is produced. A summary of activity by each user 
              may also be produced to provide information for identifying very 
active users. The options are shown in Listing 9.
              Because each line includes a user class (feed type) field, there 
              is an option (-G) to select only records from that group. 
              There is an option (-U) to generate a table of user connections 
              open vs. time, but it proved to be no more useful or accurate than 
              similar information on an hourly basis in stats_log.readers.
              Looking at the Reader Behavior
              The information in stats_log.readers is on lines about 
170 columns wide. It includes start and stop times in fractional-second "UNIX time" (seconds since January 1, 1970 GMT), and all
              the information on the activity of the socket, some details of which 
              are seldom useful. This hourly summary of reader behavior is often 
              helpful in understanding what the server is being asked to do.
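Turning those timestamps back into readable clock time is the only non-obvious part of the reformatting; the helper below is plain Perl, not a quote from readers.pl:

    # Convert a fractional "UNIX time" value (seconds since January 1, 1970 GMT),
    # as found in stats_log.readers, into readable local time.
    use strict;
    use POSIX qw(strftime);

    sub stamp {
        my ($epoch) = @_;                                    # e.g. 982684800.374
        my $ms = sprintf "%03d", int(1000 * ($epoch - int $epoch) + 0.5);
        return strftime("%Y-%m-%d %H:%M:%S", localtime(int $epoch)) . ".$ms";
    }

    print stamp(982684800.374), "\n";        # prints the time in your local zone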
              The readers.pl program (Listing 10) reformats the most 
              commonly wanted information to a narrower (120 column) format with 
              headers, which fits in a window or on a printer page. Because the 
              output is given as a single line per hour for each feed (connection 
              class), the whole output can often be easily browsed. Listing 3 
              shows what information is retained in the more compact version.
              Using the Tools
              Unless you run a large and assorted bunch of news machines, chances 
              are that you will not use all of the tools described here. However, 
              one set or the other should be useful, and they are written so that 
              they may be easily modified to generate the output most useful to 
              you.
              Bill Davidsen was a system programmer at GE's Corporate 
              Research and Development Center, consulting part time through TMR 
              Associates, which became full time when he left GE. He ran Usenet 
              news at GE and is now project leader for Usenet news at a national 
              ISP.