Inexpensive Network Troubleshooting
Jonathan Feldman
System administration is more than simply keeping servers and workstations in good working order. The state of the network connecting the servers and workstations is also of major importance. When problems arise on the network, however, determining exactly what is causing the problem can be challenging, particularly if you do not possess sophisticated network troubleshooting tools. This article explores how to go about troubleshooting your network inexpensively.
Just about all network troubleshooting can be inexpensive. Deduction, logical reasoning, and simple tools can almost always pinpoint a problem. The process of substitution and rule-out is powerful: you may not be able to name your pain without that $10,000 network analyzer, but you can typically locate the problem and remove it by applying your basic understanding of how your network ought to be working.
Unless you are in the engineering business, attacking day-to-day networking problems with a network sniffer can actually be overkill. Looking at a raw dump from a busy and troubled network usually presents far too much data, particularly if you do not use some form of packet filtering on the sniffer. So, it's a good idea, even if you do have such a device, to apply a little "what if" reasoning before you sniff.
When attacking any network problem, it's best to have as much information as you possibly can about the nature of the problem. Fortunately, many free UNIX utilities can be excellent sources of this information. And even better, many of these utilities have been ported to other platforms, which allows you to leverage your UNIX knowledge to solve multi-platform problems. It's kind of neat to be able to do a netstat -a on Windows 95.
But where to start? Well, it's useful to think about network problems in two ways: workstation or local segment problems, or switched or routed environment problems. I typically start with local problem determination, and then move on to the routed or switched environment. For example, having a user change workstations can be quite revealing. It's very important to make sure that your workstations and segments or rings are okay before pointing the finger at routers or switches. Having a local problem can often make it look as if a router or a switch is to blame, when in fact it is not.
Segment Snooping: "What's Changed?..."
The first place to start is on the physical layer, specifically on the host or workstation that you're troubleshooting. Many UNIX implementations have some sort of ethstat command, which will reveal driver-level errors. In addition, the netstat -i command will show interface-level error rates, as will ifconfig. You can see persistent error conditions by repeating the command and watching an error count climb.
For example, under SCO, you might see:
# netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
net1 1500 200.1.1 endor 1357322 0 2228930 141 422
lo0 8232 loopback localhost 7585 0 7585 0 0
The "Ipkts" and "Ierrs" columns stand for Input Packets and Input Errors, respectively. "Opkts" and "Oerrs" denote output packets and errors. This SCO system's administrator might surmise that all is well because the number of output errors compared to the number of packets total is very small and is not climbing.
A Linux user might also type this command and get a similar, though not exact, output. However, a Linux user could also use the proc filesystem to see what's going on in the kernel:
# cat /proc/net/dev
Inter-| Receive | Transmit
face |packets errs drop fifo frame|packets errs drop fifo colls carrier
lo: 0 0 0 0 0 205 0 0 0 0 0
tr0: 5626 0 0 0 0 5181 0 0 0 0 0
dummy: No statistics available.
ppp0: 3336 0 0 0 0 2656 0 0 0 0 0
You can rule out a physical network problem by checking out other stations on the network: are they also noticing errors on the network? If not, the problem is probably not on the wire, and you'll need to rule out the workstation, then move on to routing or switching equipment.
In a case where you have ruled out everything except the wire, you might need to borrow a cable scanner from a friend, or even rent one. However, it's always much less expensive to start asking people "so, what's changed recently?" The answer, of course, will be "nothing," but keep asking; you'll find out that either a new station's been added to the segment, or someone shut off an "unnecessary" hub.
When you are certain that your problem is a physical one, the worst-case scenario is that you might have to "divide and conquer" the segment. This is when you shut off half of the segment, check for continued trouble, then shut off half of that, and so on. However, this is really a last resort, especially during working hours, when turning off access to the network might be construed as antisocial.
Sometimes a couple of machines will start yelling about a problem with a particular station on the network, referring to it either by IP address or MAC address, for example, when a duplicate IP address is detected. Since you have kept documentation of your MAC addresses, finding this station will be no problem, right? If you haven't kept such good records, or are working with someone else's undocumented network, you might want to start collecting this information.
You can use arp in conjunction with ping to collect a list of MAC translations from IP. Many UNIX implementations will allow you to ping the broadcast address, for example, 167.195.160.255. You'll need to be physically connected to the segment, of course, but after you've pinged the heck out of it, arp -a should give you a list of the MACs currently on the segment, for example
# arp -a
Address HWtype HWaddress Flags Mask Iface
167.195.160.6 ether 00:06:31:24:45:BD C * eth0
167.195.160.9 ether 00:00:B4:3B:A0:51 C * eth0
It's very helpful to have such a list, especially before a problem strikes! Remember that you can find the manufacturer of the network card by looking at the first three hex bytes, called the Organizationally Unique Identifier (OUI) by the IEEE, which, like it sounds, are unique to each manufacturer. For example, the 00-06-31 card up there is a 3Com card, and the 00-00-B4 card is an "Edimax" card, which happens to be an NE2000 clone. (There is a very good "unofficial" OUI list at http://www.net3group.com/oui.htm.) If you know that you do not have a card from a particular MAC address that is listed, you might want to start thinking "hardware problem." Also, bizarre-looking values such as 00-00-02-02-02-02 are major hints that there is a physical problem on the segment!
Supposing that the local network segment seems okay, it still is a good idea to check out the workstations or hosts involved. Here's where netstat can be a lot of help. For example, a netstat -a will show you all of your active network connections, whether client or server. I run netstat -an just about every time I get a trouble call. I usually use the -n flag (show numeric, no host names) because things get hairy enough without dragging name resolution into the picture. For example, say that somebody calls from the 151.1.26 network, saying that they can't get into the host degobah, with a host address of 167.195.162.5. I first telnet to degobah from my segment, then, on degobah, type:
$ netstat -an | awk '/Proto|151.1.26/ {print $0}'
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp 0 0 167.195.162.5.23 151.1.26.30.1773 ESTABLISHED
tcp 0 0 167.195.162.5.23 151.1.26.50.1798 ESTABLISHED
tcp 0 0 167.195.162.5.23 151.1.26.49.1822 ESTABLISHED
tcp 0 0 167.195.162.5.23 151.1.26.31.1610 ESTABLISHED
I immediately see that other folks from 151.1.26 can get to the telnet socket (23) on degobah, they've got ESTABLISHED connections! (See the man page for other states.) So, this might involve an entire segment if it's more than one person, but more likely, I'm dealing with a misconfigured workstation (since the phone hasn't exploded with others).
Or, say a user is trying to print from an app, and the application developer has shunted the call to you. The user's trying to print from the host to a weird client-server spooler somewhere on another users's station. You find out the server's socket number, say, 3986, and see if any activity is happening when the user prints
$ while true ; do netstat -an | grep 3986 | grep user-ip-number ; sleep 1 ; done
If you don't see any output in the "Foreign address" column, something in the app isn't kicking off correctly, and you'll need to find the application support person. If you do see output, it's time to focus on the output end-station or what's in between; something might be preventing the output acceptance.
Even if you do see output here, pay careful attention to the "Send-Q" column, since you are trying to send something to another station. A high or climbing value here means that the host is trying to send, but is being deferred, possibly due to a busy far-end station, but it might also signal a network or routing problem. Again, find out whether this end-station is the only one with a climbing Send-Q to narrow it down to the station or the environment. There are, however, normal situations where you will see temporarily high Send-Q values, such as a hardware print server without a hard drive to queue data.
Similarly, the Recv-Q indicates how much of a receive backlog the host has. A high or climbing value is bad, and will merit further study. Check your sar report: are you running out of buffers or memory? Or do you just have a lot of traffic? Some versions of ifconfig or ethstat report numbers of packets transmitted or received; a crude way to estimate traffic on a host without an analyzer is to subtract values between ifconfig or ethstat samples.
On the Windows 95/NT side, netstat can be useful to see if a client program is doing what it's supposed to be. For example, if a user can't use the Socks firewall at address 151.1.26.15, you might want to kick off a netstat -a in a DOS window while the browser is trying to make a connection. If you don't see a socket connection initiated to 1080 (the default Socks socket), it probably means that the client isn't configured properly.
The neat thing is, most of the above will also work on Windows 95 and NT, as arp, netstat, and ping work pretty much the same way on those operating systems.
BROUTE & SWITCH: "...Nothing!"
Whether you're bridging, routing, or switching, problems with these fall into two categories: someone's been messing with it, or you've had a hardware failure. Once a switch or a router is up and functioning, very little will stop it, other than a well-intentioned human or a well-intentioned but failing piece of hardware. We always look for human-originated problems first.
After you've queried everyone in your shop, all of whom have developed sudden amnesia as to their recent activities, and figured out that no hardware has failed, it's time to roll up your sleeves and figure out what has changed. First, get out your handy network map, which you of course have been updating as you expand your network. Next, find out who can't talk to who. Then grab your router or switch manual and start testing the links piece by piece, armed with ping, traceroute, and netstat -rn.
Let's say, for instance, that the South Side campus can't talk to the Downtown campus, reported in the guise of "no one at the South Side can get to the UNIX application server Downtown." You happen to work Downtown, so you look at the section of the network map of interest Figure 1, and try to ping the router out on the South Side, at 167.195.160.1.2. It doesn't work, as reported. Moving inwards, you try to ping 167.195.160.1, and this too fails. A traceroute from your station on 167.195.163 to the South Side ring, 167.195.199.0 shows that your station tries to get to the "Downtown Campus Router," and fails. A ping of the Downtown router works from the ring that it's on, but when you hop onto the UNIX host, you can neither ping the Downtown router's Token Ring IP address, nor its the Wide Area IP address. What's going on? Apparently, the routing table is damaged. Could there be daemons on the wire?
Only human ones. This actually happened to us, and took a bit of troubleshooting to discover. Since the initial investigation revealed that something was up with the Downtown router, we used the Downtown router's utilities to reveal that the router thought that the best route to 167.195.161 was via the Novell Server! So, we ran over to the server in question and checked its IP configuration. Apparently, the tech who put it together had gone through the install menus, and hit "yes" for IP forwarding. To compound this, he had given the Fast Ethernet card the network number of the regular Ethernet, which hadn't been bridged over to the new switch yet. So, this multi-homed server started to advertise that it had the best route to the regular Ethernet. Unfortunately, it wasn't even connected to the regular Ethernet, just the fast Ethernet! Even though the server wasn't in production yet, it still caused a major headache.
The lesson is (apart from putting test servers on test networks), if you have a multiplatform environment, you need to have enough knowledge on each platform to be able to check basics like routing tables, ARP caches, and so on. Under NT and Windows 95, this is easy because many of the utilities are named the same, even ifconfig, (winipcfg under 95). Some utilities, however, are named slightly differently. Under Novell, you'll have to use utilities such as TCPCON (which will show you routing tables and arp tables) and INETCFG to check your configuration. A recent release of a TCP patch for Novell finally includes ping as well, which you'll want to get from http://support.novell.com.
Even with switches in the picture, most of us with a large campus still have a router or two out there, particularly if you have more than one type of media (Token Ring, Ethernet, FDDI, etc.) in the picture. So, traceroute is almost a necessity. Amazingly enough, some UNIX distributions don't include a traceroute command! If you're unlucky enough to be in this boat, get traceroute source code from Lawrence Berkeley Labs (LBL) at ftp://ftp.ee.lbl.gov/traceroute.tar.Z, compile it, and have fun.
Switches basically behave like bridges on steroids, and can present difficulty, particularly if multiple segments are documented as one segment, leading you to believe that you can put a meter or analyzer on it. You can do so, but you don't have the whole picture unless you are on the correct physical segment.
Analysis: "Tell Me about Your Mom's Network..."
Even when you do your homework, a network analyzer of some sort may still be necessary. For example, let's say that you have ruled out a problem with printing to multiple queues at a given LPD print server (other hosts can print to it), and you have ruled IN the host (it has the same problem with other LPD print servers). If you report this to the host OS vendor, you're sure to get the runaround, especially when dealing with first-tier support. The only way you're going to avoid the horrible finger-pointing game is through the use of hard data. Yes, you can support your claim with your free tools (netstat is a good one in this case), but most vendors will insist on some sort of analyzer trace, probably because they figure that they won't get one and you'll just drop the issue.
Contrary to popular belief, however, analyzers don't have to be hugely expensive to be usable. In the troubleshooting context, an analyzer's germane function is to capture packets and display them in a human-readable format. You get that with any commercial analyzer, and even with the free ones. To prove the above "multiple queue" problem to your vendor, you'd kick off a couple of print jobs to multiple queues, and prove that the host only processes one queue's request at a time; the rest are not even showing up from the host until the first one is done. So as not to send a 5MB tracefile to the vendor, you'd probably want to use a capture filter based upon the print server's MAC or IP address, and possibly reduce it further by capturing only TCP socket 515 (lpd).
So, use what you know, or what is most affordable for you. If you are an AIX hound, use the iptrace service, which comes with AIX, startable with startsrc -s iptrace. (See the man page for details.) If you feel like having an adventure, grab tcpdump, which is an oldie but a goodie, and has been updated by LBL to "compile under an ANSI C compiler."
On the free side, sniffit is fairly new on the scene. I have been pretty impressed so far. You can get it from any number of Linux mirror sites, or from its home in Belgium at http://reptile.rug.ac.be/~coder/ \ sniffit/sniffit.html. It will do ASCII dumps of packets, unlike tcpdump. Yes, this makes it a cracker tool, and the documentation rather bizarrely points out how to log folks' passwords, but this seems to be over-enthusiasm rather than malice. It nonetheless seems to be a very good tool for the working network troubleshooter: it does a reasonable job at capturing packets, allowing filters, logging, and captures IP, ICMP, TCP, UDP, and has a DNS plug-in.
It has a very nice-looking interactive mode, which allows for a fairly clear flyover of a given network. It shows current IP connections on the segment, along with port numbers, it allows you to change your filter on the fly, and you can choose to show a statistics window and a log window. The log window in Figure 2 shows the escape codes that are redrawing the sniffit screen on my terminal emulator.
When you're serious about solving a problem, however, you might want to use its non-interactive mode to capture packets to a file for later analysis. You can opt to use command-line parameters to specify your filter, or use a configuration file. One neat feature is that you have the option to log only the first n bytes of a connection-oriented trace, allowing larger traces with less fluff.
One of Sniffit's weaknesses seems to be its documentation; some of the language and explanations are a bit confusing at first, but once you get through it, it is rather simple to use. Also, since the trace is in pure ASCII, it is cumbersome to filter it once you've already captured it, and may be difficult or unreadable for a vendor who is not familiar with the package. It is known to work with the more modern Linuxes (I couldn't get it to run on kernel 1.2.x, but it worked fine with 2.0.30) as well as IRIX, Solaris, and FreeBSD, and is probably more featureful than anything else that is free.
If you choose to go the less-expensive commercial route, check out http://ftp.co.chatham.ga.us/analyzers.html for links to some of the less expensive or free analyzers. Some of them are quite good; we use LANalyzer and have been very happy with it. We've also used LANDecoder with success as well. All other things being equal, some of the bells and whistles to look for in the vendor's propaganda include packet generation, auto-name discovery, and the ability to save tracefiles in different formats, so that your vendor can read your traces.
While having a network analyzer can be useful (or necessary, in some cases), the foregoing examples show that much can be done in the way of network troubleshooting by combining standard utilities with a little deductive reasoning. That combination translates into inexpensive, and is in keeping with the UNIX tradition of doing more than anyone thought possible with what you already have.
About the Author
Jonathan Feldman works with a couple of flavors of UNIX, NetWare, and NT in his role as Technical Systems Manager at the Chatham County Government in Savannah, GA.
About the Author
Jonathan Feldman works with a couple of flavors of UNIX, NetWare, and NT in his role as Technical Systems Manager at the Chatham County Government in Savannah, GA.
|