Multi-Vendor LAN Troubleshooting
Tom Podnar
During an infrastructure planning session, our IT department realized
that we were running out of IP addresses more quickly than anticipated.
Given the current rate of expansion, we estimated that we would
be out of usable IP addresses within 90 days. We decided a NAT
(Network Address Translation) implementation would remedy the situation
and selected a Cisco router-based solution as the foundation of
our NAT LAN upgrade project.
We quickly designed several new growth phases for our existing
LAN, and purchased a Cisco 2621 router for the project. Within days,
we integrated the Cisco 2621 into our existing LAN infrastructure
and placed several workstations "behind" the NAT router
for testing. There were no user complaints after two weeks. Everything
appeared to be functioning as expected. We were pleased with the
results and intended to expand the beta program to an additional
group of computers.
LAN trouble, however, often appears when least expected. A member
of our IT staff arrived one day and logged into his NT 4.0 (SP 6a)
client named galipoli. He checked his email using Netscape Communicator
4.7. When his computer contacted our corporate Red Hat Linux POP3
server, the "Receiving Messages" status box appeared,
downloaded the first two messages, and then persisted on the screen.
No error messages were generated, and the session never timed out.
Each POP3 re-attempt was met with the same result. He verified that
his Netscape client user profile was configured properly. Next,
he verified that his network connectivity was fine by pinging the
POP3 server. The configuration and connection all seemed okay. At
this point, he brought the issue to the attention of the rest of
the IT team. There were no reports of similar problems among other
users. Therefore, we thought that the situation was probably an
issue with that specific client's configuration.
We examined the user's mail spool file on the email server,
and discovered that the third message was more than 1 MB. In the
past, we had experienced situations, usually for dial-up users,
where a large email message would cause similar problems. We deleted
the third message in the mail spool, and the client was able to
POP all of the mail off. However, the troubling "third"
message had been sent to the entire IT staff, and no one else on
the team had reported any problems downloading it. To verify that
the problem still existed, we sent another large email to his email
address and copied the entire IT team. Everyone was able to POP
the mail message off, except that user's NT client, galipoli.
For secondary verification of connectivity, we ran telnet
mail 110 while the Netscape mail client was
"hanging" on the POP3 download attempt. The telnet
attempt to the mail server's POP3 port was successful. We still
thought the problem was local to galipoli, so we re-installed the
Netscape Communicator software on it. This did not result in a solution.
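For reference, this kind of manual connectivity check looks roughly
like the following (the greeting banner varies by POP3 server, and
mail here is simply the mail server's hostname):

C:\> telnet mail 110
+OK POP3 server ready
QUIT
+OK

A successful greeting confirms only that a TCP connection to port
110 can be established; it says nothing about whether a large message
can actually be transferred.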
Several members of the IT staff had just finished reading TCP/IP
Illustrated, Volume 1, by W. Richard Stevens (Addison-Wesley),
so we were eager to put our additional TCP/IP know-how to work for
us. As illustrated by Figure 1, the only difference between galipoli
and the other IT team members' workstations was that galipoli
was connected to a 3COM 100-mbps unmanaged switch. This was the
case because 3COM switch (C) had just been returned to us from
3COM after a warranty repair. We were still in the process of verifying
that the switch was operational, so 3COM switch (C) sat at the
user's desk rather than in the wiring closet. We immediately
thought that 3COM switch (C) was having some problems. We removed
it from the network path, and tried to POP a large email again.
The same symptoms appeared. At this point, we were actually unaware
that galipoli was now directly plugged into 3COM switch (A). This
specific 100-mbps switch, 3COM switch (A), was usually reserved
for production servers. The removal of 3COM switch (C) should have
resulted in galipoli being directly plugged into 3COM switch (B).
Instead, due to an oversight and a shortage of free 10-mbps ports,
galipoli's LAN port had been directly connected to 3COM switch
(A) in the wiring closet.
The only good news at this point was that the problem was consistent.
A consistent network problem is a "good problem", because
it's usually easier to track down. The architecture of the
LAN had not changed for almost a year, except for the recent addition
of the Cisco 2621 and its associated NAT subnet. Galipoli had been
assigned a NAT'ed IP address. We had other workstations on
the same NAT subnet as galipoli, which had no problems downloading
large emails from the POP3 server. Because network problems can
commonly be tracked down with a "what changed last" philosophy,
NAT was our prime suspect. We changed galipoli's IP address
back to being non-NAT'ed, but still physically cross-connected
through 3COM 100-mbps switches (A) and (C). The problem vanished
immediately. When galipoli was moved back to a NAT'ed IP address,
the problem immediately reappeared. The only difference was that
when galipoli had a non-NAT'ed IP address, its packets had
a direct route to the firewall and were bypassing the Cisco 2621
router.
We examined the NAT configuration on the Cisco 2621 (IOS version
12.1(6)) and verified that it was correct. To debug further, in
the Cisco IOS, we ran the command show ip nat translations.
This showed that a NAT translation table entry was created for the
POP3 connection:
cisco-2621>show ip nat translations
Pro Inside global Inside local Outside local Outside global
--- 207.238.210.231 172.16.10.9 --- ---
tcp 207.238.210.230:2136 172.16.10.10:2136 207.238.210.5:110 207.238.210.5:110
tcp 207.238.210.230:2137 172.16.10.10:2137 207.238.210.5:110 207.238.210.5:110
Next, we ran show ip nat statistics, with the following result:
cisco-2621>show ip nat statistics
Total active translations: 5 (1 static, 4 dynamic; 4 extended)
Outside interfaces:
Ethernet1/0
Inside interfaces:
FastEthernet0/0
Hits: 7380067 Misses: 155877
Expired translations: 155969
Dynamic mappings:
-- Inside Source
access-list 1 pool cbsw-pool refcount 4
pool cbsw-pool: netmask 255.255.255.240
start 207.238.210.230 end 207.238.210.230
type generic, total addresses 1, allocated 1 (100%), misses 1386
We noticed that the misses count seemed high, and it kept increasing
with each POP3 session attempt. To find out exactly what the misses
counter was used for, we turned to the Cisco technical support Web
site. We found that it was normal for the misses counter to increase.
Since each POP3 session initiated a new NAT'ed network connection
through the Cisco 2621, there would be no pre-existing NAT table
entry for it. Each new connection therefore registered as a miss,
which is a normal part of NAT operation. From this, we concluded
that there were no obvious NAT-related issues causing our problems.
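For completeness, the NAT portion of the 2621's configuration was
essentially of the following form. This is a sketch reconstructed
from the pool name, addresses, and interfaces shown in the output
above; the access-list range and the static mapping for 172.16.10.9
are inferred from the translation table, and the overload keyword
is implied by the single-address pool with port-extended entries:

interface FastEthernet0/0
 ip nat inside
!
interface Ethernet1/0
 ip nat outside
!
ip nat pool cbsw-pool 207.238.210.230 207.238.210.230 netmask 255.255.255.240
ip nat inside source list 1 pool cbsw-pool overload
ip nat inside source static 172.16.10.9 207.238.210.231
access-list 1 permit 172.16.10.0 0.0.0.255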
Galipoli was running NT Server 4.0, so we installed the Network
Monitor Tools Service (packet sniffer) via the NT Control Panel.
From galipoli, we started capturing TCP/IP packets, and initiated
a POP3 download of a large email. The TCP/IP session was established
correctly (SYN, SYN-ACK, ACK). We then observed that the packet
flow containing the data of the email message started, paused for
a second, restarted transmission, paused for about 2 seconds, restarted,
paused for about 3 seconds, restarted, paused for 6 seconds, restarted,
and then paused for about 12 seconds. The pause duration stayed
at a consistent delay length for about a minute, and then began
an exponential increase. We also noticed that the POP3 server's
TCP sequence numbers kept gradually increasing, so it appeared that
data was slowly getting through. However, ACKs generated by both
galipoli and our POP3 server were sporadically being dropped and
never reached their destinations. We let this
run for about 20 minutes, and then canceled the POP3 download. Upon
referring to Richard Stevens' book, we attributed this behavior
to TCP's retransmission timer and its built-in exponential
backoff delays.
We knew that our LAN was not heavily loaded. We speculated that
possibly some Cisco router buffers were configured to be too small
and were getting full. We checked this via the IOS during a POP3
attempt. We ran the commands show memory and show buffers.
This showed that we were barely using any of the maximum capacity
of the Cisco 2621's memory or buffers. Next, we attempted to
duplicate a known-good LAN connection: we moved galipoli to the
10-mbps switch that another NT 4.0 server, atacama, was connected
to and using without any problems. Once the switch port was moved, the POP3
large-email download problem went away, but we noticed another problem.
The total download time for the large email seemed abnormally slow.
This was especially apparent when compared with the non-NAT'ed
POP3 downloads.
Further investigation with the NT Network Monitor showed that sluggish
TCP/IP transfer times were consistent across all of our NAT'ed
clients during large POP3 downloads. The download slowness had been
overlooked, because we seldom received large emails. Our POP3 connections
had always eventually completed within a reasonable amount of time.
It was simply a coincidence that galipoli had been plugged into
a 100-mbps switch in the first place. When galipoli was moved back
to the 100-mbps switch, the original POP3 download issue immediately
reappeared. When galipoli was given a non-NAT'ed IP address
(while still connected to the 3COM 100-mbps switches), the sluggish
TCP/IP transfer times disappeared, and the downloads were successful.
This problem existed across the entire NAT subnet. A single network
device, or, more likely, a combination of several of the devices
in Figure 1, was responsible for creating the network problems. We
had looked at all of the "usual suspects" with no luck.
At this point, a senior member of the IT team theorized that the
problem had more to do with large data transfers in general, rather
than issues specific to POP3 email downloads. We verified this theory
by attempting to FTP a large file from our POP3 email server down
to one of our NAT'ed clients. The results were as follows:
- A non-NAT'ed client attached to 100-mbps 3COM switch (C)
would FTP the file very quickly with a high throughput.
- A non-NAT'ed client attached to 10-mbps 3COM switch (B)
would also FTP the file very quickly with high throughput.
- A NAT'ed client attached to 10-mbps 3COM switch (B) would
FTP the large file, but slower than the non-NAT'ed client.
- A NAT'ed client attached to 100-mbps 3COM switch (C) consistently
could not FTP the large file. We saw similar exponentially increasing
time delays for packet transmissions.
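For reference, each test was nothing more than a binary FTP get of
the same large file from the mail server (after logging in), comparing
the transfer statistics reported by the ftp client at the end of
each download; the file name below is a placeholder:

C:\> ftp mail
ftp> binary
ftp> get large-test-file.zip
ftp> bye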
We decided to do some more packet sniffing. This time, we looked
at one port of the firewall's quad Ethernet adapter, /dev/qfe2.
We used the Solaris snoop command snoop -d qfe2 -v -x 0.
We captured POP3 packets from several sessions, and attempted to
compare the packet data in 10-mbps switch transfers versus 100-mbps
switch transfers. The captured output was strikingly similar, up
to the point where the 100-mbps connection had quickly given way
to exponential TCP/IP delays, and ACKs were not delivered.
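When comparing sessions like this, it also helps to save each capture
to a file so the two traces can be reviewed side by side. A sketch,
with the capture file name as a placeholder:

firewall# snoop -d qfe2 -o /tmp/pop3-100mbps.cap port 110
firewall# snoop -i /tmp/pop3-100mbps.cap -v -x 0 | more

The port 110 filter limits the capture to POP3 traffic, -o writes
the raw packets to a file, and -i replays a saved file through the
same verbose decoding we used on the live capture.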
After finding no additional clues in the packets, we focused on
the Cisco router. The Cisco IOS provides excellent statistics reporting
tools for its Ethernet interfaces. In the IOS, we turned on Ethernet
debugging with the command debug Ethernet. This dumped an enormous
amount of Ethernet packet information to the Cisco console.
There were no noticeable Ethernet errors reported. We now felt strongly
that the network issues existed at the Ethernet level between the
Cisco and the 3COM switches. We turned our primary focus to the
Cisco Ethernet port, Fast Ethernet 0/0. We performed a show interface
FastEthernet 0/0. This statistics dump presented no relevant
errors. We then examined the other Cisco Ethernet interface, Ethernet
1/0. Upon running show interface Ethernet 1/0, we immediately
spotted a large number of CRC input errors:
2621#show interface Ethernet 1/0
Ethernet1/0 is up, line protocol is up
Hardware is AmdP2, address is 0003.6bba.4290 (bia 0003.6bba.4290)
Internet address is 207.238.210.226/28
MTU 1500 bytes, BW 10000 Kbit, DLY 1000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:03:18, output 00:00:03, output hang never
Last clearing of "show interface" counters never
Queueing strategy: fifo
Output queue 0/40, 0 drops; input queue 0/75, 0 drops
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
4427672 packets input, 4065823311 bytes, 0 no buffer
Received 3486 broadcasts, 0 runts, 0 giants, 0 throttles
525 input errors, 525 CRC, 1 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
3899075 packets output, 419300185 bytes, 0 underruns
100 output errors, 34848 collisions, 25 interface resets
0 babbles, 0 late collision, 26313 deferred
42 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
We ran another ftp attempt between galipoli and the POP3 server,
and the CRC counters on Ethernet 1/0 continued to rise. We wondered
why we were getting CRC errors and thought that maybe one of our
category 5 cables was defective, but replacing the LAN cables did
not help. Digging further, we examined the Fast Ethernet
interface, /dev/qfe2, on our Sun Solaris firewall. On the Sun,
we executed netstat -k qfe2. The following detailed Ethernet
statistics were displayed for the interface /dev/qfe2.
firewall# netstat -k qfe2
qfe2:
ipackets 7262169 ierrors 126 opackets 7685391 oerrors 28174 collisions 96857
defer 0 framing 70 crc 27 sqe 0 code_violations 0 len_errors 14
drop 14 buff 0 oflo 0 uflo 0 missed 1 tx_late_collisions 27959
retry_error 0 first_collisions 0 nocarrier 56 inits 113 nocanput 0
allocbfail 0 runt 0 jabber 0 babble 215 tmd_error 0 tx_late_error 0
rx_late_error 0 slv_parity_error 0 tx_parity_error 0 rx_parity_error 0
slv_error_ack 0 tx_error_ack 0 rx_error_ack 0 tx_tag_error 0
rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0 no_rbufs 0
rx_late_collisions 0 rbytes 2647762907 obytes 1926887676 multircv 125 multixmt 0
brdcstrcv 480091 brdcstxmt 4944 norcvbuf 0 noxmtbuf 58
Solaris reported a high number of "transmit late collisions"
errors on /dev/qfe2. In comparison, the other three Ethernet
interfaces on the Sun quad card had transmit late collision counters
of zero. We did some Internet newsgroup archive searching and found
several discussions about mismatched Ethernet duplex settings causing
CRC and late collision problems. In a duplex mismatch, the full-duplex
side transmits whenever it has data; the half-duplex side treats
those transmissions as collisions (often late collisions) and aborts
its own frames, which the full-duplex side then counts as CRC errors.
We knew that the Sun quad card was capable of running 100-mbps
full duplex. The entire card had been set during initial installation
to autonegotiate its duplex. We verified this with the Solaris /usr/sbin/ndd
command, then moved back to the Cisco router. Via the Cisco IOS,
we looked at the Ethernet interface, Ethernet 1/0. It was set
to full duplex. We assumed that since packets were successfully
being transferred between this pair of Ethernet ports, they must
have mutually agreed upon a duplex setting. We tried setting Ethernet
1/0 on the Cisco to autonegotiate. Surprisingly, this was not
an option for this specific Cisco 10-mbps port; it had to be set
to either full or half duplex. Meanwhile, the other
Cisco interface, Fast Ethernet 0/0, could be set to the autonegotiate
duplex option. We changed the Cisco Ethernet interface, Ethernet 1/0,
to half duplex, and re-ran our ftp test from galipoli to
the POP3 server. Galipoli was still set up with a NAT'ed IP
address, and was also still directly connected to 3COM 100-mbps
switch (C). We were surprised to see an immediate difference. The
ftp finished with lightning-fast speed. Our entire problem
disappeared. When we switched the Cisco's Ethernet interface,
Ethernet 1/0, back to full-duplex mode, the problem reappeared.
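For reference, the duplex checks and the eventual fix looked roughly
like the following. The ndd parameter names are the standard ones
for the Solaris qfe driver (instance selects port 2 of the quad card,
adv_autoneg_cap shows whether autonegotiation is being advertised,
and link_mode and link_speed report the duplex and speed currently
in use), and the IOS lines are a sketch of the interface change rather
than a verbatim capture of our session:

firewall# ndd -set /dev/qfe instance 2
firewall# ndd -get /dev/qfe adv_autoneg_cap
firewall# ndd -get /dev/qfe link_mode
firewall# ndd -get /dev/qfe link_speed

2621# configure terminal
2621(config)# interface Ethernet 1/0
2621(config-if)# half-duplex
2621(config-if)# end
2621# write memory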
We finally concluded that the network issues resulted from
a duplex mismatch between the Sun Ethernet interface, /dev/qfe2,
and the Cisco Ethernet interface, Ethernet 1/0. Upon further research,
we found that the Cisco router's 10-mbps Ethernet port, Ethernet
1/0, could only communicate in full duplex if the other end of its
connection was also hard-set to full duplex rather than left to autonegotiate.
Our initial suspicions of a problem specific to the NT computer
galipoli, and of Ethernet issues related to the 100-mbps 3COM switches,
were incorrect. The Cisco IOS NAT configuration had been installed
properly and was functioning as expected. But, by adding the Cisco
2621 router to the LAN (to provide the NAT functionality), we also
introduced the Cisco router's accompanying Ethernet ports to
the LAN. During initial setup of the Cisco 2621, we did not notice
the limitations of the Cisco 10-mbps Ethernet card, and its inability
to operate in an autonegotiate duplex mode. As a result, the Cisco
IOS configuration of Ethernet 1/0 was in error, and was the sole
cause of the LAN problems. If galipoli had not been accidentally
plugged into the 100-mbps 3COM switch, it might have taken us
much longer to realize that the Cisco duplex setting was in error
and causing ongoing LAN problems. Instead, with galipoli connected
to the 100-mbps switch, the higher throughput made packet collisions
frequent enough to cause the noticeable POP3 problems originally
reported by the user.
We learned a great deal during this entire multi-vendor troubleshooting
process. We were satisfied with the result and were glad that the
problem had appeared during the initial stages of our NAT implementation.
We were also happy that we actually fixed the problem, rather than
having to create a "workaround" solution.
Tom Podnar has been a Systems Infrastructure Project Manager
and Systems Administrator for more than 10 years. He has a Master's
degree in Computer Science. When not working, he's at the lake
windsurfing. Tom can be contacted at: tompodnar@yahoo.com.