Multi-Vendor LAN Troubleshooting
Tom Podnar
During an infrastructure planning session, our IT department realized
that we were running out of IP addresses more quickly than anticipated.
Given the current rate of expansion, we estimated that we would
be out of usable IP addresses within 90 days. We decided a NAT
(Network Address Translation) implementation would remedy the situation
and selected a Cisco router-based solution as the foundation of
our NAT LAN upgrade project.
We quickly designed several new growth phases for our existing
LAN, and purchased a Cisco 2621 router for the project. Within days,
we integrated the Cisco 2621 into our existing LAN infrastructure
and placed several workstations "behind" the NAT router
for testing. There were no user complaints after two weeks. Everything
appeared to be functioning as expected. We were pleased with the
results and intended to expand the beta program to an additional
group of computers.
LAN trouble, however, often appears when least expected. A member
of our IT staff arrived one day and logged into his NT 4.0 (SP 6a)
client named galipoli. He checked his email using Netscape Communicator
4.7. When his computer contacted our corporate Red Hat Linux POP3
server, the "Receiving Messages" status box appeared,
downloaded the first two messages, and then persisted on the screen.
No error messages were generated, and the session never timed out.
Each POP3 re-attempt was met with the same result. He verified that
his Netscape client user profile was configured properly. Next,
he verified that his network connectivity was fine by pinging the
POP3 server. The configuration and connection all seemed okay. At
this point, he brought the issue to the attention of the rest of
the IT team. There were no reports of similar problems among other
users. Therefore, we thought that the situation was probably an
issue with that specific client's configuration.
We examined the user's mail spool file on the email server,
and discovered that the third message was more than 1 MB. In the
past, we had experienced situations, usually for dial-up users,
where a large email message would cause similar problems. We deleted
the third message in the mail spool, and the client was able to
POP all of the mail off. However, the troubling "third"
message had been sent to the entire IT staff, and no one else on
the team had reported any problems downloading it. To verify that
the problem still existed, we sent another large email to his email
address and copied the entire IT team. Everyone was able to POP
the mail message off, except that user's NT client, galipoli.
For secondary verification of connectivity, we ran telnet
mail 110 while the Netscape mail client was
"hanging" on the POP3 download attempt. The telnet
attempt to the mail server's POP3 port was successful. We still
thought the problem was local to galipoli, so we re-installed the
Netscape Communicator software on it. This did not result in a solution.
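For reference, this kind of manual connectivity check looks roughly
like the following (the greeting banner varies by POP3 server, and
mail here is simply the mail server's hostname):

C:\> telnet mail 110
+OK POP3 server ready
QUIT
+OK

A successful greeting confirms only that a TCP connection to port
110 can be established; it says nothing about whether a large message
can actually be transferred.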
Several members of the IT staff had just finished reading TCP/IP
Illustrated, Volume 1, by W. Richard Stevens (Addison-Wesley),
so we were eager to put our additional TCP/IP know-how to work for
us. As illustrated by Figure 1, the only difference between galipoli
and the other IT team members' workstations was that galipoli
was connected to a 3COM 100-mbps unmanaged switch. This was the
case because 3COM switch (C) had just been returned to us from
3COM after a warranty repair. We were still in the process of verifying
that the switch was operational, so 3COM switch (C) sat at the
user's desk rather than in the wiring closet. We immediately
thought that 3COM switch (C) was having some problems. We removed
it from the network path, and tried to POP a large email again.
The same symptoms appeared. At this point, we were actually unaware
that galipoli was now directly plugged into 3COM switch (A). This
specific 100-mbps switch, 3COM switch (A), was usually reserved
for production servers. The removal of 3COM switch (C) should have
resulted in galipoli being directly plugged into 3COM switch (B).
Instead, due to an oversight and a shortage of free 10-mbps ports,
galipoli's LAN port had been directly connected to 3COM switch
(A) in the wiring closet.
The only good news at this point was that the problem was consistent.
A consistent network problem is a "good problem", because
it's usually easier to track down. The architecture of the
LAN had not changed for almost a year, except for the recent addition
of the Cisco 2621 and its associated NAT subnet. Galipoli had been
assigned a NAT'ed IP address. We had other workstations on
the same NAT subnet as galipoli, which had no problems downloading
large emails from the POP3 server. Because network problems can
commonly be tracked down with a "what changed last" philosophy,
NAT was our prime suspect. We changed galipoli's IP address
back to being non-NAT'ed, but still physically cross-connected
through 3COM 100-mbps switches (A) and (C). The problem vanished
immediately. When galipoli was moved back to a NAT'ed IP address,
the problem immediately reappeared. The only difference was that
when galipoli had a non-NAT'ed IP address, its packets had
a direct route to the firewall and were bypassing the Cisco 2621
router.
We examined the NAT configuration on the Cisco 2621 (IOS version
12.1(6)) and verified that it was correct. To debug further, in
the Cisco IOS, we ran the command show ip nat translations.
This showed that a NAT translation table entry was created for the
POP3 connection:
cisco-2621>show ip nat translations
Pro Inside global Inside local Outside local Outside global
--- 207.238.210.231 172.16.10.9 --- ---
tcp 207.238.210.230:2136 172.16.10.10:2136 207.238.210.5:110 207.238.210.5:110
tcp 207.238.210.230:2137 172.16.10.10:2137 207.238.210.5:110 207.238.210.5:110
Next, we ran show ip nat statistics, with the following result:
cisco-2621>show ip nat statistics
Total active translations: 5 (1 static, 4 dynamic; 4 extended)
Outside interfaces:
Ethernet1/0
Inside interfaces:
FastEthernet0/0
Hits: 7380067 Misses: 155877
Expired translations: 155969
Dynamic mappings:
-- Inside Source
access-list 1 pool cbsw-pool refcount 4
pool cbsw-pool: netmask 255.255.255.240
start 207.238.210.230 end 207.238.210.230
type generic, total addresses 1, allocated 1 (100%), misses 1386
We noticed that the misses count seemed high, and it kept increasing
with each POP3 session attempt. To find out exactly what the misses
counter was used for, we turned to the Cisco technical support Web
site. We found that it was normal for the misses counter to increase.
Since each POP3 session initiated a new NAT'ed network connection
through the Cisco 2621, there would be no pre-existing NAT table
entry for it. Each new connection therefore registered as a miss,
which is a normal part of NAT operation. From this, we concluded
that there were no obvious NAT-related issues causing our problems.
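For completeness, the NAT portion of the 2621's configuration was
essentially of the following form. This is a sketch reconstructed
from the pool name, addresses, and interfaces shown in the output
above; the access-list range and the static mapping for 172.16.10.9
are inferred from the translation table, and the overload keyword
is implied by the single-address pool with port-extended entries:

interface FastEthernet0/0
 ip nat inside
!
interface Ethernet1/0
 ip nat outside
!
ip nat pool cbsw-pool 207.238.210.230 207.238.210.230 netmask 255.255.255.240
ip nat inside source list 1 pool cbsw-pool overload
ip nat inside source static 172.16.10.9 207.238.210.231
access-list 1 permit 172.16.10.0 0.0.0.255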
Galipoli was running NT Server 4.0, so we installed the Network
Monitor Tools Service (packet sniffer) via the NT Control Panel.
From galipoli, we started capturing TCP/IP packets, and initiated
a POP3 download of a large email. The TCP/IP session was established
correctly (SYN, SYN-ACK, ACK). We then observed that the packet
flow containing the data of the email message started, paused for
a second, restarted transmission, paused for about 2 seconds, restarted,
paused for about 3 seconds, restarted, paused for 6 seconds, restarted,
and then paused for about 12 seconds. The pause duration stayed
at a consistent delay length for about a minute, and then began
an exponential increase. We also noticed that the POP3 server's
TCP sequence numbers kept gradually increasing, so it appeared that
data was slowly getting through. However, ACKs generated by both
galipoli and our POP3 server were sporadically being dropped and
never reached their destinations. We let this
run for about 20 minutes, and then canceled the POP3 download. Upon
referring to Richard Stevens' book, we attributed this behavior
to TCP's retransmission timer and its built-in exponential
backoff delays.
We knew that our LAN was not heavily loaded. We speculated that
possibly some Cisco router buffers were configured to be too small
and were getting full. We checked this via the IOS during a POP3
attempt. We ran the commands show memory and show buffers.
This showed that we were barely using any of the maximum capacity
of the Cisco 2621's memory or buffers. Next, we attempted to
duplicate a known-good LAN connection: we moved galipoli to the
10-mbps switch that another NT 4.0 server, atacama, was connected
to and using without any problems. Once the switch port was moved, the POP3
large-email download problem went away, but we noticed another problem.
The total download time for the large email seemed abnormally slow.
This was especially apparent when compared with the non-NAT'ed
POP3 downloads.
Further investigation with the NT Network Monitor showed that sluggish
TCP/IP transfer times were consistent across all of our NAT'ed
clients during large POP3 downloads. The download slowness had been
overlooked, because we seldom received large emails. Our POP3 connections
had always eventually completed within a reasonable amount of time.
It was simply a coincidence that galipoli had been plugged into
a 100-mbps switch in the first place. When galipoli was moved back
to the 100-mbps switch, the original POP3 download issue immediately
reappeared. When galipoli was given a non-NAT'ed IP address
(while still connected to the 3COM 100-mbps switches), the sluggish
TCP/IP transfer times disappeared, and the downloads were successful.
This problem existed across the entire NAT subnet. A single network
device, or, more likely, a combination of several of the devices
in Figure 1, was responsible for creating the network problems. We
had looked at all of the "usual suspects" with no luck.
At this point, a senior member of the IT team theorized that the
problem had more to do with large data transfers in general, rather
than issues specific to POP3 email downloads. We verified this theory
by attempting to FTP a large file from our POP3 email server down
to one of our NAT'ed clients. The results were as follows:
- A non-NAT'ed client attached to 100-mbps 3COM switch (C)
would FTP the file very quickly with a high throughput.
- A non-NAT'ed client attached to 10-mbps 3COM switch (B)
would also FTP the file very quickly with high throughput.
- A NAT'ed client attached to 10-mbps 3COM switch (B) would
FTP the large file, but slower than the non-NAT'ed client.
- A NAT'ed client attached to 100-mbps 3COM switch (C) consistently
could not FTP the large file. We saw similar exponentially increasing
time delays for packet transmissions.
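For reference, each test was nothing more than a binary FTP get of
the same large file from the mail server (after logging in), comparing
the transfer statistics reported by the ftp client at the end of
each download; the file name below is a placeholder:

C:\> ftp mail
ftp> binary
ftp> get large-test-file.zip
ftp> bye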
We decided to do some more packet sniffing. This time, we looked
at one port of the firewall's quad Ethernet adapter, /dev/qfe2.
We used the Solaris snoop command snoop -d qfe2 -v -x 0.
We captured POP3 packets from several sessions, and attempted to
compare the packet data in 10-mbps switch transfers versus 100-mbps
switch transfers. The captured output was strikingly similar, up
to the point where the 100-mbps connection had quickly given way
to exponential TCP/IP delays, and ACKs were not delivered.
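When comparing sessions like this, it also helps to save each capture
to a file so the two traces can be reviewed side by side. A sketch,
with the capture file name as a placeholder:

firewall# snoop -d qfe2 -o /tmp/pop3-100mbps.cap port 110
firewall# snoop -i /tmp/pop3-100mbps.cap -v -x 0 | more

The port 110 filter limits the capture to POP3 traffic, -o writes
the raw packets to a file, and -i replays a saved file through the
same verbose decoding we used on the live capture.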
After finding no additional clues in the packets, we focused on
the Cisco router. The Cisco IOS provides excellent statistics reporting
tools for its Ethernet interfaces. In the IOS, we turned on Ethernet
debugging with the command debug Ethernet. This dumped an enormous
amount of Ethernet packet information to the Cisco console.
There were no noticeable Ethernet errors reported. We now felt strongly
that the network issues existed at the Ethernet level between the
Cisco and the 3COM switches. We turned our primary focus to the
Cisco Ethernet port, Fast Ethernet 0/0. We performed a show interface
FastEthernet 0/0. This statistics dump presented no relevant
errors. We then examined the other Cisco Ethernet interface, Ethernet
1/0. Upon running show interface Ethernet 1/0, we immediately
spotted a large number of CRC input errors:
2621#show interface Ethernet 1/0
Ethernet1/0 is up, line protocol is up
Hardware is AmdP2, address is 0003.6bba.4290 (bia 0003.6bba.4290)
Internet address is 207.238.210.226/28
MTU 1500 bytes, BW 10000 Kbit, DLY 1000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:03:18, output 00:00:03, output hang never
Last clearing of "show interface" counters never
Queueing strategy: fifo
Output queue 0/40, 0 drops; input queue 0/75, 0 drops
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
4427672 packets input, 4065823311 bytes, 0 no buffer
Received 3486 broadcasts, 0 runts, 0 giants, 0 throttles
525 input errors, 525 CRC, 1 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
3899075 packets output, 419300185 bytes, 0 underruns
100 output errors, 34848 collisions, 25 interface resets
0 babbles, 0 late collision, 26313 deferred
42 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
We ran another ftp attempt between galipoli and the POP3 server,
and the CRC counters on Ethernet 1/0 continued to rise. We wondered
why we were getting CRC errors and thought that maybe one of our
category 5 cables was defective, but replacing the LAN cables did
not help. Digging further, we examined the Fast Ethernet
interface, /dev/qfe2, on our Sun Solaris firewall. On the Sun,
we executed netstat -k qfe2. The following detailed Ethernet
statistics were displayed for the interface /dev/qfe2.
firewall# netstat -k qfe2
qfe2:
ipackets 7262169 ierrors 126 opackets 7685391 oerrors 28174 collisions 96857
defer 0 framing 70 crc 27 sqe 0 code_violations 0 len_errors 14
drop 14 buff 0 oflo 0 uflo 0 missed 1 tx_late_collisions 27959
retry_error 0 first_collisions 0 nocarrier 56 inits 113 nocanput 0
allocbfail 0 runt 0 jabber 0 babble 215 tmd_error 0 tx_late_error 0
rx_late_error 0 slv_parity_error 0 tx_parity_error 0 rx_parity_error 0
slv_error_ack 0 tx_error_ack 0 rx_error_ack 0 tx_tag_error 0
rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0 no_rbufs 0
rx_late_collisions 0 rbytes 2647762907 obytes 1926887676 multircv 125 multixmt 0
brdcstrcv 480091 brdcstxmt 4944 norcvbuf 0 noxmtbuf 58
Solaris reported a high number of "transmit late collisions"
errors on /dev/qfe2. In comparison, the other three Ethernet
interfaces on the Sun quad card had transmit late collision counters
of zero. We did some Internet newsgroup archive searching and found
several discussions about mismatched Ethernet duplex settings causing
CRC and late collision problems. In a duplex mismatch, the full-duplex
side transmits whenever it has data; the half-duplex side treats
those transmissions as collisions (often late collisions) and aborts
its own frames, which the full-duplex side then counts as CRC errors.
We knew that the Sun quad card was capable of running 100-mbps
full duplex. The entire card had been set during initial installation
to autonegotiate its duplex. We verified this with the Solaris /usr/sbin/ndd
command, then moved back to the Cisco router. Via the Cisco IOS,
we looked at the Ethernet interface, Ethernet 1/0. It was set
to full duplex. We assumed that since packets were successfully
being transferred between this pair of Ethernet ports, they must
have mutually agreed upon a duplex setting. We tried setting Ethernet
1/0 on the Cisco to autonegotiate. Surprisingly, this was not
an option for this specific Cisco 10-mbps port; it had to be set
to either full or half duplex. Meanwhile, the other
Cisco interface, Fast Ethernet 0/0, could be set to the autonegotiate
duplex option. We changed the Cisco Ethernet interface, Ethernet 1/0,
to half duplex, and re-ran our ftp test from galipoli to
the POP3 server. Galipoli was still set up with a NAT'ed IP
address, and was also still directly connected to 3COM 100-mbps
switch (C). We were surprised to see an immediate difference. The
ftp finished with lightning-fast speed. Our entire problem
disappeared. When we switched the Cisco's Ethernet interface,
Ethernet 1/0, back to full-duplex mode, the problem reappeared.
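For reference, the duplex checks and the eventual fix looked roughly
like the following. The ndd parameter names are the standard ones
for the Solaris qfe driver (instance selects port 2 of the quad card,
adv_autoneg_cap shows whether autonegotiation is being advertised,
and link_mode and link_speed report the duplex and speed currently
in use), and the IOS lines are a sketch of the interface change rather
than a verbatim capture of our session:

firewall# ndd -set /dev/qfe instance 2
firewall# ndd -get /dev/qfe adv_autoneg_cap
firewall# ndd -get /dev/qfe link_mode
firewall# ndd -get /dev/qfe link_speed

2621# configure terminal
2621(config)# interface Ethernet 1/0
2621(config-if)# half-duplex
2621(config-if)# end
2621# write memory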
We finally concluded that the network issues resulted from
a duplex mismatch between the Sun Ethernet interface, /dev/qfe2,
and the Cisco Ethernet interface, Ethernet 1/0. Upon further research,
we found that the Cisco router's 10-mbps Ethernet port, Ethernet
1/0, could only communicate in full duplex if the other end of its
connection was also hard-set to full duplex rather than left to autonegotiate.
Our initial suspicions of a problem specific to the NT computer
galipoli, and of Ethernet issues related to the 100-mbps 3COM switches,
were incorrect. The Cisco IOS NAT configuration had been installed
properly and was functioning as expected. But, by adding the Cisco
2621 router to the LAN (to provide the NAT functionality), we also
introduced the Cisco router's accompanying Ethernet ports to
the LAN. During initial setup of the Cisco 2621, we did not notice
the limitations of the Cisco 10-mbps Ethernet card, and its inability
to operate in an autonegotiate duplex mode. As a result, the Cisco
IOS configuration of Ethernet 1/0 was in error, and was the sole
cause of the LAN problems. If galipoli had not been accidentally
plugged into the 100-mbps 3COM switch, it might have taken us
much longer to realize that the Cisco duplex setting was in error
and causing ongoing LAN problems. Instead, with galipoli connected
to the 100-mbps switch, the higher throughput made packet collisions
frequent enough to cause the noticeable POP3 problems originally
reported by the user.
We learned a great deal during this entire multi-vendor troubleshooting
process. We were satisfied with the result and were glad that the
problem had appeared during the initial stages of our NAT implementation.
We were also happy that we actually fixed the problem, rather than
having to create a "workaround" solution.
Tom Podnar has been a Systems Infrastructure Project Manager
and Systems Administrator for more than 10 years. He has a Master's
degree in Computer Science. When not working, he's at the lake
windsurfing. Tom can be contacted at: tompodnar@yahoo.com.