Troubleshooting
Solaris™ Network Performance
Alex Golomshtok
Networks are the bloodstreams of modern computer systems. Today,
nearly all computers are connected to some kind of public or private
network, and it is difficult to imagine a system without at least
some sort of networking capabilities. As computer technology continues
to evolve, the distributed computing model gains more ground, thus
increasing the importance of networks. In fact, today most organizations
rely on their own complex networking structures so much that even
a short period of downtime may easily translate into millions of
dollars of lost revenues.
Modern-day networks are often monstrously complex, convoluted,
and rely on a wide spectrum of technologies. A typical corporate
network, for instance, may bring together thousands of computer
systems from different hardware vendors, running various operating
systems. Monitoring the health of such a network is quite a challenge
and may be impossible without the proper tools. To satisfy growing
demands for reliable management of heterogeneous networks, the Simple
Network Management Protocol (SNMP) was developed and adopted as
a management standard for TCP/IP-based networking systems. SNMP
quickly gained popularity and remains the primary mechanism for
carrying out a multitude of network management tasks, such as network
performance monitoring, fault management, configuration management,
and more.
SNMP
The foundation of SNMP is the database containing the management
data, on which the network management system operates. This database
is commonly referred to as the Management Information Base (MIB).
An SNMP MIB is essentially a tree-like collection of objects, each
representing a managed resource on a network. A network management
system can monitor the state of these objects by reading their properties
and alter the state by modifying these properties. The organization
of an MIB is governed by a standard, called Structure of Management
Information (SMI) [1] -- it outlines the rules for constructing
and defining MIB management objects. Over the years, a few different
MIBs have been developed to address various aspects of network and
system management, such as Relational Database Monitoring MIB and
Mail Management MIB. MIB-II [2], which defines the second version
of the management information base for TCP/IP-based Internets, however,
remains perhaps the most important and the most commonly used MIB
specification. MIB-II defines the following broad groups of management
information:
- System -- General information about the networked system,
such as its identification information, location, uptime, etc.
- Interfaces -- Information describing each of the system's
network interfaces.
- AT -- Information pertinent to the operation of the address
translation (AT) protocol; essentially the contents of the address
translation table.
- IP -- Information pertinent to the operation of the IP protocol
on a given system.
- ICMP -- Information pertinent to the operation of the ICMP
protocol.
- TCP -- Information pertinent to the operation of the TCP protocol.
- UDP -- Information pertinent to the operation of the UDP protocol.
- EGP -- Information pertinent to the operation of the EGP protocol.
- DOT3 -- Information pertinent to the transmission schemes
and access protocols at each system interface.
- SNMP -- Information pertinent to the operation of the SNMP
protocol on a given system.
Clearly, MIB-II covers many aspects of TCP/IP-based network
management and allows for building comprehensive management systems.
However, there is one question that MIB specifications do not quite
answer -- where does the management data come from? SNMP, as
powerful and flexible as it is, is just a mechanism for disseminating
and sometimes altering the management information; at no time
is it responsible for actually collecting and maintaining the data.
Streams
The answer lies with the TCP/IP stack. Since System V Release 3 (SVR3),
UNIX has been equipped with the streams [3] mechanism -- an elegant
and flexible framework for UNIX System communication services. In
the true spirit of UNIX, the streams model encourages the development
of compact modules, representing functional components such that
these modules can then be dynamically loaded and interconnected
to form a fully functional data communication path or stream. Streams
closely resemble the layered structure of typical networking protocols,
and therefore are perfect for implementing protocol stacks.
A stream is a communication link between the user space (or application
program) and the kernel. Typically, an application will create a
stream by opening a streams device, such as /dev/ip, for instance.
When a stream is opened, it consists of a stream head -- the
interface between the stream and the user process, and a stream
driver. An application process may then "push" various
modules onto the stream thus enabling certain services. Each stream
module is typically responsible for carrying out a set of closely
related functional tasks, such as adding network routing information
to user packets. Once a stream is assembled, an application may
initiate a bi-directional data exchange or stream I/O.
The data is passed through the stream in the form of messages.
When a user process passes a message to the stream head, this message
is sent from module to module until it reaches the bottom of the
stack -- the stream driver. In this case, the message is said
to be traveling downstream. Whenever the kernel replies, the data
travels upstream -- each stream module passes the message to
the module above it until it reaches the stream head. In the case of
the TCP/IP stack, not only the data but also control messages can
be sent downstream to either alter the behavior of the stream or
retrieve some sort of management information maintained by the stream
modules.
One control message that can be sent downstream is the option
management request. Typically, the option management request is
delivered to a specific module on the stream, but when retrieving
MIB data, all stream modules receive the request at once, and the
entire universe of operational data, pertinent to all stream modules,
is returned to the application program.
Under Solaris, the majority of management information, described
by MIB-II, can be retrieved directly from the stream using a sequence
of ioctl(2) calls. Many network monitoring utilities, such as netstat(1M)
and most SNMP agents, employ streams option management requests
to gather network statistics. These programs typically construct
a brand new stream for the purpose of obtaining management data,
configure it by pushing the appropriate modules onto the stream
head, send an option management request downstream, and then extract
the statistical data from the returned message.
Frankly, SNMP and a slew of Solaris network monitoring utilities
solve most of the network monitoring problems while successfully
hiding the complexities of streams programming from typical systems
administrators. There are, however, some situations where more control
or more flexibility is desired. Netstat(1M), for instance, although
convenient and easy to use, remains just another program that produces
textual output. This makes it difficult to use netstat as a basis
for custom monitoring solutions.
Although it is certainly possible to use some shell magic to extract
the values of network counters from netstat output, this approach
is unreliable, inefficient, and just plain ugly. Another
problem is that every invocation of netstat creates a new OS
process, thus consuming precious system resources.
Most monitoring tools take periodic snapshots of the statistical
data and calculate deltas over a predefined time interval, so a
shell script that launches netstat every time it needs a sample
of network statistics makes for a very inefficient and expensive
monitoring tool.
SNMP solves most of these problems -- there are numerous APIs
and tools, such as the excellent Net::SNMP Perl module [4], which
could be used to read and even modify the network management information
in a fairly painless fashion. But even SNMP is not perfect. First,
every SNMP-based tool relies on an SNMP agent, which has to be running
on every managed computer system. Second, although "simple"
is part of its name, SNMP is not quite that simple -- programming
SNMP client applications can get fairly involved. Third, there are
certain security implications when running SNMP agents on computing
nodes, connected to public networks. If not configured correctly,
SNMP can provide a wealth of information about your network to a
potential intruder.
Solaris::MIB2
In an attempt to solve some of these problems, we created a simple
yet powerful Perl extension called Solaris::MIB2 [5]. This module
allows easy access to most of the statistical and operational data
maintained by Solaris stream modules, while imposing only a minimal
load on a monitored system. The following few lines of code demonstrate
how easy it is to obtain a value of an arbitrary network counter
-- for example, tcpCurrEstab or current number of TCP connections
in an established state:
use Solaris::MIB2;
$mib = new Solaris::MIB2("/dev/tcp");
print $mib->{tcp}->{tcpCurrEstab}, "\n";
As you can see, all we have to do is create an instance of the Solaris::MIB2
object, passing "/dev/tcp" as a parameter (so that the module
builds the stream over the /dev/tcp device), and use the returned hash
reference to read the value of interest.
As this article will show, many compact and powerful network monitors
can be developed with Solaris::MIB2, although the module has a few
limitations, which users should be aware of. The first, and perhaps
most severe, shortcoming of Solaris::MIB2 is that it, unlike SNMP,
cannot read the management data from a remote computer over the
network. Although it is possible to create a custom network-based
data distribution mechanism, this was not our intention. Those who
look for this kind of functionality should turn to SNMP.
Yet another limitation, which is the result of a conscious design
decision, is read-only access to the management data exposed by
Solaris::MIB2. Unlike SNMP, which is a general-purpose network management
facility, Solaris::MIB2 is intended solely for network
monitoring and, as such, does not allow any modification of
the data on which it operates. While Solaris::MIB2 provides access
to most of the management information described in the MIB-II RFC [2],
it is not fully compliant with the specification and does not implement
some of the groups, such as System, SNMP, or DOT3. In fact, the
module exposes most of the structures defined in /usr/include/inet/mib2.h,
with the exception of some IPv6 tables.
Finally, the interface to the MIB2 data is not implemented as
a tied hash -- in other words, reading a value from the MIB2
object will not trigger the option management request to be sent
downstream. Instead, when the object is first created, the stream
module statistics are read and loaded into the regular hierarchical
hash. Every subsequent refresh operation must be initiated explicitly
using the update function, which is part of the Solaris::MIB2 interface:
use Solaris::MIB2;
$mib = new Solaris::MIB2("/dev/tcp");
while( 1 ) {
    sleep(5);
    $mib->update();    # re-read the statistics of all stream modules
    print $mib->{tcp}->{tcpCurrEstab}, "\n";
}
As mentioned earlier, reading MIB2 statistics is an all-or-none proposition
-- it is impossible to retrieve the values of individual variables
and, whenever an option management request is sent downstream, the
operational data for all stream modules is returned. This
particular feature of the MIB2 interface makes a tied-hash implementation
prohibitively expensive.
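The cost argument can be illustrated with a small, self-contained sketch. It does not use Solaris::MIB2 at all: read_all_counters is a hypothetical stand-in for one full option management request, the counter values are made up, and RefreshOnRead is an invented tied-hash class that refreshes on every read.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one full MIB read; counts how often it runs.
my $full_reads = 0;
sub read_all_counters {
    $full_reads++;
    return { tcpCurrEstab => 12, tcpActiveOpens => 340, tcpPassiveOpens => 75 };
}

# Invented tied-hash class: every FETCH triggers a complete refresh.
package RefreshOnRead;
sub TIEHASH { my ($class) = @_; return bless {}, $class }
sub FETCH   { my ($self, $key) = @_; return main::read_all_counters()->{$key} }

package main;

my @keys = qw(tcpCurrEstab tcpActiveOpens tcpPassiveOpens);

# Tied hash: reading three values costs three full reads.
tie my %tied, 'RefreshOnRead';
my @values = @tied{@keys};
my $tied_reads = $full_reads;

# Explicit snapshot: one full read serves all three values.
$full_reads = 0;
my $snapshot = read_all_counters();
my @same = @{$snapshot}{@keys};
my $snapshot_reads = $full_reads;

print "tied hash: $tied_reads full reads; snapshot: $snapshot_reads\n";
```

Under these assumptions the tied variant performs one complete read per value, while the explicit refresh performs one read per sampling interval, which is the trade-off that motivates the update function.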
To demonstrate the power and flexibility of Solaris::MIB2, I've
provided a few simple examples, designed to illustrate how the functionality
afforded by this module can be applied to real-world network monitoring
problems. The first sample program, called pif, attempts to mimic
some of the functionality of the popular UNIX utility arp(1M). The
arp(1M) program displays and modifies the contents of the Internet-to-Ethernet
address resolution tables, used by the address resolution protocol
[6]. For the sake of saving space, the capabilities of this pif
program will be limited to printing the contents of the address
translation or Net-to-Media table, which is an equivalent of running
the arp(1M) utility with the -a command-line switch.
The following is a complete source code listing of pif:
1 #!/usr/local/bin/perl
2
3 use Socket;
4 use Solaris::MIB2;
5
6 $mib = new Solaris::MIB2( "/dev/ip" );
7
8 print "Device IP Address      Mask            Flags  Phys Address\n";
9 print "------ --------------- --------------- ------ ------------------\n";
10
11 foreach my $entry ( @{$mib->{ipNetToMediaEntry}} ) {
12     my $device = $entry->{ipNetToMediaIfIndex};
13     my $host = gethostbyaddr( inet_aton($entry->{ipNetToMediaNetAddress}), AF_INET ) ||
14         $entry->{ipNetToMediaNetAddress};
15     my $flags = ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_PERMANENT) ? "S" : "";
16     $flags .= ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_PUBLISH) ? "P" : "";
17     $flags .= !($entry->{ntm_flags} & Solaris::MIB2::ACE_F_RESOLVED) ? "U" : "";
18     $flags .= ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_MAPPING) ? "M" : "";
19
20     my $mask = sprintf("%u.%u.%u.%u",
21         map( hex("0x$_"), unpack("A2A2A2A2", $entry->{ntm_mask})));
22     my $phys = $entry->{ipNetToMediaPhysAddress};
23
24     printf("%-6s %-15s %-15s %-6s %-20s\n", $device, $host, $mask, $flags, $phys);
25 };
The script uses two extension modules -- Solaris::MIB2, loaded
at line 4, and Socket, loaded at line 3. The Socket module exposes
the inet_aton function, necessary for converting the character
representation of a host IP address to a struct in_addr structure, which
can be consumed by the gethostbyaddr(3NSL) function.
Line 6 constructs a brand new MIB2 object over /dev/ip by passing
"/dev/ip" to the constructor function of Solaris::MIB2.
Note that, under Solaris, read/write access to /dev/ip is limited
to root and members of the sys group; therefore, if not run by
a privileged user, our script will fail. Always using root or another
special user id to run the script is not very convenient, so making
the script set-group-id sys seems like the best solution.
Many UNIX programs, such as passwd(1) or netstat(1M), are set-user-id
or set-group-id, which allows regular users to perform operations
that are typically permitted only to root or other privileged users.
Set-user-id and set-group-id programs are controversial, as many
people consider them inherently unsafe. However, if configured correctly,
these programs provide convenient solutions for many otherwise unsolvable
problems.
Our program, however, is a script, interpreted at run time by
Perl, as opposed to a binary executable such as netstat(1M), which
makes it more of a security concern. First, the script's source
code is more readily accessible and thus can easily be examined
for security vulnerabilities by a potential intruder. But most importantly,
some UNIX kernels, especially the older ones, have a security problem
with set-user-id and set-group-id scripts. When a user executes
a file, where the first line starts with #!path_to_interp,
the kernel translates this into an exec(2) call, invoking the interpreter,
which is identified by path_to_interp and passing the original
script file name and other arguments as parameters. For example,
if the script "/usr/local/bin/foo" starts with #!/bin/ksh
and is invoked as follows:
/usr/local/bin/foo arg1 arg2 arg3
the kernel will actually execute the following command:
/bin/ksh /usr/local/bin/foo arg1 arg2 arg3
Now let's consider the following scenario: a user makes a symbolic
link /tmp/foo_link pointing to /usr/local/bin/foo. In
this case, the kernel executes the following command:
/bin/ksh /tmp/foo_link arg1 arg2 arg3
There is a window between the time the kernel opens the script file
to determine what must be executed and the time when the interpreter
(/bin/ksh) reopens the file to actually execute it. As small as this
window might be, there is a chance that a malicious user could modify
the symbolic link to point to a different file. Thus, if the script
is run set-user-id root, some untrustworthy code will execute with
superuser privileges.
Recent releases of Solaris close this security hole by passing
"/dev/fd/3", a special file that is already opened over
the original script file, to the interpreter instead of the actual
path to the script file, thus eliminating any potential race conditions
and reducing the security risk. The Perl configuration script checks
whether your system supports secure set-user-id scripts using
the following clever trick:
echo "#!/bin/ls" >reflect
chmod +x,u+s reflect
./reflect >flect 2>&1
if /bin/grep "/dev/fd" flect >/dev/null; then
    echo "Congratulations, your kernel has secure setuid scripts!"
else
    echo "setuid scripts are not secure!"
fi
If the Perl installation script detects that your system does not
support secure set-user-id and set-group-id scripts, it will attempt
to build a special set-user-id version of the interpreter, called
suidperl. This special executable allows Perl to emulate the set-user-id
mechanism, because it is invoked every time Perl detects the set-user-id
or set-group-id bit set on a script file. With this in mind, we can
assume that set-user-id and set-group-id Perl scripts are reasonably
secure. Thus, in order for our pif script to run correctly, it should
be made set-group-id 'sys' as follows:
chgrp sys pif
chmod g+s pif
Once the MIB2 object is successfully constructed over /dev/ip, the
script prints out the column headings at lines 8 and 9 and then starts
iterating over the contents of the Net-to-Media table, using a foreach
loop at line 11. Line 12 simply reads the device name, pointed to
by the ipNetToMediaIfIndex hash key. Lines 13 and 14 obtain
the IP address, associated with a particular address translation table
entry and attempt to look up the host name for it, using the gethostbyaddr(3NSL)
function. The next four lines of code -- 15 through 18 --
check the address translation flags using a set of predefined constants
exposed by the Solaris::MIB2 module.
As with the arp(1M) command, our script prints the following four
flags:
- 'S' or static as opposed to dynamic address translation
entry, learned through the ARP protocol.
- 'P' or published. This means that ARP should respond
to requests for the indicated host coming from other machines.
Published entries include those explicitly added with the arp(1M)
'-s' command-line switch as well as the entry for the
local machine.
- 'U' or unresolved. Unresolved entries are those for which
an ARP response has not yet been received.
- 'M' or mapping. This is a special type, used for
the multicast entry 224.0.0.0.
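The flag decoding can be tried out without a Solaris system using the following sketch. The bit values below are placeholders assumed for illustration, and flags_to_string is a hypothetical helper; real code should rely on the ACE_F_* constants exposed by Solaris::MIB2, as pif does.

```perl
use strict;
use warnings;

# Assumed bit values, for illustration only; real code should use the
# constants exposed by Solaris::MIB2 instead of these placeholders.
use constant {
    F_PERMANENT => 0x01,
    F_PUBLISH   => 0x02,
    F_RESOLVED  => 0x08,
    F_MAPPING   => 0x10,
};

# Translate a flags bitmask into the letters printed by pif.
sub flags_to_string {
    my ($flags) = @_;
    my $out = '';
    $out .= 'S' if $flags & F_PERMANENT;
    $out .= 'P' if $flags & F_PUBLISH;
    $out .= 'U' unless $flags & F_RESOLVED;   # "unresolved" is the absence of a bit
    $out .= 'M' if $flags & F_MAPPING;
    return $out;
}

print flags_to_string(F_PERMANENT | F_PUBLISH | F_RESOLVED), "\n";   # prints "SP"
print flags_to_string(0), "\n";                                      # prints "U"
```

Note that 'U' is the only letter produced by a cleared bit, which is why the corresponding test in the script is negated.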
Lines 20 and 21 read the value of the netmask for a particular
address translation table entry. Solaris::MIB2 returns the netmask
as a hex string -- ffffff00, for instance. Our script translates the
hexadecimal number into conventional dotted notation by first
breaking the string apart with the unpack function, prepending "0x"
to each of the four resulting elements to turn them into hex strings,
converting those to integers with the hex function, and then putting
everything back together with the sprintf function.
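As an aside, the same conversion can be written with pack and unpack alone. The sketch below is an alternative to the approach used in pif, not a description of how Solaris::MIB2 works; hex_mask_to_dotted is a hypothetical helper name, and the input is assumed to be exactly eight hex digits.

```perl
use strict;
use warnings;

# Convert an 8-digit hex netmask such as "ffffff00" to dotted notation.
# Returns nothing (undef in scalar context) if the input is malformed.
sub hex_mask_to_dotted {
    my ($hex) = @_;
    return unless defined $hex && $hex =~ /^[0-9a-fA-F]{8}$/;
    # pack 'H8' turns the hex digits into 4 raw bytes; unpack 'C4'
    # reads them back as four unsigned integers.
    return join '.', unpack 'C4', pack 'H8', $hex;
}

print hex_mask_to_dotted('ffffff00'), "\n";   # prints 255.255.255.0
```

Validating the input first avoids silently producing a bogus mask from a truncated or empty string.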
Line 22 simply reads the physical or MAC address, and, finally,
line 24 outputs a formatted address translation entry to the screen.
When run on one of our Solaris systems, the pif script produces the
following output:
Device IP Address      Mask            Flags  Phys Address
------ --------------- --------------- ------ ------------------
hme0   sun2            255.255.255.255        08:00:20:90:c5:b6
hme0   sun5            255.255.255.255        08:00:20:81:69:c4
...
hme0   198.162.31.170  255.255.255.255        00:02:55:f4:1c:79
...
hme0   sun3            255.255.255.255 SP     08:00:20:90:cf:1c
...
hme0   224.0.0.0       240.0.0.0       SM     01:00:5e:00:00:00
which is pretty much identical to the output produced by arp -a.
The next example is a bit more useful. Instead of mimicking the functionality
of an existing program, it demonstrates how Solaris::MIB2 can be used
to build lightweight custom network monitoring solutions. The following
is the complete source code for the program, called "tcpmon":
1 #!/usr/local/bin/perl
2
3 use Solaris::MIB2 ":all";
4 use Time::HR;
5 use Getopt::Std;
6
7 # sample thresholds
8 use constant active => 2.0;
9 use constant retrans_problem => 25.0;
10 use constant listen_problem => 0.5;
11 use constant halfopen_problem => 2.0;
12 use constant outrsts_problem => 2.0;
13 use constant attempt_fails => 2.0;
14 use constant indup_problem => 25.0;
15
16 getopts( "i:h" );
17 die "usage: tcpmon -i<interval> -h\n"
18 if $opt_h;
19
20 $mib = new Solaris::MIB2 q(/dev/tcp);
21 die "failed to create instance of MIB2 object\n"
22 unless $mib;
23
24 $now = undef;
25 $then = gethrtime();
26 %stats_now = ();
27 %stats_then = %{$mib->{tcp}}; # ensure deep copy
28
29 while(1) {
30 sleep($opt_i||5);
31 $mib->update();
32 $now = gethrtime();
33 %stats_now = %{$mib->{tcp}};
34
35 $interval = ($now - $then) * 0.000000001;
36 next unless $interval;
37
38 $tcpInDataBytes =
39 $stats_now{tcpInDataInorderBytes} - $stats_then{tcpInDataInorderBytes};
40 $tcpInDataBytes +=
41 $stats_now{tcpInDataUnorderBytes} - $stats_then{tcpInDataUnorderBytes};
40 $tcpInDataBytes /= $interval;
41
42 $tcpOutDataBytes =
43 ($stats_now{tcpOutDataBytes} - $stats_then{tcpOutDataBytes})/$interval;
44 $tcpRetransBytes =
45 ($stats_now{tcpRetransBytes} - $stats_then{tcpRetransBytes})/$interval;
44 $tcpRetransPercent = $tcpOutDataBytes ?
45 100.0 * $tcpRetransBytes / $tcpOutDataBytes : 0.0;
46
47 $tcpOutRsts = ($stats_now{tcpOutRsts} - $stats_then{tcpOutRsts})/$interval;
48 $tcpAttemptFails = ($stats_now{tcpAttemptFails} - $stats_then{tcpAttemptFails})/$interval;
49
50 $tcpInDataSegs =
51 $stats_now{tcpInDataInorderSegs} - $stats_then{tcpInDataInorderSegs};
52 $tcpInDataSegs +=
53 $stats_now{tcpInDataUnorderSegs} - $stats_then{tcpInDataUnorderSegs};
52 $tcpInDataSegs /= $interval;
54 $tcpOutDataSegs =
55 ($stats_now{tcpOutDataSegs} - $stats_then{tcpOutDataSegs})/$interval;
54
56 $tcpActiveOpens =
57 ($stats_now{tcpActiveOpens} - $stats_then{tcpActiveOpens})/$interval;
56 $tcpPassiveOpens =
57 ($stats_now{tcpPassiveOpens} - $stats_then{tcpPassiveOpens})/$interval;
57
58 $tcpListenDrop = ($stats_now{tcpListenDrop} - $stats_then{tcpListenDrop})/$interval;
58 $tcpListenDropQ0 =
59 ($stats_now{tcpListenDropQ0} - $stats_then{tcpListenDropQ0})/$interval;
60 $tcpHalfOpenDrop =
61 ($stats_now{tcpHalfOpenDrop} - $stats_then{tcpHalfOpenDrop})/$interval;
61
62 $tcpInDupBytes = $stats_now{tcpInDataDupBytes} - $stats_then{tcpInDataDupBytes};
62 $tcpInDupBytes +=
63 $stats_now{tcpInDataPartDupBytes} - $stats_then{tcpInDataPartDupBytes};
64 $tcpInDupBytes /= $interval;
65 $tcpInDupPercent = $tcpInDataBytes ?
66 100.0 * $tcpInDupBytes / $tcpInDataBytes : 0.0;
67
68 %stats_then = %stats_now; $then = $now;
69
70 print "high retransmissions, fix network.\n"
71 if $tcpRetransPercent >= retrans_problem;
72 if ($tcpListenDrop + $tcpListenDropQ0 >= listen_problem) {
73 print "Listen queue dropouts, speedup accept processing.\n";
74 print "Listen HalfOpenDrops, possible SYN denial attack.\n"
75 if $tcpHalfOpenDrop >= halfopen_problem;
76 }
77 print "Incoming connections refused: port scanner attack.\n"
78 if $tcpOutRsts >= outrsts_problem;
79 print "Attempt failures: can't connect to remote application.\n"
80 if $tcpAttemptFails >= attempt_fails;
81 print "High duplicate input, fix net and remote server retrans.\n"
82 if $tcpInDupPercent >= indup_problem;
83 };
The first few lines of the program (lines 3 through 5) load the necessary
Perl extensions -- Solaris::MIB2, Time::HR, and Getopt::Std. Time::HR
[7] is a very simple module that allows for measuring elapsed time
intervals with nanosecond precision. The public interface of Time::HR
consists of a single function, gethrtime, which under Solaris simply
calls the gethrtime(3C) function. Getopt::Std is a standard
Perl extension, used to process command-line arguments. The tcpmon
program takes two command-line options: -h, which simply prints
out the usage, and -i, which allows the user to override the
default setting of 5 seconds for the sampling interval.
Lines 8 through 14 declare some thresholds, which will subsequently
be used for diagnosing various network problems. Lines 16 through
18 parse command-line arguments and, if the help flag -h
is supplied, print the usage information on the screen and abort
the program. Line 20 constructs a MIB2 object over /dev/tcp. Because
our script is intended for TCP monitoring, we are no longer required
to construct the stream over /dev/ip; hence, there is no need to
run this program set-group-id sys. Once the MIB2 object is
constructed, the program records the value of the high-resolution
timer at line 25 and saves the initial MIB2 statistics into a hash
at line 27.
Once the initialization is completed, the program jumps into an
endless loop at line 29 and suspends itself for the duration of
the sampling interval -- either the value of the -i command-line
argument or the default 5 seconds. Upon the expiration of the interval,
the MIB2 hierarchical hash is refreshed using the update function
at line 31. Then, the current value of the high-resolution timer
and current MIB2 statistics are recorded again at lines 32 and 33.
We then calculate the elapsed time interval in seconds and restart
the while loop if the elapsed time is zero.
Lines 38 through 66 perform most of the work. This is where we
calculate the deltas for the TCP counters over the elapsed time
interval. The algorithm for calculating these deltas is borrowed
from the tcp_class.se module, distributed as part of the SE Performance
Monitoring Toolkit [8].
The following measures are calculated:
- tcpRetransPercent -- Percentage of retransmitted
bytes relative to the total number of bytes transmitted over the
time interval.
- tcpListenDrop and tcpListenDropQ0 -- Number
of connections dropped from the completed connection queue and
incomplete connection queue, respectively.
- tcpHalfOpenDrop -- Number of connections dropped
after the initial SYN packet was received over the time interval.
- tcpOutRsts -- Number of TCP segments sent out that
contained the RST flag, over the time interval.
- tcpAttemptFails -- Number of connections that made
a direct transition to the CLOSED state from either SYN-SENT state
or SYN-RCVD state, plus the number of connections that made a
direct transition from SYN-RCVD state to LISTEN state over the
time interval.
- tcpInDupPercent -- Percentage of duplicate data bytes received
(complete and partial duplicates) relative to the total number of
data bytes received over the time interval.
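All of these measures follow the same pattern: subtract the previous snapshot from the current one, divide by the elapsed time, and guard any percentage against a zero denominator. The following sketch shows that pattern in isolation; rate and percent are hypothetical helper names, and the counter values are made up rather than read from a live system.

```perl
use strict;
use warnings;

# Per-second rate of one counter between two snapshots (hash references).
sub rate {
    my ($now, $then, $key, $seconds) = @_;
    return 0 unless $seconds > 0;
    return ($now->{$key} - $then->{$key}) / $seconds;
}

# Percentage of $part relative to $whole, guarding against division by zero.
sub percent {
    my ($part, $whole) = @_;
    return $whole ? 100.0 * $part / $whole : 0.0;
}

# Made-up sample values, taken five seconds apart.
my %then = (tcpOutDataBytes => 10_000, tcpRetransBytes => 100);
my %now  = (tcpOutDataBytes => 20_000, tcpRetransBytes => 600);

my $out     = rate(\%now, \%then, 'tcpOutDataBytes', 5);   # 2000 bytes per second
my $retrans = rate(\%now, \%then, 'tcpRetransBytes', 5);   # 100 bytes per second
printf "retransmitted: %.1f%%\n", percent($retrans, $out); # prints "retransmitted: 5.0%"
```

The sketch does not deal with counters that wrap around between samples; a production monitor would probably need to account for that.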
Once all measures are calculated, the program saves the current
TCP statistics and the value of the high-resolution timer for subsequent
iterations of the while loop (line 68) and goes on to carry
out a series of checks (lines 70 through 82).
The program compares the retransmission percentage against the
predefined threshold value. Older releases of Solaris (prior to
Solaris 2.6) had problems with TCP retransmission algorithms, thus
high retransmission percentages seen on these systems may go away
when all necessary TCP patches are applied. On newer systems, however,
high retransmission percentage usually implies that some network
hardware is faulty and dropping packets.
The next two checks are, perhaps, the most interesting and have
more to do with intrusion detection than with performance monitoring.
To fully understand what is going on here, one must understand how
TCP establishes connections. The 3-way handshake connection establishment
process [6] assumes that in order to initiate a connection, a client
application will send a SYN (synchronize sequence numbers) segment,
which specifies the server port number to which this client wants
to connect, and the client's initial sequence number (ISN).
The server then replies with a SYN/ACK packet -- the segment
that contains the server's initial sequence number and the
acknowledgement of the client's SYN. Next, the client acknowledges
the server's SYN with another ACK segment. However, if a client
attempts to connect to a port to which no service is listening,
the server will reply with an RST (reset) packet.
Port Scanning
There are a few different techniques that port scanners use to
produce a list of services running on a target machine. The simplest
and most basic form of TCP scanning is the vanilla connect scan. This
technique relies on the connect(3SOCKET) system call to open
a connection to each port of interest on a target machine. If the
connection succeeds, there's a service listening; otherwise,
the port is unreachable. A TCP connect scan is very "loud,"
as most systems will log the failed connection attempts, and very
inefficient, especially over slow connections.
A much better scanning technique is SYN or half-open scanning.
When using this form of scan, a client will send a SYN packet just
like it would do while initiating a normal connection. If the server
replies with SYN/ACK, the port is in service; if RST is received,
the port is unreachable. Upon receiving a reply from the server,
the client immediately sends back an RST packet, thus tearing down
a connection, which never goes into the established state. SYN scanning
is fairly efficient and significantly less visible, as half-open
connection attempts are normally not logged by the target system.
Yet another scanning technique, even more clandestine than SYN
scanning, is FIN scanning. When FIN scanning, a client sends a FIN
(finish sending data) packet to a server. If the RST reply is received,
the port of interest is closed; however, if the FIN packet is ignored
altogether, the port is listening. As we can see, regardless of
the scanning technique used, the server will most likely send RST
replies out if packets arrive on a closed port. Therefore, to detect
a port scan in progress, all our program has to do is to check the
number of RST packets sent out (tcpOutRsts) against a pre-defined
threshold and report a possible port scan if this threshold is exceeded.
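The reasoning can be checked with a toy model. The sketch below sends no packets; server_reply merely encodes the replies described above, and the set of open ports and the scanned range are made-up values. Both function names are hypothetical.

```perl
use strict;
use warnings;

# Toy model of how a server answers a single probe, following the
# descriptions above. Returns the reply type, or undef for no reply.
sub server_reply {
    my ($probe, $port_is_open) = @_;
    if ($probe eq 'SYN') { return $port_is_open ? 'SYN/ACK' : 'RST' }
    if ($probe eq 'FIN') { return $port_is_open ? undef     : 'RST' }
    return undef;
}

# Count the RST segments that a scan of ports 1..$last would provoke.
sub rsts_for_scan {
    my ($probe, $last, %open) = @_;
    my $rsts = 0;
    for my $port (1 .. $last) {
        my $reply = server_reply($probe, $open{$port});
        $rsts++ if defined $reply && $reply eq 'RST';
    }
    return $rsts;
}

my %open = map { $_ => 1 } (22, 25, 80);   # hypothetical listening ports
for my $probe (qw(SYN FIN)) {
    printf "%s scan of 100 ports: %d RSTs sent\n",
        $probe, rsts_for_scan($probe, 100, %open);
}
```

In this model every closed port costs the server one RST regardless of the probe type, so a sweep across mostly closed ports shows up as a burst in the outgoing RST counter.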
SYN Flooding
The next check is a bit more complex, as it attempts to detect
a possible denial of service (DoS) attack -- SYN flooding. Normally,
while handling incoming connection requests, TCP queues incomplete
connections as well as completed connections, which have not been
accepted (via the accept(3SOCKET) system call) by an application
process. The maximum length of the queue is usually limited to prevent
excessive consumption of system memory. Once the limit is reached,
TCP will silently discard all new incoming connection requests until
all pending connections are processed.
When launching a SYN-flooding attack, a client will first issue
a connection request to the server by sending a packet with SYN
flag set. As opposed to a normal SYN packet, however, this one will
have a client IP address spoofed to be that of an unreachable host.
In an attempt to complete the 3-way handshake, a server will keep
trying to send a SYN/ACK packet to this unreachable host for the
duration of an arbitrary timeout interval. Clearly, if the attacking
host sends a few of these SYN requests to a particular port on a
target host (for instance, the telnet port 23), the backlog queue
will fill up with pending connections to the point when the server
starts dropping all new incoming connection requests. Thus, the
server remains practically unusable until it finishes handling all
outstanding connections on its backlog queue -- it is in effect
flooded.
The tcpmon program, therefore, monitors the total number of connections
dropped from the backlog queue (tcpListenDrop + tcpListenDropQ0)
over a period of time, trying to determine whether the backlog limit
has been reached. Backlog queue drops alone may just mean that the
server's accept processing is inefficient. However, when paired with
an excessive number of half-open connection drops (tcpHalfOpenDrop),
they may indicate a SYN-flooding attack in progress.
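One way to express this two-part test is sketched below. It is not the tcpmon code itself: the classify_backlog helper, the sample values, and both limits are hypothetical, and the counters are assumed to arrive as two hash snapshots taken at the start and end of a sampling interval.

```perl
use strict;
use warnings;

# Classifies one sampling interval from two snapshots of the TCP counters.
# Both limits are hypothetical and would need tuning for a real server.
sub classify_backlog {
    my ($prev, $curr, $drop_limit, $half_open_limit) = @_;

    # Connections dropped from either backlog queue during the interval.
    my $queue_drops = ($curr->{tcpListenDrop}   - $prev->{tcpListenDrop})
                    + ($curr->{tcpListenDropQ0} - $prev->{tcpListenDropQ0});

    # Half-open connections discarded during the interval.
    my $half_open = $curr->{tcpHalfOpenDrop} - $prev->{tcpHalfOpenDrop};

    return 'ok'        if $queue_drops <= $drop_limit;      # backlog limit not reached
    return 'syn-flood' if $half_open > $half_open_limit;    # drops plus half-open drops
    return 'slow-accept';                                   # drops alone
}

# Made-up snapshots for illustration.
my %before = (tcpListenDrop => 10, tcpListenDropQ0 => 4,  tcpHalfOpenDrop => 2);
my %after  = (tcpListenDrop => 90, tcpListenDropQ0 => 60, tcpHalfOpenDrop => 300);
print classify_backlog(\%before, \%after, 100, 100), "\n";    # prints "syn-flood"
```

Keeping the "drops alone" case as a separate result mirrors the distinction made above between an inefficient accept loop and an actual attack.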
Recent releases of Solaris are quite resilient to SYN flooding.
Instead of just one backlog queue, Solaris systems feature two.
The first one is the complete connections queue, which holds those
connections for which the 3-way handshake has been completed but
the accept(3SOCKET) call has not yet been issued. The second
is the incomplete connections queue (or Queue 0), which holds one
entry for every SYN packet that has arrived. Once the server receives
an ACK from the client, the connection is moved from the incomplete
queue to the complete queue. The size limit value for the incomplete
connection queue is typically quite large, which makes a server
more resistant to SYN-flooding.
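The interplay of the two queues can be pictured with a toy model such as the one below. It is a deliberately simplified sketch, not a description of the Solaris implementation: the function names, the hash layout, and the tiny queue limits are all invented for illustration.

```perl
use strict;
use warnings;

# Toy model of a listening endpoint with two queues. The limits passed in
# are illustrative only and are not Solaris defaults.
sub new_listener {
    my ($max_q0, $max_q) = @_;
    return { max_q0 => $max_q0, max_q => $max_q,
             q0 => 0, q => 0, drop_q0 => 0, drop_q => 0 };
}

# A SYN arrives: take a slot in the incomplete queue, or count a drop.
sub on_syn {
    my ($l) = @_;
    if ($l->{q0} >= $l->{max_q0}) { $l->{drop_q0}++; return 0 }
    $l->{q0}++;
    return 1;
}

# The final ACK arrives: move one entry from the incomplete queue to the
# complete queue, or count a drop if the complete queue is full.
sub on_ack {
    my ($l) = @_;
    return 0 if $l->{q0} == 0;
    $l->{q0}--;
    if ($l->{q} >= $l->{max_q}) { $l->{drop_q}++; return 0 }
    $l->{q}++;
    return 1;
}

# The application calls accept(): remove one completed connection.
sub on_accept {
    my ($l) = @_;
    return 0 if $l->{q} == 0;
    $l->{q}--;
    return 1;
}

# Five spoofed SYNs that are never followed by an ACK.
my $listener = new_listener(3, 2);
on_syn($listener) for 1 .. 5;
print "q0=$listener->{q0} dropped=$listener->{drop_q0}\n";    # prints "q0=3 dropped=2"
```

In the model, spoofed SYNs only ever occupy the incomplete queue, which is why a generous limit on that queue buys the server time without affecting connections that have already completed the handshake.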
In fact, size limit values for both queues, as well as another
parameter -- connection timeout (which affects the duration
of time the server attempts to contact an unreachable host in our
SYN-flooding scenario) -- can be further tuned to maximize the
server's resistance to SYN floods. Perhaps the easiest way
to view or modify the values of these parameters is via the ndd(1M)
command. The following are the variable names that ndd(1M)
uses to retrieve or set the values of these tunables:
- tcp_conn_req_max_q -- Maximum number of completed
connections waiting to be accepted via accept(3SOCKET).
- tcp_conn_req_max_q0 -- Maximum number of connections
for which the 3-way handshake has not been completed.
- tcp_time_wait_interval -- Maximum amount of time
a TCP socket will remain in the TIME_WAIT state.
Thus to read the value of, for example, the size limit of the
completed connection queue, the following command should be executed:
ndd /dev/tcp tcp_conn_req_max_q
For the adventurous types who want complete programmatic
control over the TCP/IP tunable parameters, however, we created another
Perl module, called Solaris::NDDI [9]. This module does essentially the
same thing as ndd(1M) (although it doesn't call ndd(1M)
internally, relying instead on some convoluted C code), and it can easily
be used from a regular Perl script. For instance, to read the value
of the same tcp_conn_req_max_q variable, the following code
should be used:
use Solaris::NDDI;
# Open the TCP driver; each tunable is then read as a hash key.
my $ndd = Solaris::NDDI->new("/dev/tcp");
print $ndd->{tcp_conn_req_max_q}, "\n";
Having finished with intrusion detection checking, our tcpmon program
looks at two other very simple conditions -- duplicate input percentage
(which is indicative of excessive retransmissions done by remote servers)
and the number of failed attempts to connect to remote applications.
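These last two checks might look something like the sketch below. It rests on several assumptions that the text does not confirm: that duplicate input percentage is computed from the tcpInDupBytes and tcpInInorderBytes counters, that failed connection attempts are taken from tcpAttemptFails, and that the program works on per-interval deltas. The helper names and both limits are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical limits for one sampling interval.
my $DUP_PCT_LIMIT      = 5;
my $ATTEMPT_FAIL_LIMIT = 20;

# Percentage of received in-order data that was received again as duplicates.
sub dup_input_pct {
    my ($dup_bytes, $inorder_bytes) = @_;
    return 0 if $inorder_bytes <= 0;    # nothing received; avoid dividing by zero
    return 100 * $dup_bytes / $inorder_bytes;
}

# Returns a list of warning strings for one interval, given a hash of
# counter deltas (assumed counter names, see above).
sub check_interval {
    my ($delta) = @_;
    my @warnings;

    my $pct = dup_input_pct($delta->{tcpInDupBytes}, $delta->{tcpInInorderBytes});
    push @warnings, sprintf("duplicate input at %.1f%%", $pct)
        if $pct > $DUP_PCT_LIMIT;

    push @warnings, "failed connection attempts: $delta->{tcpAttemptFails}"
        if $delta->{tcpAttemptFails} > $ATTEMPT_FAIL_LIMIT;

    return @warnings;
}

# Made-up deltas for illustration.
print "$_\n" for check_interval({
    tcpInDupBytes     => 100,
    tcpInInorderBytes => 1000,
    tcpAttemptFails   => 3,
});
```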
Obviously, this simple monitor packs a lot of useful functionality
into fewer than a hundred lines of code. To ensure that the program
actually does its job, we launched our favorite port scanner from
a remote host as follows:
nmap -sS sun3
Immediately, tcpmon starts outputting the following message:
"Incoming connections refused: port scanner attack."
Although the example programs described in this article are fairly
rudimentary and lack the robustness expected of a production
application, I hope enough background information has been presented
to demonstrate the simple yet powerful functionality afforded by the
Solaris::MIB2 module. I also hope this article achieves its goal of
stimulating the reader's appetite for building lightweight, flexible
custom network monitors, and that the techniques outlined here can
be used to solve some challenging network-related problems.
References
1. RFC 1155. Structure and Identification of Management Information
for TCP/IP-based networks.
2. RFC 1213. Management Information Base for Network Management
of TCP/IP-based Internets: MIB-II.
3. Sun Microsystems, Inc. STREAMS Programming Guide. Part Number
805-7478-10.
4. Net::SNMP by David M. Town. www.perl.com/CPAN-local, CPAN directory
DTOWN, Net-SNMP-4.0.1-tar.gz
5. Solaris::MIB2 by Alexander Golomshtok. www.perl.com/CPAN-local,
CPAN directory AGOLOMSH, Solaris-MIB2-0.01.tar.gz
6. TCP/IP Illustrated, Volume 1. W. Richard Stevens. Addison-Wesley
Publishing Company, 1994. ISBN 0-201-63346-9.
7. Time::HR by Alexander Golomshtok. www.perl.com/CPAN-local,
CPAN directory AGOLOMSH, Time-HR-0.01.tar.gz
8. SE Performance Monitoring Toolkit. Adrian Cockcroft, Richard
Pettit. www.setoolkit.com.
9. Solaris::NDDI by Alexander Golomshtok. www.perl.com/CPAN-local,
CPAN directory AGOLOMSH, Solaris-NDDI-0.01.tar.gz
Alexander Golomshtok is a project manager and technology specialist
at JP Morgan Chase. He can be reached at: golomshtok_alexander@jpmorgan.com.