How to Read a SCSI Bus Trace
Timothy Swenson
When hooking up SCSI devices, like CD-ROMs and tape drives, things dont always work. Granted, SCSI is a standard, but with the variances in SCSI controllers, device firmware, and SCSI drivers, a SCSI device that works well on one system may totally fail on another system.
The typical response for systems administrators is to contact support for both the computer vendor and device vendor. If the systems administrator is lucky, one of the vendors will have seen the problem before and know the answer. If the systems administrator is unlucky, the two vendors will start pointing fingers at each other, and the problem will not be resolved. If a SCSI bus analyzer is available, then the sys admin can get a few SCSI bus traces and figure out the problem.
SCSI Bus Analyzer
A SCSI bus analyzer listens on the SCSI cable and logs a variety of information about what it sees on the cable. The analyzer consists of a hardware component that plugs into the SCSI chain, and a software component that takes the data from the hardware box and displays it in a human-readable format.
A quick browse on the Web shows that there are a nmber of different types of analyzers from several different vendors. Some are add-ons to computers, some are add-ons to logic analzers, and some are stand-alone. Of those that need a computer to work, some are PCI cards and others link via a PCMCIA card slot. For further information see the following manufacturers:
www.i-tech.com
www.ancot.com
www.verisys.com
www.futureplus.com
A Verisys SCSI-View SV-3000 analyzer was used to create the examples in this article.
SCSI Bus Phase Report
A SCSI bus analyzer can generate a number of different reports. The primary report that is of the most value is the SCSI bus phase report. This report shows the different phases that take place on the bus, including the commands with input and output.
In example trace #1 below, the first column is the sequential number of the phase, with the numbers starting at 1 and going up. The second column is the time, in nanoseconds, between each phase. A timing of 1_281_815 is .1281815 seconds. Timing is usually only important when looking at high wait times between phases, which could result in a SCSI bus timeout and reset. The third column is the phase and shows the individual phase and any data passed. Phases can be broken down into three types:
Waiting Phase Bus Free
Bus Control Phases Arbitration, Selection, Reselection
Information Transfer Phases Message In/Out, Command, Data In/Out, Status
Reading Through the Phases
Again, looking at example trace #1 below, here is a simple walk-through of the phases.
Example trace #1:
1 117_180 Bus Free
2 10_560 Arb
3 32_800 Arbwin 0
4 104_800 +Select 0,5
5 0_615 +Sel/Resel End
6 22_030 +MsgOut C0 Identify
7 1_538_600 CMD - Req Sense
03 00 00 00 40 00
8 373_765 DataIn (N)
F0 00 03 00 ....
02 00 00 0A ....
00 00 00 00 ....
31 00 00 46 1..F
(16 Bytes)
9 1_281_815 DEnd 18 Bytes
10 164_250 Status 00 Good
11 9_925 MsgIn 00 Cmd Complete
12 19_729_780 Bus Free
Line 1 Bus Free. This means that the bus is free and is available for use.
Line 2 Arb. Arbitration is taking place. A SCSI device is asking for control of the bus so it can use it.
Line 3 Arbwin 0. This shows that SCSI device 0 has won the arbitration and has control of the bus.
Line 4 Select 0,5. SCSI Device 0 is selecting to talk with SCSI Device 5.
Line 5 Sel/Resel End. Just shows the end of this phase.
Line 6 MsgOut C0 Identify
Line 7 CMD Req Sense. CMD is used to mark each SCSI command. In this case it is the Request Sense command. The numbers below this line is the actual hex data sent with the Request Sense command.
Line 8 DataIn. Since the previous command/phase requested data, the DataIn phase shows that data being passed back to the requesting device. The following lines are the hex data being passed back. See a SCSI book for exactly what each value means.
Line 9 DEnd 18 Bytes. This shows that the DataIn phase had ended and that 18 bytes are returned.
Line 10 Status 00 Good. This means that everything checks out so far.
Line 11 MsgIn 00 Cmd Complete. This means that the command was successfully sent to the device and it successfully responded back.
Line 12 Bus Free. Back to Bus Free to show that the bus is now available for other devices to use.
Disconnect and Reselection
Individual SCSI conversations between SCSI devices (say a controller and a tape drive) are very phase oriented. Not getting back a reply from a command can cause a SCSI Controller to reset the bus, bringing it and all attached devices back to a known state. But, the SCSI protocol does allow for more than one conversation to take place at about the same time. Since tape devices are relatively slow compared to disk drives, there are times when a SCSI conversation is put on hold until a tape device is finished doing what it needs to do. A good example is the rewind command.
Given the size of some tapes, rewinding from the end can easily take quite a few minutes. The Disconnect message is a way for one device to tell another device that it needs some time to complete a command. It is up to the device sending the Disconnect message to re-establish the conversation, via a Reselection.
Example #2:
1 295_215 Bus Free
2 2_440 Arb
3 1_360 Arbwin 0
4 0_930 +Select 0,3
5 0_555 +Sel/Resel End
6 1_950 +MsgOut C0 Identify
7 279_140 CMD - Wrt Filemrks
10 00 00 00 01 00
8 1_920 MsgIn 04 Disconnect
9 817_346_105 Bus Free
10 2_440 Arb
11 1_350 Arbwin 0
12 0_940 +Select 0,2
13 0_565 +Sel/Resel End
14 1_950 +MsgOut C0 Identify
15 425_900 CMD - Mode Sns(6)
1A 00 00 00 94 00
16 0_505 DataIn (W)
0B 85 10 08 ....
85 00 00 00 ....
00 00 00 00 ....
(12 Bytes)
17 76_545 DEnd 12 Bytes
18 1_350 Status 00 Good
19 0_910 MsgIn 00 Cmd Complete
20 1.576_622_755 Bus Free
21 2_330 Arb
22 1_260 Arbwin 3
23 0_950 Resel 0,3
24 0_535 Sel/Resel End
25 53_860 MsgIn 80 Identify
26 11_450 Status 00 Good
27 0_910 MsgIn 00 Cmd Complete
28 278_935 Bus Free
In this example, the controller starts a conversation with the tape device (SCSI ID 3) and sends the command Wrt Filemrks. Once the tape device receives the command, it replies with the message Disconnect.
The bus then goes free. Now the controller starts a conversation with another device (SCSI ID 2). It send the command Mode Sns (Mode Sense). Device ID 2 then responds to the command, and that conversation is complete.
After the bus is free, the tape device wins the arbitration and starts a conversation with the Controller (ID 0). It then does a Resel 0,3, which means a reselection of the previous conversation. The tape device replies with the message it needs to (Line 25) and now the whole conversation is complete (Line 27 Cmd Complete), and the bus goes free again.
If a device sends a Disconnect message and does not respond within certain period of time (6-20 min.), a timeout value can be reached, and the bus can be reset. A Bus Reset will cause all of the other devices on the same bus to reset, interrupting any data transfer. This is especially bad if tape devices share the same bus with disk drives. Keeping track of the start and finish of commands can show a command being dropped or show that all commands are finished properly.
Bus Reset and Disconnect
The previous example showed a Disconnect when all went well. In this example, not all goes well.
Example Trace #3:
1 2_580 Arb
2 1_230 Arbwin 0
3 1_230 +Select 0,3
4 1_230 +Sel/Resel End
5 3_272_100 +MsgOut C0 Identify
01 02 03 01
Wide Negotiation Width:16 bits
6 22_500 MsgIn 01 02 03 01
Wide Negotiation Width:16 bits
7 2_499_600 +MsgOut 01 03 01 0C 08
Sync Negotiation Pw:50ns,off:8
8 2_552_910 MsgIn 01 03 01 19 08
Sync Negotiation Pw:100ns,off:8
9 457_410 CMD - Stop/Start
1B 00 00 00 00 00
10 2_970 MsgIn 04 Disconnect
11 382.331_911_740 Bus Free
12 24_750 Bus Reset
In this example, the device SCSI ID 3 received a command Stop/Start then sent the message Disconnect and did not reply to the command in 382 seconds, or 6.5 minutes. It looks like either the Controller or an application has a timeout of 6.5 minutes and reset the SCSI bus. This is fine, unless the operation that the device was asked to perform would take longer than 6.5 minutes. A tape rewind would be an operation that could take a long time to finish. In the case of a system resetting the SCSI bus before a rewind command can finish, the proper thing to do is to change the timeout on the system.
Bus Reset
Sometimes a SCSI bus trace can show that the problem is not really based on the commands going across the bus, but lies elsewhere.
Example Trace #4
1 27.763_462_465 Bus Free
2 2_330 Arb
3 1_260 Arbwin 3
4 0_960 Resel 0,3
5 0_525 Sel/Resel End
6 53_850 MsgIn 80 Identify
7 11_450 Status 00 Good
8 0_920 MsgIn 00 Cmd Complete
9 479.373_184_515 Bus Free
10 24_990 Bus Reset
11 0_000 Bus Free
In this trace, everything seems to look fine. We can see device 3 winning the arbitration, then selecting to talk with device 0 (the SCSI Controller). A message is then sent to the controller, the data is received just fine (Status 00 Good), the command is completed, and the bus goes free.
Now it looks like the bus is reset. A bus reset can occur for a variety of reasons, but usually because some timeout value has been reached. In this example trace, there is a corresponding message in the system messages file (/var/adm/SYSLOG or /var/adm/messages):
Dec 2 17:26:46 6A:<host> unix: ql2d6: \
<6>SCSI command timeout: 1 commands:
<6>0x1a <6> Dec 2 17:26:46 6A:<host> \
unix: ql2: <6>Resetting SCSI chip.
To find what is going on, one approach is to look over the whole SCSI bus trace and see whether there are any outstanding commands. In other words, look for any CMD lines with no corresponding Cmd Complete lines. In this example, there were no outstanding commands in the trace, so this provided evidence that the root cause was not at the SCSI bus level, but was higher up on the host system. In this case, it turned out to be a firmware bug with the SCSI controller card.
Check Condition
Some SCSI drivers may not pass all errors that they get to the higher operating system level and act on them silently. A SCSI bus trace can show exactly what the errors are and which command they were in response to.
Example Trace #5:
1 684_120 Bus Free
2 2_400 Arb
3 1_410 Arbwin 0
4 0_960 +Select 0,5
5 0_570 +Sel/Resel End
6 1_950 +MsgOut C0 Identify
7 330_720 CMD - Tst Unit Rdy
00 00 00 00 00 00
8 1_560 Status 02 Check Condition
9 1_020 MsgIn 00 Cmd Complete
10 33_060 Bus Free
11 2_370 Arb
12 1_410 Arbwin 0
13 59_070 +Select 0,5
14 0_570 +Sel/Resel End
15 430_740 +MsgOut 80 Identify
01 03 01 32 08
16 105_210 MsgIn 01 03 01 32 08
17 600_630 CMD - Req Sense
03 00 00 00 20 00
18 43_000 DataIn
0h: 70 00 00 00 02 00 00 00 p.......
8h: 00 00 00 00 00 00 15 00 ........
10h: 00 00 00 00 00 00 00 00 ........
18h: 04 00 01 00 00 00 00 00 ........
20h: 00 00 00 00 86 00 02 00 ........
28h: CE 00 00 00 00 00 27 00 .......
30h: 30 00 00 00 00 00 00 00 0.......
(58 Bytes)
19 31_700 DEnd 58 Bytes
20 1_410 Status 00 Good
21 1_020 MsgIn 00 Cmd Complete
22 2.018_814_360 Bus Free
23 2_370 Arb
24 1_410 Arbwin 0
25 0_960 +Select 0,5
26 0_600 +Sel/Resel End
27 1_950 +MsgOut C0 Identify
28 304_200 CMD - Tst Unit Rdy
00 00 00 00 00 00
29 1_620 Status 02 Check Condition
30 0_990 MsgIn 00 Cmd Complete
31 33_030 Bus Free
In this example, the command Test Unit Ready (Tst Unit Rdy) was sent to the tape drive. Notice that the Status line says Check Condition instead of the usual Good. This means that there was some error in either the sent command or in the execution of the command. Check Condition is the response to the Test Unit Ready command, but it does not specify exactly what the error is. A Request Sense (Req Sense) command must be sent to the tape drive and ask it what the error was.
The Next Level
The next level of analysis, which is beyond the scope of this article, is to analyze the data sent in commands and the data sent in the response. Each command and its response has a defined byte sequence, and each bit or byte has a meaning. By referencing the data returned from a command, especially the return from a Request Sense command, to its definition, the exact meaning of the reply can be determined. At the level mentioned in this article, we can only tell that a Check Condition happened and that a Request Sense command what sent. By digging deeper, the exact error can be derived from the reply bytes of the Request Sense command.
Sources of Additional Information
Besides the sources below, tape drive vendors have some fairly detailed books that cover their tape drives and the specific SCSI commands that they support. Some of these documents cover the SCSI commands in great detail.
Web Pages:
www.scsifaq.org
Books:
The Book of SCSI: A Guide for Adventurers by Peter M. Ridge and David Deming. 1995. No Starch Press.
The SCSI Bus and IDE Interface by Friedhelm Schmidt. 1997. Addison-Wesley.
Programmers Guide to SCSI by Brian Sawert. 1998. Addison-Wesley.
About the Author
Timothy Swenson is a UNIX contractor with Taos Mountain, in Silicon Valley. After receiving a B.S. in Computer Science from Cal State Hayward, Tim spent 8 years as an Air Force Officer, doing sys admin and acquisition work. Tim is also the editor of the QL Hackers Journal and writes for other Sinclair publications. He can be reached at swensont@jack.sns.com.
|