Cover V11, I08

Article

aug2002.tar

Questions and Answers

Amy Rich

Q Our Ultra 60 running Solaris 8 keeps crashing on us, and I want to do some crash dump analysis. Unfortunately, I don't really know how to go about getting the crash dump, or how to read the dump once I have it. Could you offer some advice?

A I highly recommend the book Panic! Unix System Crash Dump Analysis, by Chris Drake and Kimberley Brown (Prentice Hall, 1995, ISBN 0131493868). I'm not sure if both of the authors are still with Sun, but they were when they wrote the book. Although this book is a bit dated, and I really wish they would put out a second edition for the Ultra SPARC family, this book is still the bible for analyzing UNIX system crashes. It covers:

  • The difference between panics and hangs
  • Header files, symbols, and symbol tables
  • A tutorial on how to use adb
  • The stack and stack traces
  • An introduction to assembly
  • An overview of UNIX internals
  • The SPARC processor and instruction set

Here is some basic information to get you started. Make sure that you have savecore enabled, or you're not going to get a crash dump. In Solaris 8, savecore is enabled by default, but if it has been disabled for some reason, you can re-enable it by running:

/usr/sbin/dumpadm -y
This modifies /etc/dumpadm.comf so that savecore automatically runs for every reboot. Also make sure that /etc/rc2.d/S75savecore exists and is intact. If you look at /etc/dumpadm.conf, you'll see that it also stores the directory location of the crash dumps, usually /var/crash/'/bin/uname -n'/. If you have changed machine names, you'll need to change dumpadm.conf to reflect the new machine name. Also be sure that you have a valid disk slice (like /dev/dsk/c0t0d0s1) listed after DUMPADM_DEVICE=, and not part of a DiskSuite or Veritas mirror. This slice should not be smaller than the amount of physical memory you have, or you'll run out of space when writing out the savecore file.

When the machine crashes again (/usr/bin/who -b will tell you when it came back up), you should have a core file to dissect located in /var/crash/'/bin/uname -n'/vmcore.N, where N is a number. The default namelist for your core dump will be /var/crash/'/bin/uname -n'/unix.N, where N will correspond to that of the vmcore file.

Let's say that you've just crashed the machine and your files are /var/crash/myhost.my.com/vmcore.1 and /var/crash/myhost.my.com/unix.1. You can get some basic information by using /usr/sbin/crash. Starting with Solaris 8, you should transition from using /usr/sbin/crash to using /usr/bin/mdb, the modular debugger (contained in packages SUNWmdb and SUNWmdbx). Since most people are familiar with /usr/sbin/crash, though, I'll cover that instead.

Load the core file and the associated namelist:

/usr/sbin/crash -d /var/crash/myhost.my.com/vmcore.1 -n /var/crash/myhost.my.com/unix.1
Now you can do things like print out all entries in the process table:

proc -e
Print out information about open special filenames:

snode -e -l
Print out the tunable parameters:

var
For more information on the capabilities of /usr/bin/crash, see the man page.

You can also use /usr/bin/ipcs to look at the inter-process communication status of things when the core was taken:

/usr/sbin/ipcs -C /var/crash/myhost.my.com/vmcore.1 -N /var/crash/myhost.my.com/unix.1
Q We were looking for a cheap backup solution for our small office and we've decided to use Amanda. Our dump host is a pokey old SPARC 20 running Solaris 8 with a Spectra Logic Treefrog attached to it. I've installed the software on the server and on the clients, but I'm having issues dealing with the tape drive. I am looking for a changer script that would play well with the Treefrog. Do you know of anything that might work?

A First, make sure the Treefrog configuration switch on the back of the unit is set to 9 so that it interfaces with Solaris. Second, make sure you enable the sgen driver and have it create a device for the changer in /dev/scsi/changer. To create the entry, add the following line to /kernel/drv/sgen.conf:

device-type-config-list="changer";
You'll also need to add lines to /kernel/drv/sgen.conf that tell it to search each device for the changer. You can add just one line for the location of your changer, or you can have it search an entire LUN if you think you might move the changer on the SCSI chain. Here's what the syntax would look like for searching all of LUN 0:

name="sgen" class="scsi" target=0 lun=0;
name="sgen" class="scsi" target=1 lun=0;
name="sgen" class="scsi" target=2 lun=0;
name="sgen" class="scsi" target=3 lun=0;
name="sgen" class="scsi" target=4 lun=0;
name="sgen" class="scsi" target=5 lun=0;
name="sgen" class="scsi" target=6 lun=0;
name="sgen" class="scsi" target=7 lun=0;
name="sgen" class="scsi" target=8 lun=0;
name="sgen" class="scsi" target=9 lun=0;
name="sgen" class="scsi" target=10 lun=0;
name="sgen" class="scsi" target=11 lun=0;
name="sgen" class="scsi" target=12 lun=0;
name="sgen" class="scsi" target=13 lun=0;
name="sgen" class="scsi" target=14 lun=0;
name="sgen" class="scsi" target=15 lun=0;
Once you've added the lines to /kernel/drv/sgen.conf, halt the machine, attach the tape drive and power it on, then reboot the machine. If you don't see a device entry in /dev/scsi/changer/, try running /usr/sbin/devfsadm to scan for new devices.

Once the library is properly configured, install mtx (http://mtx.sourceforge.net/) to talk to it. Once you're able to control the library with mtx, you can hook it into Amanda.

You could probably use the chg-mtx or chg-scsi scripts that come with Amanda, but I found the following changer script while searching the amanda-users mail archive (http://www.amanda.org/). chg-spectra was written by Stephen Carville specifically for controlling the Treefrog with mtx. The GNU GPL that the software references is available at http://www.gnu.org/copyleft/gpl.html. I've tested this changer script, and it seems to work well. You may need to change a couple of paths, depending on where you installed Amanda, but this should work pretty much right out of the box:

#!/usr/bin/perl -w
#
# Tape changer glue script for a Spectra 2000 tape changer
# Does not include the 'clean' option
#
# Version 1.0
# Copyright (C) 2001 Stephen Carville
#                    Unix and Network Administrator
#                    Ace Flood USA
#                    stephen@totalflood.com
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to:
# Free Software Foundation, Inc.
# 59 Temple Place, Suite 330,
# Boston, MA  02111-1307  USA

use strict;
# use DB_File;

# paths to amanda directories
my $PREFIX = "/usr/local";
my $SBIN = "$PREFIX/sbin";
my $BIN = "$PREFIX/bin";
my $LIBEXEC = "$PREFIX/libexec";

# debugging files -- set DEBUGDIR to "" to disable debugging output
my $DEBUGDIR = "/usr/local/etc/amanda/tmp";
my $DBGFILE = "changer.debug";

# set a conservative path.  All commands are full paths so an empty path
# would be OK
my $PATH = "/bin:/sbin";

# commands -- change these to suit your configuration
my $MT =  "/usr/bin/mt";
my $MTX = "/usr/local/sbin/mtx";
my $AMGETCONF = "$SBIN/amgetconf";
my $MAILER = "/usr/bin/mailx";

# ----------- end of user configuration --------------
# exit codes
my ($SUCCESS,$ERROR,$BADERROR) = (0,1,2);

my ($cmd,$parm,$line,$rtn,$status);
my ($tapedev,$changerdev,$changerfile,$mailto);
my ($cleanfile,$accessfile,$slotfile,$labelfile,$dbgfile);

# careful! - global!
my ($firstslot,$lastslot,$cleanslot,$havereader);

if (defined $ARGV[0]) {$cmd = $ARGV[0];} else {$cmd="";};
if (defined $ARGV[1]) {$parm = $ARGV[1];} else {$parm = "";};

# get rid of any leading '-' in the command
$cmd =~ s/-//;

# print "$cmd,$parm\n";

# set the execution path
$ENV{PATH} = $PATH;

# get some values from the configuration file -- careful these are
# global
$tapedev = '$AMGETCONF tapedev'; $tapedev =~ s/\n//;
$changerdev = '$AMGETCONF changerdev'; $changerdev =~ s/\n//;
$changerfile =  '$AMGETCONF changerfile'; $changerfile =~ s/\n//;
$mailto = '$AMGETCONF mailto'; $mailto =~ s/\n//;

# mtx doesn't always need the changer device but it doesn't hurt
$MTX .= " -f $changerdev";
# OTOH, mt usually does need the tape device
$MT .= " -f $tapedev";

# set the debug file
if ( -d $DEBUGDIR) {
  $dbgfile = "$DEBUGDIR/$DBGFILE";
} else {
  $dbgfile = "/dev/null";
}
# this makes it easier to tell what happened
debugmsg ("==========");

# print "$tapedev,$changerdev,$changerfile\n";

# I only use barcodes and conf but I include the rest JIC.
$cleanfile = "$changerfile-clean";
$accessfile = "$changerfile-access";
$slotfile = "$changerfile-slot";
$labelfile = "$changerfile-barcodes";
$changerfile .= ".conf";

# get values from the changerfile
open FILE, $changerfile or die "$BADERROR: Unable to open $changerfile";

foreach $line (<FILE>) {
  $line =~ s/\n//;
  $line =~ s/\s+//g;
  if ($line =~ m/^\#/ || $line eq "") {
    next;
  }
  $line =~ m/(\w+)=(\d+)/;
  # no tricky way to do this?
  if ($1 eq "firstslot") {$firstslot = $2; next; }
  if ($1 eq "lastslot") {$lastslot = $2; next; }
  if ($1 eq "cleanslot") {$cleanslot = $2; next; }
  if ($1 eq "havereader") {$havereader = $2; next; }
}

# print "$firstslot,$lastslot,$cleanslot,$havereader\n";

# branch to the appropiate subroutine
# how are we callled? slot, info, reset, eject, label, search, clean
if ($cmd eq "slot") {
  ($rtn,$status) = slot ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "info") {
  ($rtn,$status) = info ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "reset") {
  ($rtn,$status) = rset ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "eject") {
  ($rtn,$status) = eject ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "label") {
  ($rtn,$status) = label ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "search") {
  ($rtn,$status) = search ($parm);
  print "$status\n";
  exit $rtn;
}
if ($cmd eq "clean") {
  ($rtn,$status) = clean ($parm);
  print "$status\n";
  exit $rtn;
}

# if get this far cmd is no good
print "Unkown option: $cmd\n";
exit $BADERROR;

# command subroutines

# move a tape from a slot to the tape drive
sub slot {
  my ($parm) = @_;
  my ($usedslot,$barcode,$loadslot);

  ($usedslot,$barcode) = status ();
  $loadslot = -1;

  debugmsg ("SLOT->load tape from slot $parm");

  # what were we asked to do?
  # current -- return current loaded slot or load first slot
  if ($parm eq "current") {
    if ($usedslot < $firstslot) {
      $loadslot = $firstslot;
    } else {
      return $SUCCESS,"$usedslot $tapedev";
    }
    return load ($loadslot);
  }

  # next or advance -- load the next tape or first slot
  if ($parm eq "next" or $parm eq "advance") {
    unload ($usedslot);
    $loadslot = $usedslot+1;
    if ($loadslot > $lastslot) {
      $loadslot = $firstslot;
    }
    return load ($loadslot);
  }

  # prev -- load the previous or last slot
  if ($parm eq "prev") {
    unload ($usedslot);
    $loadslot = $usedslot-1;
    if ($loadslot < $firstslot) {
      $loadslot = $lastslot;
    }
    return load ($loadslot);
  }

  # first -- load the first slot
  if ($parm eq "first") {
    unload ($usedslot);
    return load ($firstslot);
  }

  # last -- load the last slot
  if ($parm eq "last") {
    unload ($usedslot);
    return load ($lastslot);
  }

  # clean -- I do not support cleaning. I prefer to set it up
  # myself
  if ($parm eq "clean") {
    return clean($parm);
  }

  # check for a legitimate number
  unless ($parm =~ m/\d+/) {
    debugmsg ("SLOT->bad slot ID $parm");
    return $ERROR,"0 Slot $parm is out of range ($firstslot -- $lastslot";
  }
  if ($parm >= $firstslot && $parm <= $lastslot) {
    if ($parm == $usedslot) {
      return $SUCCESS,"$usedslot $tapedev";
    }
    unload ($usedslot);
    return load ($parm);
  } else {
    return $BADERROR, "0 Slot $parm is out of range ($firstslot -- $lastslot)\n";
  }
  return $BADERROR,"0 Unknown error!\n";
}

# return some information about capabilities
sub info {
  my ($parm) = @_;
  my ($usedslot,$barcode,$str);

  ($usedslot,$barcode) = status();

  debugmsg("INFO -> current slot $usedslot, last slot $lastslot");

  if ($usedslot < 0) {
    $usedslot = 0;
  }
  $str = "$usedslot $lastslot 1";

  # someday check the tape library for havereader
  if ($havereader) {
    $str .= " 1";
  }
  debugmsg ("INFO -> $str");
  return $SUCCESS,"$str";
}

# reset the tape library to a known state
sub rset {
  my ($parm) = @_;
  my ($usedslot,$barcode);

  ($usedslot,$barcode) = status();

  debugmsg ("RESET -> loading tape from slot $firstslot to $tapedev");

  # first eject any tape already loaded
  unload ($usedslot);
  return load ($firstslot);
}

# eject the current tape
sub eject {
  my ($usedslot,$barcode);

  ($usedslot,$barcode) = status();

  debugmsg ("EJECT->unloading slot $usedslot");
  return unload ($usedslot);
}

# add a label to the barcodes file
sub label {
  my ($label) = @_;

  my ($usedslot,$barcode);
  my (%barcodes);

  ($usedslot,$barcode) = status();
  %barcodes = load_labelfile();

  debugmsg("Adding Barcode $barcode and amlabel $label for Slot $usedslot \
            into $labelfile");

  # if the label exists and is synced
  if (exists $barcodes{$label} && $barcode eq $barcodes{$label} ) {
    debugmsg("Barcode $barcode already synced for $label");
  } else {
    # any other condition, add/overwrite it and save the new file
    $barcodes{$label} = $barcode;
    save_labelfile(%barcodes);
  }
  return $SUCCESS, "0 $usedslot $tapedev";
}

# search for a tape in barcodes and load it into the drive
sub search {
  my ($label) = @_;
  my (%labels,$slot,$err);

  debugmsg("SEARCH->searching for tape $label");

  %labels = load_labelfile();

  # if the label is in the file
  if (exists $labels{$label}) {
    $slot = findbybarcode($labels{$label});
  } else {
    debugmsg("SEARCH->tape $label not found in barcode database");
    errormsg("Error with Barcode reader","Tape $label not found in barcode database");
    return $BADERROR,"tape $label not found in barcode database";
  }
  # if the tape was not found in the changer
  if ($slot < 0) {
    debugmsg("SEARCH->tape $label not found in changer");
    errormsg("Error with Barcode reader","Tape $label not found in changer");
    return $BADERROR,"tape $label not found in changer";
  }
  # tape is already loaded
  if ($slot == 0) {
    return $SUCCESS,"$tapedev";
  }
  # otherwise load the tape
  debugmsg("SEARCH->tape $label found at slot $slot");
  sleep (10);
  load($slot);
  return $SUCCESS,"$tapedev";
}

# cleaning not supported yet
sub clean {
  return $ERROR,"0 Cleaning not supported by driver\n";
}

# unload a tape from the drive and replace it in its slot
sub unload {
  my ($slot) = @_;
  my (@result);

  # we know that -1 is empty
  if ($slot < 0) {
    debugmsg ("UNLOAD->$tapedev already empty");
    return $ERROR, "0 Drive was not loaded";
  }
  @result = '$MTX unload $slot 2>&1';
  if ($?) {
    debugmsg ("UNLOAD->error unloading slot $slot");
    return $BADERROR,"0 @result";
  }
  debugmsg ("UNLOAD->unloaded tape to slot $slot");
  return $SUCCESS,"$slot $tapedev";
}

# load a tape into the drive
sub load {
  my ($slot) = @_;
  my (@result,$cntr,$off);

  debugmsg ("LOAD->loading tape from slot $slot");

  @result = '$MTX load $slot 2>&1';
  if ($?) {
    debugmsg ("LOAD->result == @result");
    debugmsg ("LOAD->error in loading tape from slot $slot");
    return $BADERROR,"$slot @result";
  }

  # wait for drive to go online but not forever
  $cntr = 0;
  $off = "";
  while ($off eq "") {
    @result = '$MT status 2>&1';
    $off = grep (/offline/,@result);
    $cntr++;
    # is this the last try?
    if ($cntr > 10) {
      debugmsg ("LOAD->still offline at try # $cntr");
      return $BADERROR,"$slot @result";
    }
    sleep(10);
  }
  # now rewind the tape
  @result = '$MT rewind 2>&1';
  sleep (1);

  # phew!
  debugmsg ("LOAD->$slot $tapedev");
  return $SUCCESS,"$slot $tapedev";
}

#
# load the label database this is a space separated text file
# but I return a hash to make manipulation easier. This should use
# DB_File but not all systems have it installed so I follow the
# KISS principle. returns %hash{$label}=$barcode
sub load_labelfile {
  my (%labels,$line,$lbl,$bc);

  unless (-f $labelfile) {
    return %labels;
  }
  open LABEL, "$labelfile";
  foreach $line (<LABEL>) {
    $line =~ s/\n//;
    ($lbl,$bc) = split /\s+/, $line;
    if (defined $lbl and defined $bc) {
      $labels{$lbl} = $bc;
    }
  }
  close LABEL;
  return %labels;
}

# save the labelfile -- this completely recreates the file!
sub save_labelfile {
  my (%labels,$lbl) = @_;

  # clobber the existing file
  open LABEL, ">$labelfile";
  foreach $lbl (keys %labels) {
    print LABEL "$lbl $labels{$lbl}\n";
  }
  close LABEL;
}

# return the slot for a particular barcode
sub findbybarcode {
  my ($bc) = @_;
  my (@result,$line,$slot);

  @result = '$MTX status 2>&1';

  foreach $line (@result) {
    if ($line =~ /[\w\s]+ Element (0):Full \([\w\s\d]+\):VolumeTag = (.+)/) {
      if ($bc eq $2) {
        debugmsg("FINDBYBARCODE->Label $bc is at slot $1");
        return $1;
      }
    }
    if ($line =~ /\w+\s+Element (\d+):Full :VolumeTag=(.+)/) {
      if ($bc eq $2) {
        debugmsg("FINDBYBARCODE->Label $bc is at slot $1");
        return $1;
      }
    }
  }
  debugmsg("FINDBYBARCODE->Label $bc not found!");
  return -1;
}

# return the slot and barcode of the loaded tape
sub status {
  my (@result,$line,$stat,$el,$bc);

  # get the changer status
  @result = '$MTX status 2>&1';

  # find the line with the tape drive in it
  foreach $line (@result) {
    if ($line =~ m/Data Transfer Element 0:(.+)/) {
      # preserve the slot number and barcode
      $stat = $1;
      last;
    }
  }

  #split the slot number from the barcode
  ($el,$bc) = split /:/,$stat;

  # get the loaded element
  if ($el =~ m/Full \(Storage Element (\d+) Loaded/) {
    $el = $1;
  } else {
    $el = -1;
  }
  if (defined $bc) {
    # get the loaded barcode
    if ($bc =~ m/VolumeTag = (.+)/) {
      $bc = $1;
    } else {
      $bc = -1;
    }
  } else {
    $bc = -1;
  }
#  print "$el,$bc\n";
  return $el,$bc;
}

# add a debugging messsage to the debug file
sub debugmsg {
  my ($message) = @_;

  open DBG, ">>$dbgfile";
  print DBG $message;
  print DBG "\n";
  close DBG;
}

# send an error message to mailto
sub errormsg {
  my ($subject,$message) = @_;

  open PIPE, "| $MAILER -r \"$mailto\" -s \"$subject\" $mailto";
  print PIPE $message;
  close PIPE;
}
Set up Amanda on Solaris 8 with spectra logic tape drive and include changer source.

Q I have a Sun IPX that was running Solaris 7 and getting pretty crufty, so I reinstalled it from the CD. I put on the full+OEM distribution. This machine was booting fine before I reinstalled it, but now I get the following error when I try to boot from the disk:

Boot device: /sbus/esp@0,800000/sd@3,0   File and args:
boot: cannot find misc/krtld
boot: error loading interpreter (misc/krtld)
Elf32 read error.
boot failed
Enter filename [/platform/SUNW,Sun_4_50/kernel/unix]:
Post doesn't show any hardware issues, and I've tried swapping out various bits of the machine to make sure I wasn't missing something. I also tried reinstalling again, paying closer attention to check for any errors. Since the machine will boot fine off the CDROM, I also tried swapping in a different disk and installing onto that. Nothing seems to actually be wrong with the machine, and yet I can't get it to boot off the disk.

A Your IPX is one of the old 32-bit sun4c systems. Based on the information you've given me, it sounds like you're using a disk that's larger than 2G, and your root partition is going past the 2G boundary. Because of limitations in the Openboot PROM, you can't boot any of the 32-bit SPARCs from a root partition that has tracks lying beyond the 2G boundary. On systems with really old PROM revisions (2.5 or previous), you need to make the root partition smaller than 1G.

The PROMs in the newer 64-bit Ultra class machines are capable of having root partitions that go beyond the 2G boundary, but versions of Solaris prior to 2.6 contain a bug that effectively prevents it. Patch 103640-08 (or a later revision) corrects this for Solaris 2.5.1.

Other typical error messages you'll see when going beyond the 1/2G boundary:

bootblk: can't find the boot program
boot: cannot find misc/krtld
Short read.  0x2000 chars read
Read error.
Amy Rich, president of the Boston-based Oceanwave Consulting, Inc. (http://www.oceanwave.com), has been a UNIX systems administrator for more than five years. She received a BSCS at Worcester Polytechnic Institute, and can be reached at: qna@oceanwave.com.