Cover V11, I10


Exploring the Solaris™ Device Tree

Shaun Clowes

New Solaris administrators often struggle trying to find commands to query the hardware attached to their machines. This is hardly surprising given that the number of machines most administrators manage has increased during the past few years, making it more difficult to remember the configuration of them all. Fortunately, there are some very useful system tools such as cfgadm, prtconf, sysdef, and format that can help. You may have used these tools and perhaps wondered how they determine the hardware configuration. The main source of this information is the device tree -- data kept by the kernel describing the hardware attached to the machine.

In this article, I will examine the internals of the Solaris device tree. I will also show how you can use the device tree to create your own special-purpose system tools. Note that I'll describe the tree and its features using terminology from Solaris 8 documentation. Solaris 7 and earlier releases used different terminology but the concepts and the code implementation (with caveats as described) are the same.

The Device Tree

The device tree is kept in kernel memory and contains the kernel's current information about the hardware configuration of the machine. Each node in the tree is either a bus nexus node or a leaf (device) node. A bus nexus node describes a device or machine component that can logically or physically contain other devices. For example, on most machines, there would be a bus nexus node for a PCI bus. Most of the nodes in the device tree are not bus nexus nodes; instead, they describe devices such as network cards. These nodes are referred to as leaf nodes because they are always attached to a bus nexus node and cannot contain other devices. For example, the network card leaf node could be owned by a PCI node but could not have further leaf nodes attached to it. The tree is rooted at a bus nexus node for the hardware platform of the machine. That root is linked to some of the machine's buses, which eventually link to all of the device leaf nodes.

The basic structure of the device tree is shown in Figure 1. The dotted lines in the diagram indicate conceptual links, with bus nexus nodes having links to all of the devices on the bus in a parent/child relationship. As it turns out, the nexus node really only has a link to its first child, which then links to all of its siblings (the black lines represent these real links). The real links will be discussed in a later section.

It is important to note that a device represented in the tree need not be an accessible piece of hardware. Nodes may exist for devices that do not have drivers available as well as for pseudo devices. Pseudo devices are provided by the kernel (or by drivers in the kernel) and exist only logically, not physically; one example is the kmem device (/dev/kmem), which provides access to kernel memory. Pseudo device nodes are attached to a special pseudo bus nexus node, since logical devices clearly can't be attached to a physical bus.

Inside a Node

Each device node contains a variety of information about the device. The node format is very flexible and can accommodate both general information needed by the system and specific information needed by drivers or userland applications. This flexibility means that a node is actually quite a complex data structure, thus I will only cover a subset of the most useful parts of the structure.

The basic information in a node is a set of pointers to the node's parent, sibling, and first child. The parent pointer contains the address of the bus nexus node representing the bus to which this device is attached. The sibling pointer contains the address of the node for the next device attached to the same parent bus, if any exists. The child pointer is only used in bus nexus nodes with attached devices and contains the address of the first node describing a device attached to the bus. These fields form the tree structure. To traverse a bus, the first child node is found using the child pointer from the bus nexus node. Then, all of the other children can be found by following the sibling pointers in each node attached to the bus.
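The pointer layout described above can be sketched with a toy structure. The struct toy_node type, its field names, and the helper functions are illustrative only; they are not the kernel's dev_info structure, which holds far more than this.

```c
#include <stddef.h>

/* Toy stand-in for a device tree node; NOT the kernel's struct dev_info. */
struct toy_node {
    const char      *name;
    struct toy_node *parent;   /* bus nexus node this node is attached to */
    struct toy_node *child;    /* first child (used by nexus nodes only) */
    struct toy_node *sibling;  /* next node attached to the same parent */
};

/* Attach kid to nexus by pushing it onto the front of the child list. */
static void toy_attach(struct toy_node *nexus, struct toy_node *kid)
{
    kid->parent = nexus;
    kid->sibling = nexus->child;
    nexus->child = kid;
}

/* Count a node plus everything below it: follow the child pointer once,
 * then the sibling pointers, recursing into each child's own subtree. */
static int toy_count(const struct toy_node *node)
{
    int total = 1;
    const struct toy_node *kid;

    for (kid = node->child; kid != NULL; kid = kid->sibling)
        total += toy_count(kid);
    return total;
}
```

Note that a nexus node never holds an array of children here; the only way to reach the second and later children is through the first child's sibling chain.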

Each node provides a variety of general information; in particular, each node must provide a name and binding name. Other information that may be provided includes a bus address, driver name, device state, and device instance number. The node name is the most descriptive of the general fields to the observer and can (depending on the device and driver) be a generic name or a meaningful description of the device. The binding name is used by the kernel to find an appropriate driver to handle the device. The bus address describes the device's location on the bus to which it is attached, often in hardware terms. The driver name describes the driver that is managing this device, if one exists and is bound to the device. The device state is normally 0 if the device is functioning correctly, but can indicate whether the device or its parent bus is down. The instance number can be set by the system to differentiate between devices attached to the same driver. As an example of these fields, a typical SCSI disk might have the following field values: name "sd", binding name "sd", bus address "1,0", driver "sd", state "0", instance "1".

The flexibility of the structure is provided in its ability to contain "properties" for each node. Each node has three separate lists of properties (driver, system, and hardware) that contain information related to each aspect of the device. Each list can hold any number of properties, each having a name, a type, and data. The name describes the property being provided. The type describes the format of the data contained in the property. The property can contain a boolean, an array of integers, an array of strings, or an array of bytes. The data held by the property is then interpreted according to the type specified. Obviously, this mechanism is very powerful because it allows virtually any sort of information about a device to be provided to requestors who understand the property names and data contained within.
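One way to picture a property is as a name plus a tagged block of data. The sketch below is a toy model under that assumption; the toy_prop structure, the enum names, and toy_prop_format() are invented for illustration and do not mirror the kernel's or libdevinfo's actual property structures.

```c
#include <stdio.h>
#include <stddef.h>

/* The four data formats named in the text; the enum names are illustrative. */
enum toy_prop_type { TOY_BOOLEAN, TOY_INTS, TOY_STRINGS, TOY_BYTES };

struct toy_prop {
    const char        *name;
    enum toy_prop_type type;
    const void        *data;   /* interpreted according to type */
    size_t             count;  /* number of elements in data */
};

/* Render the property name and its first element as text.
 * Returns what snprintf() returns. */
static int toy_prop_format(const struct toy_prop *p, char *out, size_t len)
{
    switch (p->type) {
    case TOY_BOOLEAN:   /* presence of the property is the value */
        return snprintf(out, len, "%s", p->name);
    case TOY_INTS:
        return snprintf(out, len, "%s=%d", p->name,
            p->count ? ((const int *)p->data)[0] : 0);
    case TOY_STRINGS:
        return snprintf(out, len, "%s='%s'", p->name,
            p->count ? ((const char *const *)p->data)[0] : "");
    case TOY_BYTES:
        return snprintf(out, len, "%s=0x%02x", p->name,
            p->count ? (unsigned)((const unsigned char *)p->data)[0] : 0u);
    }
    return -1;
}
```

The point of the tag is that a consumer who does not recognize a property name can still print or skip its data safely, because the type says how the bytes are laid out.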

Some devices can be directly addressed from userspace through the use of device special files. Obvious examples include the disk block and character files in /dev (e.g., /dev/dsk/c0t0d0s0). These devices have a list of minor nodes attached to their node in the device tree; each minor node describes a device special file that can be used to address the device. Information contained in each minor node includes a name, device type, special file type, and dev_t. The "name" field provides an identifying name for the minor device. The "device type" field identifies the type of device that this file can be used to address (e.g., it might identify the device as a disk). The "special file type" indicates whether this special device file is a character or block file. The dev_t is the combination of a major and minor pair used to create the special file with mknod. For example, a SCSI disk might have a minor node with special file type "character", name "a,raw", device type "ddi_block:channel", and dev_t "32,0" to refer to the character special file for slice 0 (a = 0) of the disk. An administrator could take this information and use "mknod disks0 c 32 0" to create a special file that could be used to refer to this device.

The /devices Directory

This may sound familiar to many readers, even those who have not heard of the device tree before, because a version of the device tree is exported to userland in the form of the /devices directory. The main difference between the in-kernel device tree and the /devices hierarchy is that only devices that are bound to a driver, and thus usable, are present in the /devices tree; the in-kernel tree contains all devices attached to the system, bound or not.

A heavily abbreviated listing of the /devices tree on a fairly typical machine looks like the following:

/devices
/devices/pseudo
/devices/pseudo/icmp@0:icmp
/devices/pci@1f,4000:devctl
/devices/pci@1f,4000
/devices/pci@1f,4000/scsi@3:devctl
/devices/pci@1f,4000/scsi@3:scsi
/devices/pci@1f,4000/ebus@1
/devices/pci@1f,4000/ebus@1/se@14,400000:0,hdlc
/devices/pci@1f,4000/ebus@1/se@14,400000:a,cu
/devices/pci@1f,4000/scsi@3
/devices/pci@1f,4000/scsi@3/sd@0,0:a
/devices/pci@1f,4000/scsi@3/sd@0,0:a,raw
/devices/pci@1f,4000/scsi@3/sd@6,0:a
/devices/pci@1f,4000/scsi@3/sd@6,0:a,raw
It's fairly easy to see the device tree structure in this listing. Each device tree node has a representative name of the form <node name>@<bus address>. For each device node, one file is present for each of the minor nodes attached to the node with a filename of the form <representative name>:<minor node name> (with the correct major and minor for the device). For bus nexus nodes that have child nodes, there is also a directory with name <representative name>, which contains the files representing the child nodes. This listing also highlights the descriptive power of the device tree -- it shows that there is a disk or CD-ROM (managed by the sd driver) attached to the first SCSI bus on the first PCI controller.
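The naming rule can be expressed as a small formatting helper. This is only a sketch of the convention as described above; devfs_component() is a hypothetical name, and treating a missing bus address or minor name as "omit that part" is an assumption made for nodes such as pseudo that appear without an address in the listing.

```c
#include <stdio.h>

/* Build "<node name>[@<bus address>][:<minor node name>]" into out.
 * bus_addr and minor_name may be NULL or empty, in which case that
 * part is left out. Returns 0 on success, -1 if out is too small. */
static int devfs_component(char *out, size_t len, const char *node,
                           const char *bus_addr, const char *minor_name)
{
    int has_addr  = bus_addr != NULL && bus_addr[0] != '\0';
    int has_minor = minor_name != NULL && minor_name[0] != '\0';
    int n;

    n = snprintf(out, len, "%s%s%s%s%s",
                 node,
                 has_addr ? "@" : "",  has_addr ? bus_addr : "",
                 has_minor ? ":" : "", has_minor ? minor_name : "");
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the raw slice of the first disk in the listing, the inputs "sd", "0,0", and "a,raw" produce the final path component sd@0,0:a,raw.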

The kernel always refers to devices through the device tree, and thus through names like those shown in the listing. (Administrative names like c0t0d0s0 have no meaning to the kernel.) In fact, the normal administrative /dev nodes are all symbolic links to the related device in the /devices tree. They are constructed by devfsadmd (or devlinks and associated utilities in older versions of Solaris) merely as a convenience to the administrator. Thus, it's possible for a device to be perfectly functional and usable via its /devices file without having an entry in the /dev tree.

System Tools

As previously mentioned, tools such as prtconf and cfgadm are dependent on the device tree and use it to retrieve most of the information they provide. The following is a heavily edited portion of the output from prtconf -v:

pci, instance #0
     Driver properties:
         name <interrupt-priorities> length <24>
             value <0x0000000e0000000e0000000e0000000e0000000e0000000e>.
     scsi, instance #0
         sd, instance #0
             System properties:
                 name <lun> length <4>
                     value <0x00000000>.
             Driver properties:
                 name <inquiry-product-id> length <17>
                      value 'MAG3091L SUN9.0G'
It is easy to see the relationship to the device tree in this output; in fact, this is basically just a dump of the tree contents. prtconf shows the name and instance number for each node in the device tree, along with any properties in the system, hardware, and driver properties lists. It then lists all the child nodes for bus nexus nodes. Tools like prtconf are very useful, because administrators can use them to query the kernel's current view of the hardware and driver configuration of the machine.

Unfortunately, prtconf isn't particularly useful for querying particular aspects of the machine configuration. It can take some scripting magic to use its output to provide something as simple as a list of all CD-ROM drives in the machine. In Solaris 8, this information can be easily obtained from cfgadm -a; however, it is still useful to write custom tools to obtain information directly from the device tree (a process illustrated in the next sections).

Reading the Device Tree

The first question is how to get at the tree. The tree is stored in kernel memory with the first node at the symbol top_devinfo, which can be found in the sys/ddi_impldefs.h header file. Each node is represented by a dev_info structure, which is also described in the ddi_impldefs.h header file. This means that the tree can be read by user applications (with root privileges) by reading kernel memory through the /dev/kmem character special device file (or indirectly through libkvm routines). As Alexander Golomshtok and Yefim Nodelman described in "Managing Solaris with Kstat" (Sys Admin magazine, October 2001: http://www.samag.com/documents/s=1323/sam0110a/0110a.htm), a seek() on this file positions the file pointer at a location in kernel memory and read() copies the kernel memory to the buffer in user space. Listing 1 shows how the root device node could be read from kernel memory using this technique. (All Sys Admin magazine listings are available at: http://www.sysadminmag.com/code/.)
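The seek-then-read pattern itself is short. The helper below is a generic sketch rather than the code from Listing 1; read_at() is a made-up name, it works on any seekable file descriptor, and whether an off_t can represent a particular kernel address depends on the platform and compile mode, which this sketch does not address.

```c
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes from fd starting at byte offset off.
 * Returns 0 on success, -1 on a seek failure, read error, or short read. */
static int read_at(int fd, off_t off, void *buf, size_t len)
{
    char *p = buf;

    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    while (len > 0) {
        ssize_t got = read(fd, p, len);
        if (got <= 0)
            return -1;          /* error or end of file before len bytes */
        p += got;
        len -= (size_t)got;
    }
    return 0;
}
```

Following a pointer held in one structure means calling such a helper twice: once to fetch the structure containing the pointer, and again with the pointer's value as the offset.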

There are a number of disadvantages to reading the device tree directly from kernel memory. One major issue is that there is no mechanism available for the reader to ensure the tree isn't modified while it is being read. This can lead to inconsistent or incomplete data being returned. Another problem is that the program then becomes tied to the word size of the kernel. If the kernel is 64-bit, the program using libkvm to access kernel memory must also be 64-bit, even though 32-bit applications can run under 64-bit kernels. Finally, accessing data this way is simply clunky, particularly reading pointers and then the data they point to. What is the alternative?

Introducing the devinfo Library

The devinfo library (libdevinfo) provides a suite of routines that can be used to query the device tree. It is used by most of the system tools mentioned throughout this article to access the information in the tree. There are two versions of libdevinfo. The version distributed with Solaris 2.5.1 and 2.6 is entirely undocumented; it was rewritten, documented, and then distributed with Solaris 7 and 8. This means that for users of Solaris 7 and 8, it's quite simple to retrieve information from the tree (as I'll demonstrate), but what about users of 2.5.1 and 2.6?

The obvious answer is to use the kmem interface as described in the previous section. It's reasonably straightforward (but less convenient) to perform the same functions that are possible under the later version of libdevinfo using this method. The undocumented libdevinfo on these systems is provided exclusively for the use of the system tools, and the only documentation I have found for it is an old newsgroup posting (http://groups.google.com/groups?q=libdevinfo&start=30&hl=en&rnum=31&selm=5rr0id%245qt%241%40nnrp.cs.ubc.ca). I have only used the kmem technique on Solaris 2.5 and 2.6, because libdevinfo offers very little (except perhaps being a little more convenient). It uses kmem and therefore inherits all the problems mentioned previously. Furthermore, it doesn't offer the same amount of useful functionality as its successor, which I will discuss shortly.

It may come as no surprise to learn that the more recent libdevinfo does not access kernel memory through kmem. Instead, Solaris 7 and 8 include a special devinfo driver, which exports a devinfo device that is addressed through an exported device special file called /devices/pseudo/devinfo@0:devinfo,ro. Routines in libdevinfo can be used to retrieve a complete and consistent snapshot of the device tree through this file and root privileges aren't even required under Solaris 8. The other library routines can then be used to interrogate the snapshot and read information from the tree.

Using libdevinfo

Listing 2 shows a fairly simple program that demonstrates many of the features and functions available through libdevinfo, which I'll use to discuss how the library can be used. I'll also present a simpler program that uses libdevinfo to solve a particular problem. The listing can be compiled using the GNU C compiler (gcc) with the following command (assuming the source is saved as devinfo.c):

gcc -g -o devinfo devinfo.c -ldevinfo
The output of the program is not pretty, but it is information-packed and helps illustrate what is in the device tree. Some heavily edited output from the program looks like this:

SUNW,Ultra-60 bind(SUNW,Ultra-60) bus() drv(rootnex) inst(-1) state() \
drvprop(pm-hardware-state) sysprop(relative-addressing)
   packages bind(packages) bus() drv() inst(-1) state(drvdet) \
   terminal-emulator bind(terminal-emulator) bus() drv() inst(-1) \
   state(drvdet)
   pci bind(pci108e,8000) bus(1f,4000) drv(pcipsy) inst(0) state() \
   drvprop(interrupt-priorities) cminor(devctl ddi_ctl:devctl)
      network bind(SUNW,hme) bus(1,1) drv(hme) inst(0) state() \
      hwprop(latency-timer) hwprop(cache-line-size) cminor(hme ddi_network)
      scsi bind(glm) bus(3) drv(glm) inst(0) state() drvprop(target6-TQ) \
      drvprop(target6-wide) hwprop(latency-timer) cminor(devctl \
      ddi_ctl:devctl:scsi) cminor(scsi ddi_ctl:attachment_point:scsi)
         sd bind(sd) bus(0,0) drv(sd) inst(0) state() \
         drvprop(inquiry-vendor-id) sysprop(lun) sysprop(target) \
         bminor(a ddi_block:channel) cminor(a,raw ddi_block:channel)
The program initially calls di_init() on line 92 to get a snapshot of the current device tree from the base of the tree (/), including all of the subtree with all property and minor node data (DINFOCPYALL). At this point, the library opens the devinfo device and uses ioctls to cause the devinfo driver to lock the device tree against modifications, then create a copy of the device tree and copy it to user space. The function returns the first node in the copied tree in the form of a di_node_t. The library provides a suite of routines for reading information from the node so it isn't normally necessary to know anything about the contents of the di_node_t.

The root node is then given to the examinenode() function on line 98, which dumps information from the node to stdout. The general information from the node is dumped on line 59. The name, binding name, bus address, driver name, instance number, and state are shown, all of which are retrieved easily through the libdevinfo routines di_node_name(), di_binding_name(), di_bus_addr(), di_driver_name(), and di_instance(). Note the conditional expressions used for the bus address and driver name -- remember that the device tree nodes can describe pseudo devices that have no bus address, and devices that have no driver in the kernel and, thus, no driver name.

The final piece of general information, the state, is retrieved through di_state() on line 66 and then passed to the showstate() function for display because the state is a set of individually meaningful bits represented in an unsigned integer. showstate() simply decodes the bits and outputs meaningful states.
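Decoding a bit field of this kind usually comes down to a table of bit/label pairs. The sketch below is not the showstate() function from Listing 2. The ST_* names and their numeric values are placeholders chosen for illustration, and every label except "drvdet" (which appears in the sample output above) is invented; a real program should test the state constants defined in libdevinfo.h.

```c
#include <stdio.h>

/* Placeholder bit values for illustration only; use the state
 * constants from <libdevinfo.h> in real code. */
#define ST_DEVICE_OFFLINE 0x0001u
#define ST_DEVICE_DOWN    0x0002u
#define ST_BUS_DOWN       0x0200u
#define ST_DRV_DETACHED   0x8000u

static const struct {
    unsigned int bit;
    const char  *label;
} st_names[] = {
    { ST_DEVICE_OFFLINE, "offline" },
    { ST_DEVICE_DOWN,    "down"    },
    { ST_BUS_DOWN,       "busdown" },
    { ST_DRV_DETACHED,   "drvdet"  },
};

/* Write a comma-separated list of the set bits into out.
 * A state of 0 produces an empty string. */
static void state_to_text(unsigned int state, char *out, size_t len)
{
    size_t i, used = 0;

    if (len == 0)
        return;
    out[0] = '\0';
    for (i = 0; i < sizeof st_names / sizeof st_names[0]; i++) {
        int n;

        if (!(state & st_names[i].bit))
            continue;
        n = snprintf(out + used, len - used, "%s%s",
                     used ? "," : "", st_names[i].label);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* out of room: keep what fits */
        used += (size_t)n;
    }
}
```

A table keeps the decoder in one place, so supporting another state bit means adding a row rather than another if statement.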

Following the basic data about the node, the program prints all of the available properties for the node. This is achieved through the "while" loop on line 68. When di_prop_next() is called with tProp equal to DI_PROP_NIL, it returns the first property, and each successive call returns the next property until none remain, at which point DI_PROP_NIL is returned.

For each property, showprop() is called. showprop() first looks into the property structures to determine which property list they are from -- hardware, system, or driver. This information is not directly provided by libdevinfo and is determined by calculating the property's offset from the first offset for each of the three lists in the snapshot of the device tree (lines 28-35). The routine then prints a prefix indicating the type of property along with the property name from the di_prop_name() routine on line 37. Note that determining the property type this way isn't necessarily a good idea, because it introduces a dependency on internal libdevinfo structures that may change in the future. However, I've shown it as a demonstration of how data above and beyond that provided by the libdevinfo routines can be retrieved directly from the structures.

The routine then prints the minor node information in examinenode() if there are any minor nodes attached to the current device tree node. The code loops through all of the minor nodes on lines 74-76 using a loop exactly like that used to examine the properties in lines 69-71, except that di_minor_next() is used. For each minor node, showminor() is called to dump the minor node information.

showminor() is quite simple. It initially uses di_minor_spectype() to determine whether this device special file is a character or block device on lines 41-44 so it can prefix the output with "c" or "b". Then, it prints the name of the minor (using di_minor_name()) and the device type (using di_minor_nodetype()) on lines 46-47.

At this point, most of the information in the node has been successfully dumped. The routine then recursively loops through the rest of the tree by calling examinenode() on each of the child nodes. On line 81, it calls di_child_node() to get the first child node (if any exists). It then goes into a "while" loop in which it calls examinenode() on the current child node (line 83). This recursion will cause the child node and all of its children to be dumped, since this call to examinenode() will in turn call examinenode() for all its children. di_sibling_node() is called on line 84 to get the sibling of the current child node, and if no further children remain, DI_NODE_NIL will be returned. Thus, the loop will terminate on line 81. Otherwise, the loop continues until all children (and their children) have been dumped.

Although the output is not particularly clean, it is quite informative. For example, it's reasonably easy to see the block minor exported (sd@0,0:a) from the first SCSI disk. In the next section, I will refine this information retrieval process to develop a short program that lists all of the CD-ROM devices attached to the system.

A Useful Sample Tool

Listing 3 uses the device tree to list all the CD-ROM devices attached to the system by finding CD-ROM minor nodes. As before, the program first gets a snapshot of the tree on line 41. Once the snapshot has been generated, the libdevinfo utility function di_walk_node() is used on line 47 to walk the tree. It takes a root node and loops through the entire subtree, calling a specified function with each node it finds. In this case, the parameters specify that the traversal should start from the root of the tree, traverse with children before siblings, and call the checknode function for each node.

checknode is a relatively simple function that loops through all the minor nodes of the device node using di_minor_next() on line 21, as in the previous sample program. For each minor node, it determines the node type with di_minor_nodetype() and checks whether it is DDI_NT_CD or DDI_NT_CD_CHAN on lines 22-24. These two types are used to define CD-ROM minor nodes. If it finds a minor of those types, it knows this device must be a CD-ROM and needs to be displayed. This code could be easily changed to list hard disks by instead checking for DDI_NT_BLOCK or DDI_NT_BLOCK_CHAN.

At this stage, the routine tries to find the administrative name for the CD-ROM (e.g., c0t6d0s0) since the /devices filename is less intuitive for a user. To find the administrative name, checknode() calls ftw() on line 26. ftw() is a standard function in the C library that walks a file tree, calling a specified function for each file found. In this case, the ftw() call traverses the /dev tree, calling showpath() for each file found.

showpath() simply compares the dev_t (or the major and minor) for the file to the dev_t specified in the device tree minor node on line 9. If they are the same, then this file is a block special file for the CD-ROM in question and is thus the administrative device special file and name for the CD-ROM. In this case, the function prints the name of the file and returns 1, which will cause the ftw() call to end. If the dev_ts do not match, the function simply exits with 0, thus causing the ftw() file tree walk to continue.
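The matching step can be sketched independently of libdevinfo. The code below is not Listing 3; match_rdev(), find_special(), and the file-scope variables are hypothetical names, and unlike the showpath() described above, this sketch accepts either a block or a character special file whose dev_t matches.

```c
#include <ftw.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* ftw() has no user-data argument, so the search target and the
 * result live in file-scope variables. */
static dev_t want_rdev;
static char  found_path[1024];

/* Callback: return 1 (which stops the walk) at the first special file
 * whose dev_t equals want_rdev; return 0 to keep walking. */
static int match_rdev(const char *path, const struct stat *sb, int flag)
{
    if (flag != FTW_F)
        return 0;
    if (!S_ISBLK(sb->st_mode) && !S_ISCHR(sb->st_mode))
        return 0;
    if (sb->st_rdev != want_rdev)
        return 0;
    snprintf(found_path, sizeof found_path, "%s", path);
    return 1;
}

/* Walk the tree under root looking for rdev. Returns 1 with found_path
 * set on a match, 0 if nothing matched, -1 if ftw() itself failed. */
static int find_special(const char *root, dev_t rdev)
{
    want_rdev = rdev;
    found_path[0] = '\0';
    return ftw(root, match_rdev, 16);
}
```

Because ftw() hands the callback the result of stat() rather than lstat(), symbolic links such as those in /dev are reported with the attributes of the device file they point to, which is what makes the dev_t comparison work.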

Back in checknode(), if the file tree walk did not find a matching file (i.e., it returned 0), there must not be an administrative node for the CD-ROM in /dev. As discussed previously, a fully functional device does not need to have an entry in /dev but will always exist in the /devices filesystem. In this case, the function calls di_devfs_path() on line 27 to get the /devices path to the CD-ROM and prints that instead of the administrative name. Having displayed the name, di_devfs_path_free() is called on line 29 to free the storage allocated by di_devfs_path() to store the path.

Conclusion

The Solaris kernel readily exports an amazing array of information about the device configuration of the machine on which it is running in the form of the device tree. Sun has gone to great lengths to make that information accessible with the devinfo library. This article has demonstrated an array of available functionality, but has not covered the depth of functionality provided. One thing is for sure -- administrators and systems programmers who use libdevinfo to mine the device tree will find that it's an amazing source for hard-to-find information about devices -- from disks to network cards and pseudo devices.

Shaun Clowes has worked in UNIX systems administration for a number of years and is currently working as a systems programmer on a variety of UNIX platforms, including Solaris, his favorite. He can be contacted at: shaun@yap.com.au.