Cover V11, I01

Article
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5

jan2002.tar

RDB -- A UNIX Command-Line Database

Mark Pruett

Sometimes, even a small database is too big. Databases such as mySQL and PostgreSQL, while lightweight and easy to install by commercial standards, may still be more than you need for some straightforward data analysis problems. Some of these databases may also require root access to install, which may not be practical in some environments.

RDB, a relational database system that operates on plain text files using the UNIX command line, is a simple but powerful alternative. It can be quickly downloaded and installed and doesn't require root access. Furthermore, it doesn't use a daemon process to mediate between the data tables and the user.

Walter Hobbs of the RAND Corporation wrote RDB, based on ideas from an article published in the March 1991 issue of Unix Review magazine ("A 4GL Language"), and from the book UNIX Relational Database Management, by Rod Manis, et. al. (Prentice Hall PTR/Sun Microsystems Press). Walter Hobbs has released RDB to the public domain, so it's freely distributable.

RDB Files

RDB differs from most other relational databases in several ways. One difference is that RDB stores data tables as ASCII text files. A more traditional database system, like PostgreSQL, stores data tables in a binary format in files that are hidden from the user. The user accesses the data indirectly, through special programs or through APIs (Application Programming Interfaces).

While insulating the user from direct access to the data files has advantages, it also makes that data more difficult to move from place to place. With RDB, data tables can be moved anywhere, even to other computers, by simply copying the data files. Other relational databases often require you to export the table (ironically, to a text file), then import the file into the new database. RDB tables can be created with any text editor. Figure 1 (people.rdb) shows an example of a simple RDB table.

The first line of the table in Figure 1 is a comment. Comment lines are optional and there can be more than one, but they must come at the beginning of the file, and they must begin with a "#" character. The second line describes the table's column names. Starting with the second line, all data columns are separated by a tab character. In Figure 1, tab characters are denoted by "<t>". The people.rdb table has four columns: LastName, FirstName, Age, and Phone.

The next line describes the table's column types. Traditional databases have many column types, including strings, integers of varying widths, floating-point numbers, dates, and even BLOBs (Binary Large OBjects). RDB has relatively few: string, numeric, and month. The people.rdb table contains mostly string columns, with Age being the lone numeric field. Each tab-separated value describes the column type for its corresponding column from the previous line, so LastName's column type is "25S", FirstName's is "10", etc. RDB uses the letters S, N, and M to describe the three column types: string, numeric, and month. String is the default column type.

The integer value describes the column's width, but this is not an enforced limitation. Instead, the number is used by some of RDB's output formatting commands. Formatting is also the purpose of the ">" character in Phone's field type. The ">" and "<" indicate that a column should be displayed either right or left justified.

RDB data can contain any ASCII characters except for new-lines and tabs and there are no length limitations. The remaining lines of people.rdb are data lines, which are also tab separated. All RDB tables follow this basic format.

Using RDB

Another difference between RDB and traditional relational databases is that it uses a set of command-line programs to manipulate RDB data. Databases such as mySQL access data using Structured Query Language (SQL). SQL uses a SELECT statement to extract information from a data table. For example, if the information from the people.rdb table in Figure 1 was stored in a mySQL database table called people, a SELECT query might look like this:

SELECT FirstName, LastName, Phone
FROM people
WHERE LastName = "Smith"
This query tells mySQL to display three columns (FirstName, LastName, and Phone) from the people table, but only those rows of the table where the LastName field is equal to "Smith". The resulting table would look like this:

FirstName  LastName  Phone
---------  --------  --------
John       Smith     555-2416
Jane       Smith     555-3499
The same results are possible from RDB using two of its command-line programs and UNIX's pipe-and-filter mechanisms:

row LastName eq "Smith" < people.rdb | column FirstName LastName Phone
RDB's row command is equivalent to the WHERE clause in SQL. The column command is similar to the column list that follows the SELECT keyword in the SQL example. It tells RDB which columns to display, and in what order. Notice that the Age field is not listed and therefore it won't appear in the output.

The row command uses the following comparison operators: gt, ge, lt, le, eq, ne, mat, nmat. The last two operators are for regular expression matching. The keywords "and" and "or" can be used with the row command to build more complex queries. It's important also to note that each RDB command expects a valid RDB table as input, and emits a valid table as output. This allows complex database queries to be built by stringing together a collection of RDB commands.

In addition to the row and column commands, RDB includes commands to join tables (jointbl), sort tables by a specific column or columns (sorttbl), and find unique rows by column (uniqtbl). Other RDB commands, such as mergetbl, serve the same purpose that INSERT, UPDATE, and DELETE statements do in SQL, inserting, updating, or deleting rows from an RDB table.

By leveraging the power of the UNIX command line, RDB makes it easy to perform complex data manipulations. Complex queries can be developed and debugged on the command line. These queries can then be saved as shell scripts. This simplifies extending RDB for a specific set of problems.

Joining Tables in RDB

All SQL-based databases allow you to join two tables on a column. RDB also lets you do this. The RDB table in Figure 2 has two columns -- Age and Comment. To join this table to the people.rdb table, use the jointbl command like this:

jointbl Age=Age ages.rdb < people.rdb
The first parameter indicates that we want to join the two tables on the Age column. The second parameter is the table to which we want to join. The first table, people.rdb, is expected from standard input. The result of the jointbl command is shown in Figure 3. Notice that the result is yet another RDB table, which means that the result of this join can be piped into subsequent RDB commands for further processing.

The new table has five columns -- the four from people.rdb, and the Comment column from ages.rdb. But, where people.rdb had three data records, this table only has two because no age in ages.rdb matched the record for Jane Smith. The join only outputs records where the columns matched.

There may be times when you want all the records from the first table to be output, even if they didn't match anything in the second table. This is a master-detail or outer join. To make an outer join with RDB, use jointbl's -md switch:

jointbl -md Age=Age ages.rdb < people.rdb
The table produced (Figure 4) has all three records from the first table, but the Comment column for Jane Smith's record is null. Notice the trailing tab character in the last record.

Formatting RDB Output

While RDB commands operate on simple text files, these tab-delimited files may not be the best way to present information to the user. RDB provides two commands that simplify producing reader-friendly reports. The first and simplest of these is the ptbl command. The ptbl command takes an RDB table from standard input and emits a fixed-column report as output. For example, running our example table through ptbl, like this:

ptbl < people.rdb
gives us output that looks like this:

LastName         FirstName   Age         Phone
---------------  ----------  ---  ------------
Smith            John         25      555-2416
Adams            Frank        27      555-1378
Smith            Jane         34      555-3499
Column widths and alignments are determined by the "column type" line of the RDB table.

The reporttbl command allows more complex report layouts. It uses a separate form file that describes the report design. Figure 5 shows a sample form file, people1.frm, that could be used with the people.rdb table.

The form file describes how to format each row from the RDB table. In this example, each table row will be output onto two lines, the first showing the person's name, and the second showing their age and phone number. Fields beginning with an ampersand character are picture fields, which describe the placement and alignment of individual columns. Following the ampersand will be one or more alignment characters -- "<" for left alignment or ">" for right alignment. On the line immediately following the picture fields is a line listing the RDB column names to which each picture field corresponds.

The form file is passed as a parameter to the reporttbl command, with the RDB table received via standard input:

reporttbl people1.frm < people.rdb
This command yields a report formatted like this:

Name: John      Smith
Age:    25      Phone: 555-2416

Name: Frank     Adams
Age:    27      Phone: 555-1378

Name: Jane      Smith
Age:    34      Phone: 555-3499
The reporttbl form file allows you to define optional page headers, as well as to embed the output from external commands such as date or whois into a report.

Uses for RDB

Programmers who write applications that generate log files can store those logs in RDB format, then use RDB to easily query those logs. RDB files are simply tab-delimited text files, so it's a simple matter to produce log files in RDB format. The hassles of parsing log files are reduced by breaking log files into RDB tables with columns like Date, Time, ApplicationName, Severity, and MessageText.

If you're writing CGI apps in either Perl or PHP, it's simple to read RDB tables and then display the data, which makes RDB easy to use for Web applications. I've used RDB extensively for archiving data collected from electrical substations in near real time. Electrical utilities monitor large numbers of devices, such as breakers and transformers, and often need to save that data for later analysis. RDB works well in this role. Existing log files can easily be converted to RDB format. Once converted, these logs can be queried by date ranges, application name, message content, or any combinations of these fields.

Installing RDB

Installing RDB couldn't be much easier. Retrieve the tarball from the ftp site (ftp.rand.org/pub/RDB-hobbs/RDB-2.6d.tar.gz), unzip and untar it, cd into the rdb directory and type make install. A collection of Perl programs is installed, which means you need to have Perl installed on the target machine. Beyond that, there aren't really any prerequisites. By default, RDB tries to install to the /usr/local/bin directory, but this can be modified by editing the INSTALL_DIR line in the RDB Makefile.

There is one problem you may need to watch for: one of the RDB commands is called "column". This may conflict with the "column" utility included with many UNIX and Linux distributions (in particular, "column" is included with the util-linux package and installed with the Red Hat Linux distribution). You can avoid the problem by renaming one of the column utilities.

RDB doesn't have any daemon running, and there are no databases to create, so once it's installed, it's ready to use. All that remains is to create your RDB data tables, which you can do with a simple text editor.

Limitations of RDB

In practice, RDB works best on databases with predictable growth. Log files grow by appending new records to the end of the file, which is one reason they work well as RDB tables. Likewise, data acquisition applications, where time-stamped data is continuously collected and archived, work well as RDB tables. Both these types of data files grow constantly, but it's seldom necessary to go back later and modify the records.

Also, because of the sequential nature of text file access, large files can be cumbersome within RDB. What constitutes "large" depends on the speed of the file system, the processor, and how long you're willing to wait for a query to complete. One advantage here is that once a file has been queried, subsequent RDB requests on the same file are likely to hit a cached copy of the file in memory. This can significantly cut down on query times. RDB has an indextbl command that can create a separate index file for an rdb table. This can also improve access times, but the index must be recreated each time the table is modified.

If a large amount of data is collected, disk storage may become a problem. Luckily, the UNIX philosophy ("small tools that do one thing well") helps solve this problem too. RDB tables can be compressed, using the gzip utility. This can reduce them to as little as a tenth their original size. When you need to query these files, you can simply access them with zcat, which uncompresses them and sends them to standard output.

Advantages of RDB

RDB tables are simple text files, so the whole UNIX command-line bag of tricks is available when using them. Collections of RDB files can be searched with grep. Output from RDB tables can be mailed using a command-line mail program. Repetitive sequences of RDB commands can be turned into shell scripts.

I use RDB to maintain a table of host names and IP addresses. The table also includes information about host name aliases, the physical location of the hosts, the purpose of the host (e.g., this host is used primarily as a Web server, and that host is a mail server), and the host's machine architecture (Intel, Alpha, etc.).

This data is used by several different programs, but primarily it's used to generate files used by our domain name servers.

My name server requires two files -- one that maps hosts to IP addresses, and another that handles reverse mappings from IP address to host name. I wrote a short Perl script that generates both these files from my hosts' RDB table. That way, I only have to update one file. The portability of RDB means that if I ever have to move the name server to a different machine, I can also move the hosts' RDB table with a minimum of fuss.

RDB can take advantage of the hierarchical directory structure of UNIX to speed data access and minimize file size. Date-stamped data can be stored in directories that indicate the date. For example, a log file for June 10th, 2001 can be stored in the directory /home/xyz/2001/06/10.

RDB data is trivial to exchange with other users. An RDB table can be emailed, intact, as an attachment and once received, it's ready to use. Because data is stored as simple ASCII text, it can move easily between different machines and different operating systems. RDB data files can be imported or pasted into spreadsheet applications such as Microsoft Excel or Star Office.

RDB is easy to share over the Web. You can store RDB tables in a Web directory, then access and query them remotely using a program like lynx. Lynx allows you to dump the source of a Web page (or in our, case an RDB table) to standard output, so lynx can be used to quickly access data over the Internet. You can even join a remote table to a local RDB table.

RDB commands are built on the command line, and the ease of this approach lends itself to experimentation. If you're comfortable with the UNIX command line, and you routinely need flexible data analysis tools, give RDB a try. You can have the power of a relational database without giving up the flexibility of the UNIX command line.

Mark Pruett received a Masters Degree in Computer Science from Virginia Commonwealth University. He has been programming for the past 18 years. Mark currently works for Dominion Virginia Power. He can be reached at mpruett@mediaone.net.