Small Databases: From Desktop to Web Server
N. Dean Pentcheff
Some databases are large and dynamic, demanding an integrated
system where the Web front-end and the database back-end share a
common data store. However, many small databases are updated infrequently
and need no updates from the Web at all (examples include
phone lists and reference information). Typically, those small databases
have been cobbled together on desktop systems using a variety of
software, such as Access, Filemaker, and Excel. Because those maintaining
the database often lack experience in information technology, it
is desirable to simplify these updates as much as possible.
It is certainly possible to make "live" connections
from a Web server to some desktop packages so that the data on the
internal network is accessible through a Web front end. Several
factors make this approach impractical in many cases: technical
difficulties of engineering ODBC-to-UNIX connectivity, security
concerns of a direct network connection to a personal workstation,
and the requirement that the database server must be running for
the data to be available. In practice, it's sometimes easier
to provide a complete copy of the data to the Web server than
to link dynamically from the Web server to the database on the
internal network.
The system described in this article allows users to informally
develop and maintain a small database using the desktop tools they
prefer and automatically copy this data to a Web-accessible version
of the database on the Web server. The basis of the system is to
use comma- or tab-separated dumps of the entire database ("CSV"
files) to update the Web-accessible version. The core is a Perl
program. This program reads a configuration file that describes
the data incoming from the desktop machine (uploaded via a Web form)
and updates the Web-accessible version in a MySQL (or similar) database.
This system moves the data from the user's desktop database
to the UNIX SQL-capable database. It is not involved with the actual
Web presentation of the data -- that is assumed to be performed
by other systems on the Web server. Here I will focus solely on
making it easy for a small-scale database manager to get his or
her data into a database so that a Web server can present it.
Examples
A couple of examples may help to illustrate the types of applications
where this approach is useful. In the first example, a small desktop
database containing a technical glossary is made available on the
Web in a searchable form (rather than in its previous incarnation
as a word-processing document):
http://crustacea.nhm.org/glossary
In this case, there are two tables (one holding the glossary terms,
definitions, cross-references, and synonyms; and another holding the
reference authorities). This database is largely "read-only"
and is updated only occasionally by the maintainers. On the desktop,
it is maintained as an Access database.
The second example is the collection information for the Marine
Biodiversity Processing Center of the Natural History Museum of
Los Angeles County:
http://collections.nhm.org/list.html
In this case, the source data is maintained in a Macintosh Filemaker
database. This more complex example has seven tables, covering the
characteristics of each collection, the condition of the specimens
through time, the updates to the label information, etc. By making
this information publicly available, rather than keeping it solely
as an internal document, the museum can inform curators about the
progress on their collections, and inform the public about what happens
to specimens in the Museum's collections.
Assumptions of the System
Several assumptions affected the design of this system. The database
maintenance software must generate comma- or tab-separated dumps
of the data. The interchange format is comma- (or tab-) separated
text; hence the system that maintains the database (probably a desktop
database or file) must be able to generate this style of output.
It is possible to sidestep this requirement if additional processing
on the server end can convert the dump into a comma-separated format.
I'll refer to these as "CSV files", although they
can be a bit more diverse than that.
The database is small. For simplicity, the entire database (all
records) is uploaded and inserted into the Web server's database
each time there is an update. Hence, the time this takes is a limiting
factor. Clearly, the speed of the processor, available memory, and
network speed all influence the practical limit. In one example,
we are running a single-processor, 233-MHz server with 128 MB of
RAM. It takes a few minutes to upload a 50,000-record database with
about 20 fields per record.
The Web-accessible version needs infrequent updates. Since the
entire database is uploaded with each change, very frequent updates
become unwieldy. Taking a minute or two each day or week to manually
update the Web version of the database is convenient; doing so every
10 minutes is impractical.
The Web-accessible version is read-only. Data flows in one direction
-- from a desktop or workstation to the Web-accessible database.
There is no provision for directly incorporating changes from the
Web; this system is therefore appropriate for "publishing"
data onto the Web, but it is inappropriate for any interactive Web
application where updates come from Web forms.
Solution Architecture
Data Path
The next few paragraphs provide an overview of the flow of data
through this system. The data starts in some sort of application
that probably resides on someone's desktop workstation: perhaps
an Access or Filemaker database, or even a carefully maintained
spreadsheet. The first step is to generate (on the desktop computer)
a CSV file that is a dump of the whole database contents.
In the next step, a simple Web form is used to upload that text
file to the Web server and initiate processing there. I'm currently
using a Perl CGI script, but there are few constraints on how this
must be done. It would even be reasonable to FTP the file there
and manually start the conversion program. The only important consideration
is to use the security features of the server to limit access to
this update method to authorized staff.
Next, on the server, a Perl program processes the data and inserts
it into the server's database. The program reads a configuration
file that describes the incoming data and the target database, then
reads the input CSV file and performs all necessary conversions.
The final home for this data is a database that's directly
accessible by the Web server, so the data is "visible"
to dynamic applications running on the Web server. There are a number
of databases that could fit the bill, ranging from simple and free
to complex and expensive. I'm currently using the MySQL database,
but little (if any) change would be needed to use any database supported
by Perl's DBI database interface system. As the data is converted,
it is inserted into a new table in the server's database. Upon
completion, the existing table (if any) is renamed to serve as a
backup, and the new one is renamed to replace it.
The CSV2db Perl Program
This program is the core of the system, responsible for all data
conversions as well as inserting the data into the Web-accessible
database. It is implemented as a small collection of Perl modules
that do all the real work. The CSV2db::Update module contains the
code for the main processing object. A new processing object is
created (with initial values setting various attributes), and a
file handle to the CSV data is handed to the table-processing routine.
The whole "wrapper" around the module could be as short
as:
#!/usr/bin/perl -w
use strict;
use IO::File;
use CSV2db::Update;
my $fh = IO::File->new('mytable.csv');
my $update =
    CSV2db::Update->new(
        configfile => 'mydatabase.config',
        dbname     => 'mydatabase',
        dbid       => 'dbwriter',
        dbpw       => 'WritePass');
$update->table('mytable', $fh);
In this example, the name of the database, the database user with
appropriate privileges to perform the update, and that user's
database password are all hard-coded as initialization arguments.
CSV2db::Update's "new" method does nothing but initialize
the object. The attributes can also be set using accessor notation
after the object is created, for example:
$update->dbname('mydatabase');
Other attributes that can be set include:
dbhost (defaults to localhost) -- The name of the database host computer.
dbdname (defaults to mysql) -- The Perl DBD driver to use when connecting to the database engine.
csvquotechar and csvfieldsep (default to the double-quote character and comma, respectively) -- Specify how to interpret the CSV file.
fake -- If set to a true value, allows all the processing to occur, but prevents the actual database update.
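For example, a trial run over a tab-separated dump could be set up as follows. This is only a sketch using the attributes documented above; the file names and credentials are placeholders:
my $update =
    CSV2db::Update->new(
        configfile   => 'mydatabase.config',
        dbname       => 'mydatabase',
        dbid         => 'dbwriter',
        dbpw         => 'WritePass',
        dbhost       => 'localhost',   # the database host
        dbdname      => 'mysql',       # the Perl DBD driver
        csvquotechar => '"',
        csvfieldsep  => "\t",          # tab- rather than comma-separated input
        fake         => 1);            # do everything except the database update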
The workhorse of the system is CSV2db::Update's "table"
method. As shown previously, it's called with an open filehandle
to the incoming CSV data. The interpretation of the incoming CSV
data is controlled by the configuration file that the CSV2db::Update
object reads. I'll cover that configuration file in more detail
later. Briefly, though, that file includes a list of all tables
in the database that can be updated. For each table, each incoming
field in the CSV file is named, its database type is determined,
indexing is specified, and any special processing for the field
is given.
It is also possible to specify pre-processing of the whole CSV
file in the configuration file. This can be particularly useful
to deal with line-ending conversions or to convert file formats
that are too far from a plausible CSV format to be read directly.
To achieve this pre-processing, the "table" method internally
filters the entire file through the desired routines into a new
temporary file.
Based on the specifications in the configuration file, the "table"
method creates SQL statements to properly create the new version
of the data table and goes ahead to create it (using a temporary
name, in case something goes awry). An appropriate SQL "INSERT"
statement is created to prepare for the data loading. Finally, record-by-record,
the CSV file is read and processed. Each field may optionally have
special processing performed on it (e.g., converting character sets,
fixing date formats, trimming blanks, etc.). Once the fields are
ready, each record is loaded into the new table.
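To make that concrete, the following sketch in plain DBI (not the actual CSV2db internals) shows roughly the statements generated for a table like the "mytable" example in the configuration file discussed below. It also includes the final rename that backs up the old table and puts the new one in its place; the temporary and backup table names are my own placeholders.
use DBI;

# Connect as the database user that has write privileges (placeholders throughout).
my $dbh = DBI->connect('DBI:mysql:database=mydatabase;host=localhost',
                       'dbwriter', 'WritePass', { RaiseError => 1 });

# Create the new version of the table under a temporary name.
$dbh->do(q{
    CREATE TABLE mytable_new (
        uid    VARCHAR(20) PRIMARY KEY,
        name   VARCHAR(200),
        birth  DATE,
        remark TEXT,
        INDEX (name),
        INDEX (birth),
        FULLTEXT (name, remark)
    )
});

# Prepare a placeholder INSERT, to be executed once per processed CSV record.
my $sth = $dbh->prepare(
    'INSERT INTO mytable_new (uid, name, birth, remark) VALUES (?, ?, ?, ?)');
# ... $sth->execute(@fields) for each converted record ...

# On success, keep the old table as a backup and slide the new one into place.
$dbh->do('DROP TABLE IF EXISTS mytable_backup');
$dbh->do('RENAME TABLE mytable TO mytable_backup, mytable_new TO mytable');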
Three ancillary modules are currently defined that provide routines
useful for specific pre-processing or field-level processing. The
CSV2db::Filemaker module (Listing 1) provides short routines useful
for converting dates as exported by Filemaker into SQL-acceptable
dates, as well as converting some characters internal to Filemaker
records into more useful versions. The CSV2db::Mac module (Listing
2) contains the routine (macroman2iso) that does a conversion
from the Macintosh-specific character set to an approximation of
the ISO-8859-1 character set. Finally, the CSV2db::Util module (Listing
3) contains a few other routines useful for fixing line endings,
trimming blanks, and testing. (All listings for this article are
available for download at: http://www.sysadminmag.com/code/.)
The Configuration File
The configuration file is at the heart of this system. It is here
that the incoming CSV datafile is described, the table fields are
named and described, database indexing is specified, and any additional
processing is prescribed. For maximum flexibility, I chose the strategy
that the configuration file is a parseable chunk of Perl code. This
avoids the problem of developing a specialized parser. It also cleanly
enables a facility for including code directly in the configuration
file. A short example follows:
{
 preprocess  => [],
 postprocess => [\&CSV2db::Mac::macroman2iso],
 options     => {fake    => 1,
                 verbose => 1,},
 tables      => {
   # tablename =>
   #   [[fieldname, fieldtype, index, fulltext, proc], ...]
   mytable =>
     [["uid",    "varchar(20) primary key", 0, 0],
      ["name",   "varchar(200)",            1, 1],
      ["birth",  "date",                    1, 0,
         \&CSV2db::Filemaker::date2sql],
      ["remark", "text",                    0, 1],
     ],
   another_table =>
     [["Taxon",     "varchar(200)", 1, 0,
         sub {$_[0] = uc $_[0]}],
      ["Locality",  "text",         0, 0],
      ["Collector", "",             0, 0],
      ["Depth",     "double",       1, 0],
     ],
 },
}
The opening and closing braces are there just for Perl's parsing
happiness. Within them, four keys are permitted in a configuration
file. The "preprocess" key (empty in this case) signifies
an array of pointers to routines that can pre-process the entire incoming
CSV datafile. These routines are called in the order listed; each receives
an open filehandle to the CSV data as its first argument and, as its second
argument, an open filehandle to a new temporary file that it should fill
with the processed data.
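A pre-processing routine can therefore be quite short. The following sketch (a hypothetical routine in the spirit of CSV2db::Util, not the published code) converts Macintosh-style carriage-return line endings to newlines while copying the file:
# Copy the CSV data from $infh to $outfh, converting carriage returns
# (Macintosh line endings) or CRLF pairs into plain newlines.
sub mac_line_endings {
    my ($infh, $outfh) = @_;
    local $/;                  # slurp the whole (small) file at once
    my $text = <$infh>;
    $text =~ s/\r\n?/\n/g;
    print $outfh $text;
}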
The "postprocess" key signifies an array of pointers
to routines that are run on every single field in the incoming CSV
data before it's inserted into the database. In this example,
a routine that converts from the Macintosh character set to an ISO-8859-1
character set is specified. Any routine specified here will be given
the field contents as a single argument and is expected to return
the digested field contents as a single return value.
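Such routines can be very small. A blank-trimming filter, for example, might be written like the following sketch (hypothetical; the published CSV2db::Util version may differ):
# Trim leading and trailing whitespace from one field value, following the
# single-argument-in, single-value-out convention described above.
sub trim_blanks {
    my ($field) = @_;
    return '' unless defined $field;
    $field =~ s/^\s+//;
    $field =~ s/\s+$//;
    return $field;
}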
The "options" key signifies a hash of option settings.
These provide another way to modify the attributes of the CSV2db::Update
object. Finally, the "tables" key signifies a hash of
keys, one per database table to be specified. Each key here is the
name of a table and points to an array of field definitions. Each
field definition is, itself, an array. The first element is the
name of the field. Next is the database type that this field should
have (these values, of course, may be database-engine dependent).
If the database-type field is empty, this field will not be included
in the database. The third value, if true, indicates that this field
should be indexed in the database. The fourth value, if true, indicates
that (if the database engine supports it) the field should be part
of a special full-text index for the table. The fifth (and optional)
value is a pointer to a field-specific processing routine.
With these field-specific processing routines (and their related
"postprocess" routines), things get interesting. Since
we are allowing Perl to parse the configuration file, we can either
insert a pointer to an existing routine (as in the "birth"
field above), or insert the actual code as an anonymous subroutine
(as in the "Taxon" field, where a code snippet is used
to force the field to uppercase on the fly). This capability allows
us to easily take care of a range of problems. Minor "fix-ups"
to data formats can be coded as routines and saved in modules (like
"Mac.pm" and "Filemaker.pm") to be called as
needed. Alternatively, short (or long) conversion routines that
are particular to one application's data can be inserted directly
into the configuration file, avoiding the complexity of carrying
around modules of limited usefulness.
Simple Web Interface for the CSV-File Uploads
The last element in the system is the front-end presented to the
database maintainer for uploading the CSV files to the system. It
is convenient to implement this as a CGI program presenting a Web
page that allows for file uploads. For security reasons, the Web
server must enforce access controls on this page, so that only authorized
users can use the page.
There's nothing complex about this part of the system --
it simply presents the user with a set of file upload form elements
(one for each table of the relevant database). The number of tables
and their names are taken from the same configuration file that the CSV2db
program reads. Once a file is uploaded, the CGI program is responsible for
initializing the CSV2db::Update (Listing 4) object with the appropriate
parameters, and calling the "table" method to process
the data (as shown in the short "wrapper" example above).
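A minimal version of such a CGI program, handling a single table, might look like the sketch below. This is only a sketch, not the production program; the table name, configuration file, and credentials are placeholders.
#!/usr/bin/perl -w
use strict;
use CGI;
use CSV2db::Update;

my $q = CGI->new;
print $q->header;

if (my $fh = $q->upload('mytable')) {
    # A file arrived: feed it to CSV2db::Update exactly as in the wrapper above.
    my $update = CSV2db::Update->new(
        configfile => 'mydatabase.config',
        dbname     => 'mydatabase',
        dbid       => 'dbwriter',
        dbpw       => 'WritePass');
    $update->table('mytable', $fh);
    print $q->p('Table "mytable" has been updated.');
} else {
    # No upload yet: present the file-upload form.
    print $q->start_multipart_form,
          $q->p('CSV file for "mytable": ', $q->filefield(-name => 'mytable')),
          $q->submit(-value => 'Upload'),
          $q->end_form;
}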
Now that I've reviewed the flow of data and the structure
of the conversion process, I'll describe a few of the issues
that I had to tackle. Some of them were obvious from the start,
but others were encountered along the way.
Coping with "Comma-Separated Text" Files
There is no standard for comma-separated text formats -- it's
a pragmatic format that is partially reinvented by every application.
By using the Text::CSV_XS module by Jochen Wiedmann (available through
the CPAN at http://search.cpan.org), most problems were sidestepped.
That pre-existing code takes care of variations in the quote and
field-separator characters (configured in CSV2db using the csvquotechar
and csvfieldsep settings). More interesting, though, was
the discovery that some programs generate CSV files that have newlines
embedded within fields, or that decide to be "helpful"
and wrap long records into multiple lines. The Text::CSV_XS module,
tolerant though it is, does insist on one record per line. To handle
that problem, a fairly intricate piece of code was developed that
continues reading data in the event of an incomplete field or record.
Split fields are, in effect, glued back together; split records
are reassembled, and the location in the file is checked for synchronization
with the beginning of the next line or record.
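The essential idea is simpler than the production code, which has more corner cases. A stripped-down sketch (not the actual CSV2db code) looks like this:
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    quote_char => '"',    # i.e., csvquotechar
    sep_char   => ',',    # i.e., csvfieldsep
    binary     => 1,      # tolerate embedded newlines and 8-bit characters
});

# Read one logical record, gluing physical lines back together until the
# accumulated text parses as a single CSV record.
sub read_record {
    my ($fh) = @_;
    my $buffer = <$fh>;
    return unless defined $buffer;
    until ($csv->parse($buffer)) {
        my $more = <$fh>;
        return unless defined $more;   # genuinely malformed input: give up
        $buffer .= $more;              # the record continued on the next line
    }
    return [ $csv->fields() ];
}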
Character-Set Translation
Some of our data originates in database files on Macintosh computers.
The Macintosh operating system uses a character encoding known most
commonly as "MacRoman", rather than the more common ISO-8859-1
character set. (When the Macintosh character set was devised, there
really weren't any international standards for character sets;
the ISO standard came later.) Glyphs in MacRoman
other than standard low-order ASCII will be misrepresented by other
systems. This problem, of course, is not restricted to Macintoshes,
but applies to any situation where the character set of the original
data differs from that of the target database.
In some cases the solution is simple, for example, when there
is a one-to-one mapping of all glyphs in the original character
set to glyphs in the target character set. In the case of MacRoman,
however, glyphs exist that have no representation in the target
character set (ISO-8859-1). The solution chosen was to use a routine
that performs a "pragmatic" translation. For the "missing"
characters, a textual equivalent is generated. Examples include
substituting the text "infinity" for the infinity symbol
and "(TM)" for the trademark symbol.
Other "Fix-Ups"
Depending on the source, numerous minor processing steps may be
needed to preen the data into final format. The ability to specify
arbitrary processing on a field-wise (or even file-wise) basis is
an indispensable feature of the configuration file.
A few examples come from using Filemaker as a data source. Filemaker
encodes newlines embedded within text fields as control-K (ASCII
character 11) characters. Fields where this may be an issue must
have those characters translated into legitimate newline characters
before input into the online database. Similarly, Filemaker uses
a control-K separated list of words to represent a set of multiple
checkboxes in a single field. By translating the control-K characters
to commas in this case, those fields can be used as "SET"-type
data for MySQL.
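Both fix-ups are one-liners in the field-processing framework. The sketches below use hypothetical names and are not necessarily how the published CSV2db::Filemaker module does it:
# Embedded returns in Filemaker text fields arrive as control-K (ASCII 11);
# turn them into genuine newlines.
sub ctrlk_to_newline {
    my ($field) = @_;
    $field =~ s/\x0B/\n/g;
    return $field;
}

# Multiple-checkbox fields arrive as control-K separated lists; commas make
# them loadable as MySQL "SET" values.
sub ctrlk_to_comma {
    my ($field) = @_;
    $field =~ s/\x0B/,/g;
    return $field;
}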
Dates and times often need reworking. The translation routine
applied in this case relies on the very permissive str2time
routine from Perl's Date::Parse module (available on CPAN).
It will correctly read a broad variety of dates. Once read, a date
can then be processed into the desired "YYYY-MM-DD" format.
Times represented in AM/PM style can be processed into the desired
24-hour format.
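A date converter built this way can be compact. The sketch below follows that approach, although it is not necessarily identical to the date2sql routine of Listing 1:
use Date::Parse qw(str2time);
use POSIX       qw(strftime);

# Accept nearly any date string str2time understands and return it in
# SQL-friendly "YYYY-MM-DD" form; return an empty string if unparseable.
sub date2sql {
    my ($field) = @_;
    my $epoch = str2time($field);
    return defined $epoch ? strftime('%Y-%m-%d', localtime($epoch)) : '';
}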
MySQL "SET"-Type Fields
As mentioned briefly earlier, we take advantage of a MySQL extension
to standard SQL -- a "SET" data type. Quoting from
the MySQL documentation, "A SET is a string object that can
have zero or more values, each of which must be chosen from a list
of allowed values specified when the table is created." Storage
and retrieval of SET values is particularly efficient in space and
time.
What's the problem? The need to know the allowed values at
the time the table is created. All values that are used occur in
the incoming data file, so it seems unfortunate to require that
they be specified in the configuration file as well. The solution
adopted is to simply read the incoming data twice (in the event
of any SET definitions). The first pass is used solely to accumulate
lists of allowed values for each SET field so that the appropriate
commands can be issued during the table creation. The second pass
inserts records into the database. This solution accepts an increase in
initial processing time in exchange for eliminating the error-prone requirement
that all allowed values be specified in the configuration file.
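The first pass amounts to a value-collection loop followed by a column definition. The sketch below, with hypothetical field and column names, shows the idea; CSV2db performs the equivalent internally for each SET field:
# Given parsed CSV records (as array references) and the index of a SET-type
# column, return an SQL column definition listing every value seen in the data.
sub set_column_definition {
    my ($colname, $colindex, @records) = @_;
    my %seen;
    for my $record (@records) {
        # A single field may carry several comma-separated values.
        $seen{$_}++ for split /,/, ($record->[$colindex] || '');
    }
    return sprintf "%s SET(%s)",
        $colname, join(',', map { "'$_'" } sort keys %seen);
}
# For example, set_column_definition('condition', 3, @records) might yield:
#   condition SET('dried','fluid','slide')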
Drawbacks and Advantages
This approach (creating online-accessible databases from CSV dumps
of desktop data) has the limitation of not directly accommodating
input from the Web, or indeed from any source other than the original
data repository. It is a publication-oriented method, not a way
to create a multiple-source interactive database. Its scalability
is also limited by the speed of transmitting and processing the
entire database on each update. One advantage, however, is the system's
flexibility to accommodate a broad range of input sources. Those
sources are not required to be available on the network at all times,
and the input data format is quite flexible. This accommodates users'
favored tools rather than imposing a single solution. In many cases
we've tackled, there would have been no chance to get the data
onto the Web if we insisted that the users recreate their data sources
in some other application.
A related feature is the simplicity, from the users' perspective,
of getting their data to the Web. They have control of when updates
occur, and the update process is as simple as generating a data
dump with their workstations and uploading files on a simple Web
browser form. All the picky work is done on the server end.
Isolating the users' data sources from the network-available
version of the database neatly sidesteps a raft of security
problems. Certainly, the online version of the database should
be protected from malicious or inadvertent corruption. However,
corruption of the online version has no effect on the core source
data. This accommodates environments where users cannot be compelled
to maintain high security on personal servers, or where their operating
systems may be inherently non-secure. As long as the source data
is backed up, it remains safe regardless of whether it is
also available on the Web. For the limited but common cases where
this approach is applicable, we've found it to work very well.
Dean has worked with UNIX in a variety of academic research
settings since 1983. He currently splits his time between research
in marine biomechanics and academic systems administration. He can
be reached at dean2@biol.sc.edu, and will make updated
versions of this program available at http://collections.nhm.org/code.