Small Databases: From Desktop to Web Server
N. Dean Pentcheff
Some databases are large and dynamic, demanding an integrated
system where the Web front-end and the database back-end share a
common data store. However, many small databases are updated infrequently
and need no updates from the Web at all (examples include
phone lists and reference information). Typically, those small databases
have been cobbled together on desktop systems using a variety of
software, such as Access, Filemaker, and Excel. Because those maintaining
the database often lack experience in information technology, it
is desirable to simplify these updates as much as possible.
It is certainly possible to make "live" connections
from a Web server to some desktop packages so that the data on the
internal network is accessible through a Web front end. Several
factors make this approach impractical in many cases: technical
difficulties of engineering ODBC-to-UNIX connectivity, security
concerns of a direct network connection to a personal workstation,
and the requirement that the database server must be running for
the data to be available. In practice, it's sometimes easier
to provide a complete copy of the data to the Web server than
to link dynamically from the Web server to the database on the
internal network.
The system described in this article allows users to informally
develop and maintain a small database using the desktop tools they
prefer and automatically copy this data to a Web-accessible version
of the database on the Web server. The basis of the system is to
use comma- or tab-separated dumps of the entire database ("CSV"
files) to update the Web-accessible version. The core is a Perl
program. This program reads a configuration file that describes
the data incoming from the desktop machine (uploaded via a Web form)
and updates the Web-accessible version in a MySQL (or similar) database.
This system moves the data from the user's desktop database
to the UNIX SQL-capable database. It is not involved with the actual
Web presentation of the data -- that is assumed to be performed
by other systems on the Web server. Here I will focus solely on
making it easy for a small-scale database manager to get his or
her data into a database so that a Web server can present it.
Examples
A couple of examples may help to illustrate the types of applications
where this approach is useful. In the first example, a small desktop
database containing a technical glossary is made available on the
Web in a searchable form (rather than in its previous incarnation
as a word-processing document):
http://crustacea.nhm.org/glossary
In this case, there are two tables (one holding the glossary terms,
definitions, cross-references, and synonyms; and another holding the
reference authorities). This database is largely "read-only"
and is updated only occasionally by the maintainers. On the desktop,
it is maintained as an Access database.
The second example is the collection information for the Marine
Biodiversity Processing Center of the Natural History Museum of
Los Angeles County:
http://collections.nhm.org/list.html
In this case, the source data is maintained in a Macintosh Filemaker
database. This more complex example has seven tables, covering the
characteristics of each collection, the condition of the specimens
through time, the updates to the label information, etc. By making
this information publicly available, rather than keeping it solely
as an internal document, the museum can inform curators about the
progress on their collections, and inform the public about what happens
to specimens in the Museum's collections.
Assumptions of the System
Several assumptions affected the design of this system. The database
maintenance software must generate comma- or tab-separated dumps
of the data. The interchange format is comma- (or tab-) separated
text; hence the system that maintains the database (probably a desktop
database or file) must be able to generate this style of output.
It is possible to sidestep this requirement if additional processing
on the server end can convert the dump into a comma-separated format.
I'll refer to these as "CSV files", although they
can be a bit more diverse than that.
The database is small. For simplicity, the entire database (all
records) is uploaded and inserted into the Web server's database
each time there is an update. Hence, the time this takes is a limiting
factor. Clearly, the speed of the processor, available memory, and
network speed all influence the practical limit. In one example,
we are running a single-processor, 233-MHz server with 128 MB of
RAM. It takes a few minutes to upload a 50,000-record database with
about 20 fields per record.
The Web-accessible version needs infrequent updates. Since the
entire database is uploaded with each change, very frequent updates
become unwieldy. Taking a minute or two each day or week to manually
update the Web version of the database is convenient; doing so every
10 minutes is impractical.
The Web-accessible version is read-only. Data flows in one direction
-- from a desktop or workstation to the Web-accessible database.
There is no provision for directly incorporating changes from the
Web; this system is therefore appropriate for "publishing"
data onto the Web, but it is inappropriate for any interactive Web
application where updates come from Web forms.
Solution Architecture
Data Path
The next few paragraphs provide an overview of the flow of data
through this system. The data starts in some sort of application
that probably resides on someone's desktop workstation: perhaps
an Access or Filemaker database, or even a carefully maintained
spreadsheet. The first step is to generate (on the desktop computer)
a CSV file that is a dump of the whole database contents.
In the next step, a simple Web form is used to upload that text
file to the Web server and initiate processing there. I'm currently
using a Perl CGI script, but there are few constraints on how this
must be done. It would even be reasonable to FTP the file there
and manually start the conversion program. The only important consideration
is to use the security features of the server to limit access to
this update method to authorized staff.
Next, on the server, a Perl program processes the data and inserts
it into the server's database. The program reads a configuration
file that describes the incoming data and the target database, then
reads the input CSV file and performs all necessary conversions.
The final home for this data is a database that's directly
accessible by the Web server, so the data is "visible"
to dynamic applications running on the Web server. There are a number
of databases that could fit the bill, ranging from simple and free
to complex and expensive. I'm currently using the MySQL database,
but little (if any) change would be needed to use any database supported
by Perl's DBI database interface system. As the data is converted,
it is inserted into a new table in the server's database. Upon
completion, the existing table (if any) is renamed to serve as a
backup, and the new one is renamed to replace it.
The CSV2db Perl Program
This program is the core of the system, responsible for all data
conversions as well as inserting the data into the Web-accessible
database. It is implemented as a small collection of Perl modules
that do all the real work. The CSV2db::Update module contains the
code for the main processing object. A new processing object is
created (with initial values setting various attributes), and a
file handle to the CSV data is handed to the table-processing routine.
The whole "wrapper" around the module could be as short
as:
#!/usr/bin/perl -w
use strict;
use IO::File;
use CSV2db::Update;
my $fh = IO::File->new('mytable.csv');
my $update =
    CSV2db::Update->new(
        configfile => 'mydatabase.config',
        dbname     => 'mydatabase',
        dbid       => 'dbwriter',
        dbpw       => 'WritePass');
$update->table('mytable', $fh);
In this example, the name of the database, the database user with
appropriate privileges to perform the update, and that user's
database password are all hard-coded as initialization arguments.
CSV2db::Update's "new" method does nothing but initialize
the object. The attributes can also be set using accessor notation
after the object is created, for example:
$update->dbname('mydatabase');
Other attributes that can be set include:
dbhost (defaults to localhost) -- The name of the database host computer.
dbdname (defaults to mysql) -- The Perl DBD driver to use when connecting to the database engine.
csvquotechar and csvfieldsep (default to the double-quote character and comma, respectively) -- Specify how to interpret the CSV file.
fake -- If set to a true value, allows all the processing to occur, but prevents the actual database update.
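For example, a trial run over a tab-separated dump could be set up as follows. This is only a sketch using the attributes documented above; the file names and credentials are placeholders:
my $update =
    CSV2db::Update->new(
        configfile   => 'mydatabase.config',
        dbname       => 'mydatabase',
        dbid         => 'dbwriter',
        dbpw         => 'WritePass',
        dbhost       => 'localhost',   # the database host
        dbdname      => 'mysql',       # the Perl DBD driver
        csvquotechar => '"',
        csvfieldsep  => "\t",          # tab- rather than comma-separated input
        fake         => 1);            # do everything except the database update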
The workhorse of the system is CSV2db::Update's "table"
method. As shown previously, it's called with an open filehandle
to the incoming CSV data. The interpretation of the incoming CSV
data is controlled by the configuration file that the CSV2db::Update
object reads. I'll cover that configuration file in more detail
later. Briefly, though, that file includes a list of all tables
in the database that can be updated. For each table, each incoming
field in the CSV file is named, its database type is determined,
indexing is specified, and any special processing for the field
is given.
It is also possible to specify pre-processing of the whole CSV
file in the configuration file. This can be particularly useful
to deal with line-ending conversions or to convert file formats
that are too far from a plausible CSV format to be read directly.
To achieve this pre-processing, the "table" method internally
filters the entire file through the desired routines into a new
temporary file.
Based on the specifications in the configuration file, the "table"
method creates SQL statements to properly create the new version
of the data table and goes ahead to create it (using a temporary
name, in case something goes awry). An appropriate SQL "INSERT"
statement is created to prepare for the data loading. Finally, record-by-record,
the CSV file is read and processed. Each field may optionally have
special processing performed on it (e.g., converting character sets,
fixing date formats, trimming blanks, etc.). Once the fields are
ready, each record is loaded into the new table.
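To make that concrete, the following sketch in plain DBI (not the actual CSV2db internals) shows roughly the statements generated for a table like the "mytable" example in the configuration file discussed below. It also includes the final rename that backs up the old table and puts the new one in its place; the temporary and backup table names are my own placeholders.
use DBI;

# Connect as the database user that has write privileges (placeholders throughout).
my $dbh = DBI->connect('DBI:mysql:database=mydatabase;host=localhost',
                       'dbwriter', 'WritePass', { RaiseError => 1 });

# Create the new version of the table under a temporary name.
$dbh->do(q{
    CREATE TABLE mytable_new (
        uid    VARCHAR(20) PRIMARY KEY,
        name   VARCHAR(200),
        birth  DATE,
        remark TEXT,
        INDEX (name),
        INDEX (birth),
        FULLTEXT (name, remark)
    )
});

# Prepare a placeholder INSERT, to be executed once per processed CSV record.
my $sth = $dbh->prepare(
    'INSERT INTO mytable_new (uid, name, birth, remark) VALUES (?, ?, ?, ?)');
# ... $sth->execute(@fields) for each converted record ...

# On success, keep the old table as a backup and slide the new one into place.
$dbh->do('DROP TABLE IF EXISTS mytable_backup');
$dbh->do('RENAME TABLE mytable TO mytable_backup, mytable_new TO mytable');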
Three ancillary modules are currently defined that provide routines
useful for specific pre-processing or field-level processing. The
CSV2db::Filemaker module (Listing 1) provides short routines useful
for converting dates as exported by Filemaker into SQL-acceptable
dates, as well as converting some characters internal to Filemaker
records into more useful versions. The CSV2db::Mac module (Listing
2) contains the routine (macroman2iso) that does a conversion
from the Macintosh-specific character set to an approximation of
the ISO-8859-1 character set. Finally, the CSV2db::Util module (Listing
3) contains a few other routines useful for fixing line endings,
trimming blanks, and testing. (All listings for this article are
available for download at: http://www.sysadminmag.com/code/.)
The Configuration File
The configuration file is at the heart of this system. It is here
that the incoming CSV datafile is described, the table fields are
named and described, database indexing is specified, and any additional
processing is prescribed. For maximum flexibility, I chose the strategy
that the configuration file is a parseable chunk of Perl code. This
avoids the problem of developing a specialized parser. It also cleanly
enables a facility for including code directly in the configuration
file. A short example follows:
{
 preprocess  => [],
 postprocess => [\&CSV2db::Mac::macroman2iso],
 options     => {fake    => 1,
                 verbose => 1,},
 tables      => {
   # tablename =>
   #   [[fieldname, fieldtype, index, fulltext, proc], ...]
   mytable =>
     [["uid",    "varchar(20) primary key", 0, 0],
      ["name",   "varchar(200)",            1, 1],
      ["birth",  "date",                    1, 0,
         \&CSV2db::Filemaker::date2sql],
      ["remark", "text",                    0, 1],
     ],
   another_table =>
     [["Taxon",     "varchar(200)", 1, 0,
         sub {$_[0] = uc $_[0]}],
      ["Locality",  "text",         0, 0],
      ["Collector", "",             0, 0],
      ["Depth",     "double",       1, 0],
     ],
 },
}
The opening and closing braces are there just for Perl's parsing
happiness. Within them, four keys are permitted in a configuration
file. The "preprocess" key (empty in this case) signifies
an array of pointers to routines that can pre-process the entire incoming
CSV datafile. These routines are called in the order listed; each receives
an open filehandle to the CSV data as its first argument and, as its second
argument, an open filehandle to a new temporary file that it should fill
with the processed data.
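A pre-processing routine can therefore be quite short. The following sketch (a hypothetical routine in the spirit of CSV2db::Util, not the published code) converts Macintosh-style carriage-return line endings to newlines while copying the file:
# Copy the CSV data from $infh to $outfh, converting carriage returns
# (Macintosh line endings) or CRLF pairs into plain newlines.
sub mac_line_endings {
    my ($infh, $outfh) = @_;
    local $/;                  # slurp the whole (small) file at once
    my $text = <$infh>;
    $text =~ s/\r\n?/\n/g;
    print $outfh $text;
}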
The "postprocess" key signifies an array of pointers
to routines that are run on every single field in the incoming CSV
data before it's inserted into the database. In this example,
a routine that converts from the Macintosh character set to an ISO-8859-1
character set is specified. Any routine specified here will be given
the field contents as a single argument and is expected to return
the digested field contents as a single return value.
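Such routines can be very small. A blank-trimming filter, for example, might be written like the following sketch (hypothetical; the published CSV2db::Util version may differ):
# Trim leading and trailing whitespace from one field value, following the
# single-argument-in, single-value-out convention described above.
sub trim_blanks {
    my ($field) = @_;
    return '' unless defined $field;
    $field =~ s/^\s+//;
    $field =~ s/\s+$//;
    return $field;
}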
The "options" key signifies a hash of option settings.
These provide another way to modify the attributes of the CSV2db::Update
object. Finally, the "tables" key signifies a hash of
keys, one per database table to be specified. Each key here is the
name of a table and points to an array of field definitions. Each
field definition is, itself, an array. The first element is the
name of the field. Next is the database type that this field should
have (these values, of course, may be database-engine dependent).
If the database-type field is empty, this field will not be included
in the database. The third value, if true, indicates that this field
should be indexed in the database. The fourth value, if true, indicates
that (if the database engine supports it) the field should be part
of a special full-text index for the table. The fifth (and optional)
value is a pointer to a field-specific processing routine.
With these field-specific processing routines (and their related
"postprocess" routines), things get interesting. Since
we are allowing Perl to parse the configuration file, we can either
insert a pointer to an existing routine (as in the "birth"
field above), or insert the actual code as an anonymous subroutine
(as in the "Taxon" field, where a code snippet is used
to force the field to uppercase on the fly). This capability allows
us to easily take care of a range of problems. Minor "fix-ups"
to data formats can be coded as routines and saved in modules (like
"Mac.pm" and "Filemaker.pm") to be called as
needed. Alternatively, short (or long) conversion routines that
are particular to one application's data can be inserted directly
into the configuration file, avoiding the complexity of carrying
around modules of limited usefulness.
Simple Web Interface for the CSV-File Uploads
The last element in the system is the front-end presented to the
database maintainer for uploading the CSV files to the system. It
is convenient to implement this as a CGI program presenting a Web
page that allows for file uploads. For security reasons, the Web
server must enforce access controls on this page, so that only authorized
users can use the page.
There's nothing complex about this part of the system --
it simply presents the user with a set of file upload form elements
(one for each table of the relevant database). The number of tables
and their names are taken from the same configuration file that the CSV2db
program reads. Once a file is uploaded, the CGI program is responsible for
initializing the CSV2db::Update (Listing 4) object with the appropriate
parameters, and calling the "table" method to process
the data (as shown in the short "wrapper" example above).
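A minimal version of such a CGI program, handling a single table, might look like the sketch below. This is only a sketch, not the production program; the table name, configuration file, and credentials are placeholders.
#!/usr/bin/perl -w
use strict;
use CGI;
use CSV2db::Update;

my $q = CGI->new;
print $q->header;

if (my $fh = $q->upload('mytable')) {
    # A file arrived: feed it to CSV2db::Update exactly as in the wrapper above.
    my $update = CSV2db::Update->new(
        configfile => 'mydatabase.config',
        dbname     => 'mydatabase',
        dbid       => 'dbwriter',
        dbpw       => 'WritePass');
    $update->table('mytable', $fh);
    print $q->p('Table "mytable" has been updated.');
} else {
    # No upload yet: present the file-upload form.
    print $q->start_multipart_form,
          $q->p('CSV file for "mytable": ', $q->filefield(-name => 'mytable')),
          $q->submit(-value => 'Upload'),
          $q->end_form;
}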
Now that I've reviewed the flow of data and the structure
of the conversion process, I'll describe a few of the issues
that I had to tackle. Some of them were obvious from the start,
but others were encountered along the way.
Coping with "Comma-Separated Text" Files
There is no standard for comma-separated text formats -- it's
a pragmatic format that is partially reinvented by every application.
By using the Text::CSV_XS module by Jochen Wiedmann (available through
the CPAN at http://search.cpan.org), most problems were sidestepped.
That pre-existing code takes care of variations in the quote and
field-separator characters (configured in CSV2db using the csvquotechar
and csvfieldsep settings). More interesting, though, was
the discovery that some programs generate CSV files that have newlines
embedded within fields, or that decide to be "helpful"
and wrap long records into multiple lines. The Text::CSV_XS module,
tolerant though it is, does insist on one record per line. To handle
that problem, a fairly intricate piece of code was developed that
continues reading data in the event of an incomplete field or record.
Split fields are, in effect, glued back together; split records
are reassembled, and the location in the file is checked for synchronization
with the beginning of the next line or record.
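The essential idea is simpler than the production code, which has more corner cases. A stripped-down sketch (not the actual CSV2db code) looks like this:
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    quote_char => '"',    # i.e., csvquotechar
    sep_char   => ',',    # i.e., csvfieldsep
    binary     => 1,      # tolerate embedded newlines and 8-bit characters
});

# Read one logical record, gluing physical lines back together until the
# accumulated text parses as a single CSV record.
sub read_record {
    my ($fh) = @_;
    my $buffer = <$fh>;
    return unless defined $buffer;
    until ($csv->parse($buffer)) {
        my $more = <$fh>;
        return unless defined $more;   # genuinely malformed input: give up
        $buffer .= $more;              # the record continued on the next line
    }
    return [ $csv->fields() ];
}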
Character-Set Translation
Some of our data originates in database files on Macintosh computers.
The Macintosh operating system uses a character encoding known most
commonly as "MacRoman", rather than the more common ISO-8859-1
character set. (When the Macintosh character set was devised, there
really weren't any international standards for character sets;
the ISO standard came later.) Glyphs in MacRoman
other than standard low-order ASCII will be misrepresented by other
systems. This problem, of course, is not restricted to Macintoshes,
but applies to any situation where the character set of the original
data differs from that of the target database.
In some cases the solution is simple, for example, when there
is a one-to-one mapping of all glyphs in the original character
set to glyphs in the target character set. In the case of MacRoman,
however, glyphs exist that have no representation in the target
character set (ISO-8859-1). The solution chosen was to use a routine
that performs a "pragmatic" translation. For the "missing"
characters, a textual equivalent is generated. Examples include
substituting the text "infinity" for the infinity symbol
and "(TM)" for the trademark symbol.
Other "Fix-Ups"
Depending on the source, numerous minor processing steps may be
needed to preen the data into final format. The ability to specify
arbitrary processing on a field-wise (or even file-wise) basis is
an indispensable feature of the configuration file.
A few examples come from using Filemaker as a data source. Filemaker
encodes newlines embedded within text fields as control-K (ASCII
character 11) characters. Fields where this may be an issue must
have those characters translated into legitimate newline characters
before input into the online database. Similarly, Filemaker uses
a control-K separated list of words to represent a set of multiple
checkboxes in a single field. By translating the control-K characters
to commas in this case, those fields can be used as "SET"-type
data for MySQL.
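Both fix-ups are one-liners in the field-processing framework. The sketches below use hypothetical names and are not necessarily how the published CSV2db::Filemaker module does it:
# Embedded returns in Filemaker text fields arrive as control-K (ASCII 11);
# turn them into genuine newlines.
sub ctrlk_to_newline {
    my ($field) = @_;
    $field =~ s/\x0B/\n/g;
    return $field;
}

# Multiple-checkbox fields arrive as control-K separated lists; commas make
# them loadable as MySQL "SET" values.
sub ctrlk_to_comma {
    my ($field) = @_;
    $field =~ s/\x0B/,/g;
    return $field;
}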
Dates and times often need reworking. The translation routine
applied in this case relies on the very permissive str2time
routine from Perl's Date::Parse module (available on CPAN).
It will correctly read a broad variety of dates. Once read, a date
can then be processed into the desired "YYYY-MM-DD" format.
Times represented in AM/PM style can be processed into the desired
24-hour format.
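A date converter built this way can be compact. The sketch below follows that approach, although it is not necessarily identical to the date2sql routine of Listing 1:
use Date::Parse qw(str2time);
use POSIX       qw(strftime);

# Accept nearly any date string str2time understands and return it in
# SQL-friendly "YYYY-MM-DD" form; return an empty string if unparseable.
sub date2sql {
    my ($field) = @_;
    my $epoch = str2time($field);
    return defined $epoch ? strftime('%Y-%m-%d', localtime($epoch)) : '';
}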
MySQL "SET"-Type Fields
As mentioned briefly earlier, we take advantage of a MySQL extension
to standard SQL -- a "SET" data type. Quoting from
the MySQL documentation, "A SET is a string object that can
have zero or more values, each of which must be chosen from a list
of allowed values specified when the table is created." Storage
and retrieval of SET values is particularly efficient in space and
time.
What's the problem? The need to know the allowed values at
the time the table is created. All values that are used occur in
the incoming data file, so it seems unfortunate to require that
they be specified in the configuration file as well. The solution
adopted is to simply read the incoming data twice (in the event
of any SET definitions). The first pass is used solely to accumulate
lists of allowed values for each SET field so that the appropriate
commands can be issued during the table creation. The second pass
inserts records into the database. This solution accepts an increase in
initial processing time in exchange for eliminating the error-prone requirement
that all allowed values be specified in the configuration file.
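The first pass amounts to a value-collection loop followed by a column definition. The sketch below, with hypothetical field and column names, shows the idea; CSV2db performs the equivalent internally for each SET field:
# Given parsed CSV records (as array references) and the index of a SET-type
# column, return an SQL column definition listing every value seen in the data.
sub set_column_definition {
    my ($colname, $colindex, @records) = @_;
    my %seen;
    for my $record (@records) {
        # A single field may carry several comma-separated values.
        $seen{$_}++ for split /,/, ($record->[$colindex] || '');
    }
    return sprintf "%s SET(%s)",
        $colname, join(',', map { "'$_'" } sort keys %seen);
}
# For example, set_column_definition('condition', 3, @records) might yield:
#   condition SET('dried','fluid','slide')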
Drawbacks and Advantages
This approach (creating online-accessible databases from CSV dumps
of desktop data) has the limitation of not directly accommodating
input from the Web, or indeed from any source other than the original
data repository. It is a publication-oriented method, not a way
to create a multiple-source interactive database. Its scalability
is also limited by the speed of transmitting and processing the
entire database on each update. One advantage, however, is the system's
flexibility to accommodate a broad range of input sources. Those
sources are not required to be available on the network at all times,
and the input data format is quite flexible. This accommodates users'
favored tools rather than imposing a single solution. In many cases
we've tackled, there would have been no chance to get the data
onto the Web if we insisted that the users recreate their data sources
in some other application.
A related feature is the simplicity, from the users' perspective,
of getting their data to the Web. They have control of when updates
occur, and the update process is as simple as generating a data
dump with their workstations and uploading files on a simple Web
browser form. All the picky work is done on the server end.
Isolating the users' data sources from the network-available
version of the database neatly sidesteps a raft of security
problems. Certainly, the online version of the database should
be protected from malicious or inadvertent corruption. However,
corruption of the online version has no effect on the core source
data. This accommodates environments where users cannot be compelled
to maintain high security on personal servers, or where their operating
systems may be inherently non-secure. As long as the source data
is backed up, it remains safe regardless of whether it is
also available on the Web. For the limited but common cases where
this approach is applicable, we've found it to work very well.
Dean has worked with UNIX in a variety of academic research
settings since 1983. He currently splits his time between research
in marine biomechanics and academic systems administration. He can
be reached at dean2@biol.sc.edu, and will make updated
versions of this program available at http://collections.nhm.org/code.