Converting Data between UNIX and Windows
Ed Schaefer and Charles Leonard
With the further entrenching of client-server RDBMS
software
development, system administrators can expect developers
to request data
conversion utilities between the Windows client and
the UNIX server and
back again. While most industrial-strength databases
provide ASCII
export utilities, the format of the exports will most
certainly differ.
This article presents a C utility, convert.c (Listing
1), which converts
character-delimited ASCII files to the comma-separated
value (CSV)
format and back again.
The ASCII files being converted are from an Informix
database export on
the server. The database used on the Windows client
is Gupta's SQLBase.
The utility has been designed to be run under UNIX or
DOS/Windows.
CSV and Character-Delimited Formats
Betraying their PC origin, databases under Windows generally
use the
comma separated value format (CSV). In CSV format, character-string
fields are surrounded by double quotes delimited by
a comma while date,
datetime, and numerical data types are just comma delimited.
Under UNIX, Informix database unloads are character
field delimited. All
fields, regardless of data type, are delimited with
a single character.
This character by default is the pipe symbol (|).
Given a schema for table tcf:
create table tcf
(
col1 char(10),
col2 datetime year to second,
col3 decimal(6,2)
};
An example pipe-symbol-delimited data record would be:
abc|1995-09-02 13:29:18|55.34
and the corresponding CSV example would be:
"abc",1995-09-02 13:29:18,55.34
Unfortunately, in order to change to CSV format, the
schema of the table
being converted must be known. If the column data type
is not known, it
is difficult to determine which data should be surrounded
by quotes and
which should not.
The Informix ASCII Export
Informix's ASCII export creates a directory with an
SQL script to
recreate each database table and a corresponding ASCII
data file for
each table. An example excerpt from the SQL file would
look like this:
{ TABLE "root".tcf row size = 22 number of columns = 3 }
{ unload file name = tcf___121.unl number of rows = 1 }
create table tcf
(
col1 char(10),
col2 datetime year to second,
col3 decimal(6,2)
};
Pseudo Code for the Conversion Utility
Given the above information about the export, the pseudo
code for
convert.c is:
1) Open the SQL schema file passed as argument 1.
2) Initialize the dynamic array. Each table column
is an element in a
dynamic array. Since the array is dynamic, table size
is limited only by
available memory.
3) Read each line in the schema file searching for
the TABLE keyword.
Perform steps 3 through 9 for each TABLE defined in
the schema file.
4) Continue to read the schema file searching for
the unload keyword
to determine the datafile (table) name.
5) After determining the table name, continue reading
the schema file,
parsing each column line by type and storing it in the
dynamic array.
6) Open the datafile and determine whether the conversion
is
UNIX-to-CSV or the opposite.
7) Read each data record of the datafile.
8) For each data record:
a) Parse each column using the correct delimiter.
b) Create a new column appending it to a new output
record.
c) Write it out to an intermediate file.
9) Rename the intermediate file to the original data
file name.
10) Close the SQL schema file.
Program Description
Since convert.c basically just performs tedious string
handling, we'll
only discuss certain caveats and interesting points
of the algorithm.
Step 6 determines whether the conversion is UNIX-to-CSV
or CSV-to-DOS.
The code reads the first line of the data file and if
the UNIX
delimiter, by default the pipe symbol (|), exists, the
conversion is
UNIX-to-CSV.
The standard C library function used to parse a string
is strtok().
Unfortunately, strtok() doesn't provide for null or
undefined values.
Given the following string, with the comma being the
delimiter:
a,,string,of,tokens
strtok() will return only four values, since the second
null value will
be folded into the previous token. Convert provides
a replacement
tokenizing function called xstrtok. Operating on the
string above,
xstrtok() returns five tokens, with the second token
being null.
While the character and decimal data types are largely
self-explanatory,
the date and datetime datatypes are not. The date format
can vary widely
among database vendors. Convert assumes the Informix
format is of the
type MM/DD/YY. The Windows SQLBASE date format is YYYY-MM-DD.
The datetime data type in Informix is variable length,
with the length
depending on whether the definition is year-to-minute
or year-to-second.
An example datetime year-to-second would be:
1995-09-03 13:29:18
And an example datetime year-to-minute would be:
1995-09-03 13:29
The same example datetime under the Windows SQLBase
is always carried to
the nearest micro-second:
1995-09-03-13.29.18.000000
Since this utility runs under both Windows and UNIX,
note the different
commands needed to rename the intermediate file.
Conclusion
For this article we've presented a utility to convert
between UNIX
field-delimited and DOS comma-separated value. We've
also looked at some
of the pitfalls, such as converting date and datetime
datatypes.
While convert.c performs its assigned task, a commercial
tool that will
move data between the different file types and databases
is available.
See the sidebar for a discussion of our experience with
this product.
About the Authors
Ed Schaefer is a frequent contributer to Sys Admin.
He is an Informix
Database Administrator and UNIX administrator for jeTECH
Data Systems of
Moorpark, CA, where he develops Time and Attendance
Software. He can be
reached at olded@ix.netcom.com.
Charles Leonard is a software developer for jeTECH Data
Systems of
Moorpark, CA. His previous experience includes five
years as a Software
Engineer on the Deep Space Network at Jet Propulsion
Laboratory in
Pasadena, CA.
|