Article

Converting Data between UNIX and Windows

Ed Schaefer and Charles Leonard

With the further entrenching of client-server RDBMS software development, system administrators can expect developers to request data conversion utilities between the Windows client and the UNIX server and back again. While most industrial-strength databases provide ASCII export utilities, the format of the exports will most certainly differ. This article presents a C utility, convert.c (Listing 1), which converts character-delimited ASCII files to the comma-separated value (CSV) format and back again.

The ASCII files being converted are from an Informix database export on the server. The database used on the Windows client is Gupta's SQLBase. The utility has been designed to be run under UNIX or DOS/Windows.

CSV and Character-Delimited Formats

Betraying their PC origin, databases under Windows generally use the comma separated value format (CSV). In CSV format, character-string fields are surrounded by double quotes delimited by a comma while date, datetime, and numerical data types are just comma delimited.

Under UNIX, Informix database unloads are character field delimited. All fields, regardless of data type, are delimited with a single character. This character by default is the pipe symbol (|).

Given a schema for table tcf:

create table tcf
(
col1 char(10),
col2 datetime year to second,
col3 decimal(6,2)
};

An example pipe-symbol-delimited data record would be:

abc|1995-09-02 13:29:18|55.34

and the corresponding CSV example would be:

"abc",1995-09-02 13:29:18,55.34

Unfortunately, in order to change to CSV format, the schema of the table being converted must be known. If the column data type is not known, it is difficult to determine which data should be surrounded by quotes and which should not.

The Informix ASCII Export

Informix's ASCII export creates a directory with an SQL script to recreate each database table and a corresponding ASCII data file for each table. An example excerpt from the SQL file would look like this:

{ TABLE "root".tcf row size = 22 number of columns = 3 }
{ unload file name = tcf___121.unl number of rows = 1 }
create table tcf
(
col1 char(10),
col2 datetime year to second,
col3 decimal(6,2)
};

Pseudo Code for the Conversion Utility

Given the above information about the export, the pseudo code for convert.c is:

1) Open the SQL schema file passed as argument 1.

2) Initialize the dynamic array. Each table column is an element in a dynamic array. Since the array is dynamic, table size is limited only by available memory.

3) Read each line in the schema file searching for the TABLE keyword. Perform steps 3 through 9 for each TABLE defined in the schema file.

4) Continue to read the schema file searching for the unload keyword to determine the datafile (table) name.

5) After determining the table name, continue reading the schema file, parsing each column line by type and storing it in the dynamic array.

6) Open the datafile and determine whether the conversion is UNIX-to-CSV or the opposite.

7) Read each data record of the datafile.

8) For each data record:

a) Parse each column using the correct delimiter.

b) Create a new column appending it to a new output record.

c) Write it out to an intermediate file.

9) Rename the intermediate file to the original data file name.

10) Close the SQL schema file.

Program Description

Since convert.c basically just performs tedious string handling, we'll only discuss certain caveats and interesting points of the algorithm.

Step 6 determines whether the conversion is UNIX-to-CSV or CSV-to-DOS. The code reads the first line of the data file and if the UNIX delimiter, by default the pipe symbol (|), exists, the conversion is UNIX-to-CSV.

The standard C library function used to parse a string is strtok(). Unfortunately, strtok() doesn't provide for null or undefined values. Given the following string, with the comma being the delimiter:

a,,string,of,tokens

strtok() will return only four values, since the second null value will be folded into the previous token. Convert provides a replacement tokenizing function called xstrtok. Operating on the string above, xstrtok() returns five tokens, with the second token being null.

While the character and decimal data types are largely self-explanatory, the date and datetime datatypes are not. The date format can vary widely among database vendors. Convert assumes the Informix format is of the type MM/DD/YY. The Windows SQLBASE date format is YYYY-MM-DD.

The datetime data type in Informix is variable length, with the length depending on whether the definition is year-to-minute or year-to-second.

An example datetime year-to-second would be:

1995-09-03 13:29:18

And an example datetime year-to-minute would be:

1995-09-03 13:29

The same example datetime under the Windows SQLBase is always carried to the nearest micro-second:

1995-09-03-13.29.18.000000

Since this utility runs under both Windows and UNIX, note the different commands needed to rename the intermediate file.

Conclusion

For this article we've presented a utility to convert between UNIX field-delimited and DOS comma-separated value. We've also looked at some of the pitfalls, such as converting date and datetime datatypes.

While convert.c performs its assigned task, a commercial tool that will move data between the different file types and databases is available. See the sidebar for a discussion of our experience with this product.

About the Authors

Ed Schaefer is a frequent contributer to Sys Admin. He is an Informix Database Administrator and UNIX administrator for jeTECH Data Systems of Moorpark, CA, where he develops Time and Attendance Software. He can be reached at olded@ix.netcom.com.

Charles Leonard is a software developer for jeTECH Data Systems of Moorpark, CA. His previous experience includes five years as a Software Engineer on the Deep Space Network at Jet Propulsion Laboratory in Pasadena, CA.