Verifying Dates and Times for the Millennium
Lenard Morin and Ed Schaefer
Although the millennium bug seems to be on everybody's mind, most UNIX administrators might think it's more of a problem for our mainframe brethern. After all, UNIX and its associated modern Relational Database Management Systems (RDMS) have been on the cutting edge for solving problems such as this.
But what about the flat file ASCII loads that may update your RDMS? As an administrator, are you going to trust that data from an outside source is the correct format or are you going to verify it?
This article addresses a shell script, chkdate.sh, and an awk program, chkdate.awk, which validate and correct variable-length, field-delimited, date and datetime ASCII data. chkdate.sh parses the command-line options and calls a companion awk script. The awk script reads a configuration file, reads the input data file, validates each date and datetime field against a predefined mask, and writes the data to an output file.
As the data input file is read, each field is correctly verified character-by-character against a specified mask. Since the mask is an absolute check, the data of the field must follow the mask exactly. For example, if the mask is MM/DD/YY, data 1/1/99 will fail. Also, for the same mask, 01/01/1999 will fail, because the mask expects a two-digit year.
Given the same two-digit year mask, MM/DD/YY, after the data is verified for accuracy, the year is extended to four digits. Extending the date mask to four digits eliminates the millennium problem. Of course, the program must have some alogithm for determining whether a two-digit year is a 1900 year or a 2000 year. In this case, the program assumes the two-digit years 51-99 are 1900 dates (1951-1999) and the years 00-50 are 2000 dates (2000-2050). This scheme works well for our data, but you may wish to adapt the program to use a different algorithm depending on the nature of your own data.
If any errors occur while processing the mask, an entry is made in an error log file and the shell script terminates.
The chkdate.cmd Configuration File
For example, the following two data records need the second, fourth, and sixth fields verified for accuracy:
For greater flexibility, the field number to be verified and its associated mask (delimited by a comma) are stored in a configuration file, chkdate.cmd (Listing 1).
The masks for these three fields:
deliver the following output file:
The mask fields are:
M - Month
D - Day
Y - Year
h - Hour
m - Minute
s - Second
The third field of the configuration determines how null or undefined fields will be processed. Y'es allows null fields and N'o does not.
The chkdate.sh Shell Script
The chkdate.sh shell script (Listing 2) executes:
chkdate.sh [-F"Fs"] [-f awkfile] [-c cmdfile] [-i infile] [-o outfile]
where the command-line switches are:
-F value Field separator for data records # default |
-f name Awk program file name # default chkdate.awk
-c name Command file name # default chkdate.cmd
-i name Input file name # default data.in
-o name Output file name # default data.out
-h Usage message display
The while loop/case statements support the variable number of arguments. The shell shift command shifts arguments to the left ($2 to $1, $3 to $2, etc.) decrementing the number of variables, $#. Whitespace between the command-line switches is optional.
The field separator, -F, by default is the pipe symbol (|). When attempting to change the field separator to any of the shell's special characters:
; & ( ) | ^ < >
the file separator must be quoted:
chkdate.sh -F "\&"
The chkdate.awk awk Script
After parsing the command-line arguments, an accompanying awk script, Listing 3, processes the data corresponding to the following pseudocode:
1. Read the configuration file into an array.
Skipping any blank line or comment, the records in the configuration are parsed into a one-dimensional array, entry, where subscript 1 is the field number, subscript 2 is the mask, and subscript 3 is the optional required field flag.
Using the associative properties of awk arrays, the subscript 1 of entry becomes the index of the 2-dimensional cmd array. The mask and the required field flag are set to the cmd[x, 1] and cmd[x, 2], respectively.
Awk only emulates multi-dimensional arrays. So, in order to access individual elements of the array, the non-printable separator, SUBSEP, must be used as:
split(NFLD, fld, SUBSEP)
For each record in the input file and for each field to be validated:
2. If a required field is null, set an error.
3. If the lengths of mask and field value differ, set an error.
4. Concatenate characters to build date and time components according to the mask.
5. If the year is two-digit, extend the century to four characters.
The algorithm used to extend the century is: any year greater than 50 and less than or equal to 99 is 19, and everything else is 20.
6. If the date is not valid, set an error.
The is_leap_year function is the Gregorian calendar check for leap year. Leap year is a year evenly divisible by 4, provided the year is not divisible by 100, except when divisible by 400. Thus, 2000 is a leap year; whereas 1900 is not.
Earlier versions of awk do not perform automatic character-to-numeric conversions. To force a conversion, add the variable in question to itself and add a zero (0). See the function validate in the awk script.
7. If time is not valid, set an error. The time is strictly 24-hour clock. Midnight is 0000, not 2400.
8. If an error is set, write the record number, offending field number, and error text to the error log file and terminate.
9. If no error, write the record to the output file.
chkdate.sh is a modest attempt at verifying dates and datetimes. Although it handles our millennimum date verification problems, it may not handle yours. Some possible modifications are:
1. Determine another algorithm for extending the century. For example, consider a birthdate of 05/10/30. chkdate now extends the century to 05/10/2030, which is too far in the future to make sense. Shift the paradigm so that if the extended date in the previous century (1930) makes more sense, use it.
2. Expand single-digit month and day with a preceding zero (i.e., 1/1/1999 is changed to 01/01/1999).
3. Expand the datetime check beyond seconds. Some databases support 10,000th of a second.
4. Expand the time check to handle a.m. and p.m.
Aho, Alfred, Brian Kernighan, Peter Weinberger. 1988. The Awk Programming Language. Addison-Wesley.
Dougherty, Dale. Sed & Awk. 1991. O'Reilly & Associates.
Latham, Lance. 1998. Standard C Date/Time Library, Programming the World's Calendars and Clocks. Miller Freeman.
About the Author
Lenard Morin is a senior software consultant employed by Meridian Technology group of Portland, OR. Meridian provides consultants in the Pacfic Northwest and Arizona. He can be reached at: firstname.lastname@example.org.
Ed Schaefer is a senior programmer/analyst for Intel's Factory Information and Control System (IFICS) Group in Hillsboro, OR. His viewpoints on the millennium problem or any other subject in no way reflects Intel's position. He can be reached at: email@example.com.