Cover V05, I02

isascii: The Easy Way to Work around Binary Files

Steven G. Isaacson

I needed to find all occurrences of the string DBDATE in our shell scripts. Unfortunately, I did not have a list of files to change. All I knew was that several files would need to be changed and that these files resided in our local bin directory. Usually, identifying files that contain a given string is a simple task; you cd to the directory and grep for the string in question. The task is not so simple, however, when the directory contains binary as well as ASCII files.

Here's what can happen. When you grep for a string of text in a binary file, you may find what you're looking for. You may also find that your terminal suddenly locks up, or starts to beep, or decides to display everything in a hitherto unknown graphics mode.

The problem is that grep is the wrong tool for the job, because grep is line oriented and binary files are not. grep looks for regular expressions (such as DBDATE), and when it finds a line containing a match, it prints the entire line. In a binary file, that "line" consists of the match and whatever else happens to be in the file up to and including the next new-line character. It's those "whatever else happens to be in the file" characters, which can include terminal command sequences, that cause your terminal to misbehave.
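One way to see what grep is really sending to the screen, without letting the terminal act on it, is to pass the output through cat -v, which shows control characters in caret notation. The following is only a sketch: the function name show_matches is made up for illustration, and it assumes your cat supports -v (common, but not required by POSIX).

```shell
# show_matches PATTERN FILE...
# Hypothetical helper: run grep, but make non-printing characters visible
# (ESC is shown as ^[, for example) so the terminal does not interpret them.
show_matches() {
    grep "$@" | cat -v
}
```

For example, show_matches DBDATE * prints the matching lines with any escape sequences rendered harmlessly as text.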

Sometimes you do want to look for strings in binary files. For example, when trying to figure out which program is producing an error message, you can use the strings program to identify strings of text in binary files. The strings program prints strings of text in object or binary files, and defines a string as four or more printing characters ending with a new-line or null character.

To see whether DBDATE is compiled into binary.exe, you can execute:

strings binary.exe | grep DBDATE

Occasionally, I also use strings to poke through a core file, looking for clues about what caused the core dump. Sometimes the shell fails, sometimes awk is overtaxed. Checking the strings in a core file may give you an inkling about where to look next.

So what do you do if you're in a directory that contains both shell scripts and binary files and you want to avoid the binary files? One approach is to use the file command to determine whether each file is ASCII or not. The file command reads the magic number at the start of a file and identifies the file type. If file says "executable something," then that file is a binary file. If file says "text something," then you can grep or awk or sed it.

For example:

$ file /usr/bin/vi
/usr/bin/vi: s800 shared executable dynamically linked
$ file /etc/profile
/etc/profile: ascii text

Unfortunately, file has two drawbacks. First, it's one of those UNIX utilities, like chmod, chgrp, and ls, that can't read a list of filenames from stdin. Instead, you have to call the program over and over or else use xargs to avoid "arg list too long." Second, it's not infallible. I've seen binary programs identified as awk text.
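For comparison, the file-plus-xargs approach might look something like the sketch below. The function name text_files is hypothetical, the wording of file's descriptions varies from system to system, and filenames containing whitespace or a colon followed by a space will confuse it, so treat this as an illustration rather than a robust tool.

```shell
# text_files
# Hypothetical helper: read filenames on stdin (one per line) and print
# the names that file(1) describes as some kind of text.
text_files() {
    # xargs batches the names, so the argument list never gets too long.
    # sed strips the ": description" part and prints only those lines
    # whose description mentions "text".
    xargs file | sed -n 's/:[[:space:]].*[Tt]ext.*$//p'
}
```

Something like ls | text_files would then list the files that file believes are text, subject to the misidentifications mentioned above.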

I wanted a fast, easy, reliable way to distinguish binary files from ASCII files. I wanted a program I could pipe filenames to and get the names of ASCII files back. I wanted to, for example, cd to a directory and easily find out what was ASCII and what was not, simply by handing it the output from ls like this:

ls | isascii -

Once you have the list of ASCII files, the grepping is easy.

grep DBDATE `ls | isascii -`

And if you're using the Korn shell, you can pop down into command-line history, add -l to get just the filenames, tack vi on to the front of the results, and get to work.

vi $(grep -l DBDATE $(ls | isascii -))

The dollar sign and parentheses are Korn-shell syntax. Here's how to do it in the Bourne shell, where the inner pair of backquotes must be escaped with backslashes.

vi `grep -l DBDATE \`ls | isascii -\``

With a program like isascii, it's also fun to do a little investigating. For example, cd to /usr/bin and see which files are ASCII and which ones are not.

$ cd /usr/bin
$ ls | isascii -
auto
calendar
sccssdiff
spell
which

If you want to know how which works, take a look!

The Code

isascii (see Listing 1) is a simple program. It reads the specified file looking for non-ASCII characters (any byte with a value above 127) using the C library's isascii() function. As soon as a non-ASCII character is found, you have the answer: non-ASCII. Otherwise, you reach the end of the file, and you also have the answer: ASCII.
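The same test can be approximated in the shell for readers who want to experiment before compiling anything. This is a sketch of the idea, not the program in Listing 1: the function name is_ascii and its exit codes are assumptions made for illustration, and unlike the C program it reads the whole file rather than stopping at the first non-ASCII byte.

```shell
# is_ascii FILE
# Hypothetical helper. Exit status: 0 if every byte in FILE is in the
# 7-bit range 0-127, 1 if at least one byte is outside that range,
# 2 if FILE is not a readable regular file.
is_ascii() {
    [ -f "$1" ] && [ -r "$1" ] || return 2
    # Delete all 7-bit bytes and count whatever is left over.
    # LC_ALL=C makes tr operate on raw bytes rather than characters.
    leftover=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    [ "$leftover" -eq 0 ]
}
```

Note that, as with isascii() itself, control characters and null bytes count as ASCII here; only bytes with the high bit set mark a file as non-ASCII.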

You can reverse the results by using the -v flag to identify binary files. For example, here's how to find files that contain at least one non-ASCII character:

ls | isascii -v -

isascii accepts filenames on the command line. You can also pipe a list of filenames to it, or specify a file that contains a list of filenames. No xargs is required. When only one file is being checked, you can use the exit status as well. For example, to ensure the file is ASCII before editing, you could do this:

if isascii "$file"
then
    vi "$file"
else
    echo "$file is not an ASCII file"
fi
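The filename-list mode and the -v flag described above can be imitated with a short shell loop. Again, this is a sketch under assumptions rather than the author's code: ascii_filter is a hypothetical name, it expects one filename per line on stdin, and it silently skips anything that is not a readable regular file.

```shell
# ascii_filter [-v]
# Hypothetical helper: read filenames on stdin, one per line. Print the
# names of files containing only 7-bit bytes; with -v, print the names
# of the files that contain at least one byte above 127 instead.
ascii_filter() {
    invert=no
    [ "${1:-}" = "-v" ] && invert=yes
    while IFS= read -r name; do
        # Skip directories, devices, and unreadable files.
        [ -f "$name" ] && [ -r "$name" ] || continue
        # Count the bytes that survive once all 7-bit bytes are deleted.
        high=$(LC_ALL=C tr -d '\000-\177' < "$name" | wc -c)
        if [ "$high" -eq 0 ]; then kind=ascii; else kind=binary; fi
        if [ "$invert" = no ] && [ "$kind" = ascii ]; then
            printf '%s\n' "$name"
        elif [ "$invert" = yes ] && [ "$kind" = binary ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

With this loaded, ls | ascii_filter behaves roughly like ls | isascii -, and ls | ascii_filter -v roughly like ls | isascii -v -.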

Other Uses

We have a version control shell script that cycles through thousands of files every night. Occasionally, the script fails. Recently, I discovered the script would fail whenever awk dumped core. Why would awk dump core? A binary file was slipping through the file filter, and awk, like grep, is designed to work with ASCII files of reasonable line length. The binary file that was slipping through the filter either had no lines or the lines were too long, and so awk failed. Adding isascii to filter out and report on the binary files prevents this type of problem from ever happening again.

About the Author

Steven G. Isaacson works with the Quality Assurance group at FourGen Software in Seattle, Washington: http://www.fourgen.com. He can be reached via email at stevei@fourgen.com.