isascii: The Easy Way to Work around Binary Files
Steven G. Isaacson
I needed to find all occurrences of the string DBDATE
in our shell
scripts. Unfortunately, I did not have a list of files
to change. All I
knew was that several files would need to be changed
and that these
files resided in our local bin directory. Usually, identifying
files
that contain a given string is a simple task; you cd
to the directory
and grep for the string in question. The task is not
so simple, however,
when the directory contains binary as well as ASCII
files.
Here's what can happen. When you grep for a string of
text in a binary
file, you may find what you're looking for. You may
also find that your
terminal suddenly locks up, or starts to beep, or decides
to display
everything in a hitherto unknown graphics mode.
The problem is that grep is the wrong tool for the job.
grep is the
wrong tool because it is line oriented, and binary files
are not. grep
looks for regular expressions (such as DBDATE), and
when it finds a line
containing the regular expression, it prints the entire
line. The entire
line consists of the regular expression and whatever
else happens to be
in the file up to and including the next new-line character.
It's those "whatever else happens to be in the file" characters,
which can include terminal command sequences, that cause your
terminal to misbehave.
Sometimes you do want to look for strings in binary
files. For example,
when trying to figure out which program is producing
an error message,
you can use the strings program to identify strings
of text in binary
files. The strings program prints strings of text in
object or binary
files, and defines a string as four or more printing
characters ending
with a new-line or null character.
To see whether DBDATE is compiled into binary.exe, you
can execute:
strings binary.exe | grep DBDATE
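To make the definition above concrete, here is a minimal sketch in C of a
strings-style scan. It follows only the rule quoted in this article (four or
more printing characters ending with a new-line or null character); it is
not the source of any real strings implementation, and the names
print_strings and MIN_RUN are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define MIN_RUN 4   /* minimum run length, per the definition above */

/* Write to `out` each run of MIN_RUN or more printing characters that
 * is terminated by a new-line or null byte. Returns the number of runs
 * written. (Hypothetical helper, not taken from any strings source.) */
size_t print_strings(const unsigned char *buf, size_t len, FILE *out)
{
    size_t start = 0;   /* index where the current run began */
    size_t found = 0;

    for (size_t i = 0; i < len; i++) {
        if (isprint(buf[i]))
            continue;                   /* still inside a candidate run */
        if ((buf[i] == '\n' || buf[i] == '\0') && i - start >= MIN_RUN) {
            fwrite(buf + start, 1, i - start, out);
            fputc('\n', out);
            found++;
        }
        start = i + 1;                  /* next run begins after this byte */
    }
    return found;
}
```

Under this rule, a run that reaches the end of the buffer without a
terminating new-line or null is not reported.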
Occasionally, I also use strings to poke through a core
file, looking for
clues about what caused the core dump. Sometimes the
shell fails,
sometimes awk is overtaxed. Checking the strings in
a core file may give
you an inkling about where to look next.
So what do you do if you're in a directory that contains
both shell
scripts and binary files and you want to avoid the binary
files? One
approach is to use the file command to determine whether
each file is
ASCII or not. The file command reads the magic number
at the start of a
file and identifies the file type. If file says "executable
something,"
then that file is a binary file. If file says "text
something," then you
can grep or awk or sed it.
For example:
$ file /usr/bin/vi
/usr/bin/vi: s800 shared executable dynamically linked
$ file /etc/profile
/etc/profile: ascii text
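The idea of a magic-number check can be sketched in a few lines of C. This
is only an illustration of the technique: the real file command consults a
much larger table of signatures, and the two signatures tested here (the ELF
header and the "#!" script prefix) are examples chosen for this sketch, not
types mentioned in this article. The function name guess_type is
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Guess a file type from its first bytes. Only two well-known
 * signatures are checked; anything else is reported as "unknown". */
const char *guess_type(const unsigned char *head, size_t n)
{
    if (n >= 4 && memcmp(head, "\177ELF", 4) == 0)
        return "ELF executable or object";
    if (n >= 2 && head[0] == '#' && head[1] == '!')
        return "script text";
    return "unknown";
}
```

A caller would read the first few bytes of a file into a buffer and pass
them in along with the number of bytes actually read.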
Unfortunately, file has two drawbacks. First, it's one of those UNIX
utilities, like chmod, chgrp, and ls, that can't read filenames from
stdin. Instead, you have to call the program over and over or else
use xargs to avoid the "arg list too long" error. Second, it's not
infallible. I've seen binary programs identified as awk text.
I wanted a fast, easy, reliable way to distinguish binary
files from
ASCII files. I wanted a program I could pipe filenames
to and get the
names of ASCII files back. I wanted to, for example,
cd to a directory
and easily find out what was ASCII and what was not,
simply by handing
it the output from ls like this:
ls | isascii -
Once you have the list of ASCII files, the grepping
is easy.
grep DBDATE `ls | isascii -`
And if you're using the Korn shell, you can pop down
into command-line
history, add -l to get just the filenames, tack vi on
to the front of
the results, and get to work.
vi $(grep -l DBDATE $(ls | isascii -))
The dollar sign and parentheses are Korn-shell syntax. In the Bourne
shell, the inner backquotes must be escaped with backslashes:
vi `grep -l DBDATE \`ls | isascii -\``
With a program like isascii, it's also fun to do a little
investigating.
For example, cd to /usr/bin and see which files are
ASCII and which
ones are not.
$ cd /usr/bin
$ ls | isascii -
auto
calendar
sccssdiff
spell
which
If you want to know how which works, take a look!
The Code
isascii (see Listing 1) is a simple program. It reads the specified
file, testing each character with the C library's isascii()
function. As soon as a non-ASCII character is found, you have the
answer: non-ASCII. Otherwise, you reach the end of the file, and you
also have the answer: ASCII.
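The loop just described can be sketched as follows. This is a minimal
illustration written for this discussion, not a copy of Listing 1; the
function names stream_is_ascii and file_is_ascii and the -1 return value
for an unreadable file are assumptions of the sketch.

```c
#include <ctype.h>   /* isascii() is a POSIX/XSI extension, not ISO C */
#include <stdio.h>

/* Return 1 if every byte read from fp is 7-bit ASCII, 0 otherwise. */
int stream_is_ascii(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF) {
        if (!isascii(c))
            return 0;   /* first non-ASCII byte settles the question */
    }
    return 1;           /* reached end of file without finding one */
}

/* Return 1 for ASCII, 0 for non-ASCII, -1 if the file cannot be opened. */
int file_is_ascii(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int result;

    if (fp == NULL)
        return -1;
    result = stream_is_ascii(fp);
    fclose(fp);
    return result;
}
```

Note that isascii() accepts any value from 0 through 127, so control
characters and null bytes count as ASCII under this test, and an empty file
is reported as ASCII.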
You can reverse the results by using the -v flag to
identify binary
files. For example, here's how to find files that contain
at least one
non-ASCII character:
ls | isascii -v -
isascii accepts filenames on the command line. You can also pipe a
list of filenames to it or specify a file that contains a list of
filenames, so no xargs is required. You can also use the exit status
when only one file is being checked. For example, to ensure the file
is ASCII before editing, you could do this:
if isascii "$file"
then
    vi "$file"
else
    echo "$file is not an ASCII file"
fi
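The filename-list mode and the -v flag described above might be handled
along these lines. This is a hedged sketch, not the code from Listing 1: the
names check_file and filter_names, the 4096-byte line buffer, and the choice
to silently skip unreadable files are all assumptions made for
illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* 1 = every byte is ASCII, 0 = some byte is not, -1 = cannot open. */
static int check_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int c, ok = 1;

    if (fp == NULL)
        return -1;
    while (ok && (c = getc(fp)) != EOF)
        ok = isascii(c) != 0;
    fclose(fp);
    return ok;
}

/* Read one filename per line from `names` and write the selected names
 * to `out`. With invert == 0, ASCII files are listed; a nonzero invert
 * mimics -v and lists the non-ASCII files instead.
 * Returns the number of names written. */
int filter_names(FILE *names, FILE *out, int invert)
{
    char line[4096];    /* assumed upper bound on one filename */
    int printed = 0;

    while (fgets(line, sizeof line, names) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the new-line */
        if (line[0] == '\0')
            continue;                       /* ignore blank lines */
        int ascii = check_file(line);
        if (ascii < 0)
            continue;                       /* unreadable: skip it */
        if ((ascii == 1) != (invert != 0)) {
            fprintf(out, "%s\n", line);
            printed++;
        }
    }
    return printed;
}
```

Passing stdin as the names stream gives the behavior of "ls | isascii -",
and passing an opened list file covers the case where the filenames are
stored in a file.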
Other Uses
We have a version control shell script that cycles through
thousands of
files every night. Occasionally, the script fails. Recently,
I
discovered the script would fail whenever awk dumped
core. Why would awk
dump core? A binary file was slipping through the file
filter, and awk,
like grep, is designed to work with ASCII files of reasonable
line
length. The binary file that was slipping through the
filter either had
no lines or the lines were too long, and so awk failed.
Adding isascii
to filter out and report on the binary files prevents
this type of
problem from ever happening again.
About the Author
Steven G. Isaacson works with the Quality Assurance
group at FourGen
Software in Seattle, Washington: http://www.fourgen.com.
He can be
reached via email at stevei@fourgen.com.