What does the phrase "binary data" or "binary format" mean? Call me a
pedant, but I was under the impression that *all* digital data is binary
data. Binary means "expressed in ones and zeros", right? Given that, does
it not follow that all digital data without exception is binary data?
The truth is, the phrase "binary data" is a misnomer. It is used as the
opposite of the phrase "plain ASCII": in other words, data that cannot
be grokked with a simple text editor. As XML takes hold to become, to
use Tim Bray's characterization, "the new ASCII"[1], binary data will be
re-interpreted to mean the opposite of "plain XML". It is still a
misnomer, of course, because all XML is made up of binary data, but we
are clearly stuck with the phrase at this stage.
Why would an application designer use a binary format to store data? The
single most common rationale for not using plain old XML goes something
like this:
"The application needs to be fast and efficient, therefore it stores
data in a binary format rather than XML."
This is often both a non sequitur and a ruse. Let's start with the non
sequitur.
The PC I am writing this article on is really fast. It has a 2000MHz
processor that spends most of its time doing precisely *nothing*. Raw
CPU power is simply not a scarce commodity. I'm not saying that I have
enough power and that my applications go fast enough. That is never the
case. However, the bottlenecks that slow my applications down these days
are not related to lack of CPU power.
As for efficiency, yes, we all care about how much disk/bandwidth our
data uses. However, we live in a world in which compression algorithms
such as the ubiquitous ZIP format have been commoditized. Given that,
the efficiency arguments against native XML storage can be dealt with by
simply zipping XML to/from disk. In other words, we can get the
plain-text benefits of using native XML data formats and yet have
disk/bandwidth efficiency at the same time.
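
To make the zipping idea concrete, here is a minimal Python sketch using
only the standard library. The record data, the member name "data.xml"
and the function names zip_xml and unzip_xml are made up for
illustration; the actual saving will depend on how repetitive your XML
is.

```python
import io
import zipfile


def zip_xml(xml_text: str, member_name: str = "data.xml") -> bytes:
    """Return a ZIP archive (as bytes) holding xml_text as a single member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, xml_text.encode("utf-8"))
    return buffer.getvalue()


def unzip_xml(archive_bytes: bytes, member_name: str = "data.xml") -> str:
    """Read the XML text back out of an archive produced by zip_xml."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member_name).decode("utf-8")


# A deliberately repetitive, hypothetical XML document.
records = "".join(
    f"<record id='{i}'><name>item {i}</name></record>" for i in range(1000)
)
xml_text = f"<?xml version='1.0' encoding='UTF-8'?><records>{records}</records>"

packed = zip_xml(xml_text)
restored = unzip_xml(packed)

print(len(xml_text.encode("utf-8")), "bytes of XML ->", len(packed), "bytes zipped")
```

The round trip gives back exactly the same plain-text XML, so nothing is
lost by storing the compressed form.
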
Case in point: OpenOffice[2]. Create a document with its word processor
and save it to disk. The file saved to disk is actually a simple zip
file. Open it with any zip reader. You will see some XML files. There is
an XML file for the text of the document; an XML file for the style
information; an XML file for the metadata and so on. The result?
Efficient storage with plain text XML encoding of the information.
Beautiful!
Now let's turn to the ruse that often underlies the rationale for binary
data formats. Here is my definition of "binary data": data owned by
somebody else. Namely, the entity that created the program that
reads/writes the data in a form I don't understand and cannot read.
It's a sobering thought. Who really owns the data on your disk that is
stored in an application-specific binary format? You or the application
that created it? With OpenOffice, I can dip into my data with commodity
software. I can write single-page Python applications that do useful
things to my OpenOffice XML data without having to learn any application
APIs or buy any software. I feel a higher degree of ownership and
control over the data I wield with OpenOffice. I like ownership and
control. Do I have ownership and control of the native binary formats on
my machine? In a word, no.
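
As a rough illustration of the kind of short script I mean, the sketch
below lists the members of a zipped document and pulls the text out of
its content.xml. To keep it self-contained it builds a tiny stand-in
archive in memory; the element names in that stand-in are invented and
far simpler than what OpenOffice really writes, and the function names
are my own. For a real document, pass the file's path instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def list_members(archive_file):
    """Return the names of the files stored inside a zip archive."""
    with zipfile.ZipFile(archive_file) as archive:
        return archive.namelist()


def extract_text(archive_file, member_name="content.xml"):
    """Return all character data in one XML member, with whitespace normalised."""
    with zipfile.ZipFile(archive_file) as archive:
        root = ET.fromstring(archive.read(member_name))
    # itertext() walks every element, so no knowledge of the
    # application's element names or namespaces is needed.
    return " ".join(" ".join(root.itertext()).split())


def make_sample_document():
    """Build a hypothetical, much-simplified stand-in for a saved document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "content.xml",
            "<document><p>Binary data?</p><p>Just say NO!</p></document>",
        )
        archive.writestr("meta.xml", "<meta><title>Sample</title></meta>")
    return buffer.getvalue()


sample_bytes = make_sample_document()
members = list_members(io.BytesIO(sample_bytes))
text = extract_text(io.BytesIO(sample_bytes))

print(members)
print(text)
```

Nothing here relies on an application API: a zip reader and an XML
parser, both commodity tools, are enough to get at the data.
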
The OpenOffice approach combines openness with efficiency in a beautiful
way. I hope that it starts a trend, one that begins the process of
exposing the binary data ruse for what it is: a way of
ensuring that your data is locked into the application that created it.
In a world with commoditized XML and compression mechanisms, is there
any justification for binary, proprietary data formats? Can we look
forward to a day in which the only binary files on our disks are
compressed archives of XML or throwaway compiled files created as
required from the native XML? I don't see why not.
Binary data? Just say NO!
NOTES
[1] http://dblab.ce.cnu.ac.kr/~dolphin/xml/wsw.html
[2] http://www.openoffice.org