A standard binary format description language, an idea

The following are my thoughts on developing software that 
processes binary files, and on the need for a language that describes 
binary formats.

File formats are varied, and each requires different software for reading 
and manipulation. For some formats, such as SGML descendants or comma- or 
line-separated text files, there is usually software that can 
handle a family of similar formats. For example, any tag-based reader 
can easily be made to handle XML, HTML, MathML and other variants with 
little modification.
Other formats, usually ones that are binary and compressed, have less of 
a family resemblance between them when it comes to their internal 
structure. For example, two different image formats can be very similar 
in capabilities and hold similar information, yet no two can share the 
same reader software. Image manipulation software therefore has to be 
able to handle all the different formats and their variations in 
separate bits of code, though most of them share the same basic 
structural ingredients (e.g. headers, data buffers, what have you).

Specifications of binary formats are usually given in the form of 
C structs that denote different types of data chunks and their relative 
order and relation within the file/document. These basically tell the 
programmer the location, by size and offset, of fixed-size data fields. 
This information is used to extract data in a predetermined fashion from 
a file and then, based on that data, know how to extract the rest of the 
information.
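
As a minimal sketch of this header-first style of reading, in Python: 
the layout below (a 4-byte tag, a 4-byte little-endian length, then the 
payload) is invented purely for illustration and does not correspond to 
any real format.

```python
import struct

# Hypothetical layout, invented for illustration (not a real format):
#   offset 0, 4 bytes: magic tag
#   offset 4, 4 bytes: payload length, unsigned, little-endian
#   offset 8, N bytes: payload
HEADER = struct.Struct("<4sI")


def read_record(blob: bytes):
    """Read the fixed-size header, then use it to locate the payload."""
    magic, length = HEADER.unpack_from(blob, 0)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return magic, payload


# A header, a 5-byte payload, and some unrelated trailing bytes.
sample = HEADER.pack(b"DEMO", 5) + b"hello" + b"trailing"
print(read_record(sample))
```

The point is that the size and offset knowledge lives in the code 
itself, so every new format needs its own version of this function.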

There is a need for a single *standard* binary format description 
language. Format descriptions written in this language would instruct 
software how to read and manipulate files in arbitrary formats, much 
like XML can be used to describe many of the text-based formats 
that exist today. In fact, XML seems to me ideal for the task. Given such 
a language, generic software may be written that can process any type of 
binary file and present it or manipulate it. For example, image 
manipulation software developers can be handed the format description of 
a certain image format and be able, with little additional coding 
effort, to add support for that format in their software. Such support 
may even be added on-the-fly for some types of software in a way similar 
to how codecs work today in audio and video software.

The main advantage of this language would be for the developer 
community. It would allow a repository of all known, public binary 
formats to exist, which would make possible the development of generic 
software that processes binary files. Such generic software may be the 
code base of some higher-level, more domain-specific software. This would 
ease development efforts and allow more interoperability between existing 
supported formats and new, emerging formats.

An example of how a binary format description language file might look 
(for an imaginary image format - 'beatmap'):

<format name="beatmap">
    <header mandatory="true">
       <file-size size="8" big-endian="true"/>
       <color-depth size="4"/>
       <width size="8"/>
       <height size="8"/>
       <compression>
          ...
       </compression>
       <data-chunk-size>1024</data-chunk-size>
    </header>
    <data-chunk optional="true" multiple="true"/>
</format>

A generic binary file editor, given this format specification, can know 
how to handle 'beatmap' files. It knows that the first 8 bytes denote 
the size of the file. It knows exactly from where and how the rest of 
the header information can be extracted and it knows how to handle the 
rest of the file: how to fetch the data chunks and how to process them. 
This editor may display to the users a tree representing the structure 
of the binary files so that they may analyze it.
Developers can use such generic binary format processing code to add 
support for 'beatmap' files in their own software. Since the data is 
already processed by the lower-level code and presented to them as 
objects in their programming language, all that remains to be done is to 
actually process the data - decompress/manipulate/display it.
Format conversion software can easily be designed and developed using 
similar files that describe how two formats differ and what should be 
done to convert a file in one to the other.
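
To make the idea of a description-driven reader concrete, here is a 
rough Python sketch, not a definitive design. It uses a well-formed 
variant of the 'beatmap' header description (attribute values quoted, 
the elided compression element left out) and reads only the fixed-size 
fields. The function name read_header and the rule that fields not 
marked big-endian are little-endian are my own assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Well-formed variant of the 'beatmap' header description (attribute
# values quoted; the elided <compression> element is left out).
DESCRIPTION = """
<format name="beatmap">
    <header>
        <file-size size="8" big-endian="true"/>
        <color-depth size="4"/>
        <width size="8"/>
        <height size="8"/>
    </header>
</format>
"""


def read_header(description: str, stream) -> dict:
    """Read each fixed-size header field in document order."""
    header = ET.fromstring(description).find("header")
    values = {}
    for field in header:
        size = int(field.get("size"))
        raw = stream.read(size)
        if len(raw) != size:
            raise ValueError(f"truncated field: {field.tag}")
        # Assumption: fields not marked big-endian are little-endian.
        order = "big" if field.get("big-endian") == "true" else "little"
        values[field.tag] = int.from_bytes(raw, order)
    return values


# Build a small in-memory 'beatmap' header to exercise the reader.
sample = (
    (28).to_bytes(8, "big")
    + (24).to_bytes(4, "little")
    + (640).to_bytes(8, "little")
    + (480).to_bytes(8, "little")
)
print(read_header(DESCRIPTION, io.BytesIO(sample)))
```

Nothing in read_header is specific to 'beatmap'; handing it a different 
description would make it read a different header, which is the whole 
point of keeping the format knowledge in the description file.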

With today's proliferation of binary formats used to pass on different 
types of information in many distinct ways, a common language 
should exist that will be able to describe them all - for the benefit of 
users and developers of software. This would be a significant step in 
the standardization effort of digital information.

Assaf Lavie

Received on Sunday, 6 July 2003 20:19:43 UTC