Self-Evaluation of the "BX" Proposal
Paul R. Pierce
Abstract
An evaluation of the "BX" proposal for binary XML against the properties described by the XML Binary Characterization working group.
Revision
February 22, 2006
Contents
The call for proposals for a binary format for XML requested an evaluation of each submitted proposal against the desirable properties set forth in the document "XML Binary Characterization Properties", W3C Working Group Note 31, March 2005. As I've submitted a proposal it follows I should submit an evaluation, and here it is. Section numbers in parentheses refer to the Working Group note.
Briefly, the "BX" proposal outlines a binary format for XML in terms of a set of optional, independent, reversible transformations between standard XML and a binary form. Some of the transformations are concerned with translation between text-oriented XML and a compact binary representation of the information, the rest provide options for compression using well known techniques. The key translation transformations are schema driven, using either XML-Schema as found or annotated with markup specified for the purpose of directing the translation. All transformations operate on symbols, essentially integers of different types and initially the characters of a standard XML document.
2.1 (4.1) Processing Efficiency
The processing efficiency for serialization, parsing and data binding will vary depending on the nature of the program or library performing the task and the transformations selected. Applications requiring extreme speed (especially with numerical data) can read or write the format directly from an appropriately designed data model, see section 3, Direct Read and Write.
General purpose libraries within the XML stack should be able to perform most of the transformations at about the same speed as with text XML. Some combinations of transformations might prove to be significantly faster. In particular, it should be possible to code a schema processor that builds parsing tables driven by the numeric tokens produced by the Name, Entity and Value transformations, so that both serialization and parsing can run much faster than a text based implementation.
Compression transformations should be able to run at speeds comparable to that of corresponding existing utilities such as gzip and bzip2. Again there may be an opportunity for higher speed as the transformations are specified to operate on whole symbols instead of bytes as with the existing utilities. As with existing utilities, there is a trade-off between running time and compression efficiency.
Because of its flexibility due to the number of different transformations specified, a full general purpose implementation will not be as small as a standard XML implementation. However, extremely small footprint is possible with Direct Read and Write as described in section 3, and a small footprint should be possible for other selected sets of transformations.
A full general purpose implementation, particularly with high performance compression transformations, is likely to require at least as much space as a standard XML implementation, and on the other hand a program using the Direct Read and Write (section 3) features can be very space efficient. In between, it should be possible to create something like a DOM implementation, at least for a useful subset of transformations, that is significantly more space efficient than a general purpose standard XML implementation.
2.4 (5.1) Accelerated Sequential Access
The extensive use of length (in bytes) symbols is meant to facilitate quick skips through the data without parsing. I would like to extend this to include optional length symbols after each token but haven't figured out the best way to do that yet.
The format is theoretically capable of extreme compactness, just short of the best known lossless compression. There is a wide range of trade-off possible between quite compact and very fast to extremely compact but somewhat slow.
2.6 (5.3) Content Type Management
The proposed format is specified to be nothing more than an encoding. Thus, content type management is not an issue. However, it includes one transformation (Declaration) that might make it worth considering a single additional media type.
The proposal does not address deltas. Any work on standard XML with respect to deltas would likely carry forward to this proposed format with little extra effort.
2.8 (5.5) Directly Readable and Writable
The proposal is particularly designed to allow tight specification of a data format that might be read or written directly from the in-memory data model, without parsing or serialization at all unless the Octet transformation or some combination of compression transformations are also selected. See section 3.
When using the Direct Read and Write (section 3) features it will sometimes be possible to allow for direct update in place, as long as there is no change in the size of the data or data is only being appended to the end. As these are two common cases there should be many happy opportunities for efficient update.
The proposal does not specify embedding, but contemplates a separate container specification that it might make use of in certain situations.
The proposed format conveys all existing XML information and is fragmentable, so it should support existing encryption techniques.
The Octet transformation, when selected for serialization, produces a stream that includes basic type information gleaned from the schema.
There is no explicit general mechanism for extension in the proposal. Instead, its overall design lends itself to extension by adding new transforms that can be selected directly in the XML declaration, or by adding new schema annotation to select new options in the existing transformations. There is also some room in the value space of type tokens for new basic types.
2.14 (5.11) Format Version Identification
Yes. (Its in the XML declaration).
The proposal inherently supports fragments to the same degree as standard XML. In particular, consider that a block might contain a fragment, and all transformations are block-aware.
The proposal is all about flexibility, in order to be as general as possible. It is particularly designed to be useful for very short or long documents; to make the best of highly structured (especially numeric) documents while still providing advantages for loosely structured documents; and while many important transformations are schema driven others do not need a schema and for those that do there may be an alternative.
2.17 (5.14) Human Language Neutral
The proposal is as neutral as standard XML and XML-Schema are already.
2.18 (5.15) Human Readable and Editable
The proposed format is not in general directly human readable in any of its optional forms except when only trivial transforms are selected. Of the considerations mentioned in the Binary Characterization Properties report two of note are the property of being self-contained (see below) and the principle of keeping things in their original order, which happens unless the Transpose or Block Sort transformations are selected.
2.19 (5.16) Integratable into XML Stack
By appearing as an encoding and supporting all of XML this proposed format is intended to be completely and transparently integratable into the XML stack. A full, general purpose implementation can be hidden behind the existing SAX or DOM API, activated automatically when the desired encoding is specified in the XML declaration of a document.
In general the proposed format supports this property as well as or better than existing XML, except when the more aggressive compression transformations are selected.
2.21 (5.18) No Arbitrary Limits
A few aspects of the proposal are limited to the range of a 64-bit integer, otherwise the proposed format does not impose any limits that do not already exist in standard XML. Block sizes are limited to 1 megabyte, however that does not limit the range of expression of the format since data items (including strings and lists) can cross block boundaries down to (but not including) the lowest, atomic level.
2.22 (5.19) Platform Neutrality
The proposal is entirely platform neutral except in the use of "network byte order", which is of course standard but the opposite of most of the general purpose computers in the world. It might be a very good idea to add options via the schema annotation namespace to force a particular byte order or to allow for "reader makes right".
The proposed format does not specifically support random access but is more or less indexable. Even in the most compressed form, it would be possible to create some sort of index to each of the lowest level data items or to fragments, though to access them might require decompression of their block.
The proposed format is fairly robust at the most basic level, in that the separate blocks of a document are mostly independent. It is possible to do better within the constraints of the specified format, by forcing block boundaries at specified fragment boundaries through schema annotation, and by including error codes within the regular data items of each fragment.
The proposed format is specified in terms of round tripping, full support is inherent in its design.
2.26 (5.23) Schema Extensions and Deviations
Although important transformations are schema driven, the proposal discusses ways to work around this requirement when necessary. Once a workaround has been selected, the Octet transformation provides a means of expressing the extended information in the binary format.
2.27 (5.24) Schema Instance Change Resilience
One of the important features of the schema annotation in the proposal is to fix the values of tokens used for elements and attributes. With these annotations a schema can be designed so that changes will not affect existing applications (without schema update) as long as communication with those applications is restricted to the common subset of the old and new schema. Alternatively, the Octet transformation can be used (as it would be for Schema Extensions and Deviations) so that applications parsing data from a new schema can detect that there have been changes. Also see Self Contained below.
Although schema driven, the binary format can be self contained in one of two ways. As noted above, the Octet transformation can be selected to embed schema information in the data. This is the main reason for the existence of the Octet transformation. Alternatively, as noted in the proposal, a separately specified XML container format can be used to encapsulate the schema with the data.
See Encryptable, the proposed format is intended to be compatible with existing techniques for signing documents.
2.30 (5.27) Specialized Codecs
I feel this should be addressed in an XML container specification.
The proposed format supports streamable data in two ways. First, the configurable block size limits the amount of data that must be handled at one time. Second, for finer streamability the block oriented transformations might not be selected; then the format is at least as streamable as standard XML.
2.32 (5.29) Support for Error Correction
The proposed format does not specify support for error correction. However, see Robustness above.
2.33 (5.30) Transport Independence
The proposed format is transport independent as long as the transport can convey arbitrary binary data.
2.34 (6.1) Forward Compatibility
The proposed format is designed for forward compatibility as discussed in the Binary Characterization Properties report.
2.35 (6.2) Implementation Cost
The proposal is complex due to its high flexibility and ambitious reach. It is reasonable to expect that the total cost for a full implementation would be significantly greater than for a corresponding implementation for standard XML. As with other properties, a Direct Read and Write implementation, on the other hand, is very easy.
The proposal is particularly designed with the problem of a full implementation in mind, through the use of independent transformations. After a relatively small amount of base development, each transformation (and its reverse) can be implemented independently or not at all if it will not be used in a particular application. In addition, once full implementations become available it will be possible to very quickly implement and test individual applications that use the Direct Read and Write feature. These applications will immediately be able to interchange data with other applications that make use of the full implementation.
The proposal is in the public domain, and I have no patents or other intellectual property rights on any of the specified mechanisms.
2.37 (6.4) Single Conformance Class
The proposal does not define conformance class(es). Although a single conformance class might be assumed, the complexity of the proposal allows for an argument to consider multiple classes.
2.38 (6.5) Widespread Adoption
Several of the above properties, especially the ease of integration into the XML stack, argue for the possibility of widespread adoption. The main hurdle, due to total implementation cost, would be to integrate a full implementation into one or more of the most popular XML libraries.
The example in the proposal does not clearly show how the format can be easily manipulated to map XML directly to a useful (though not arbitrary) data model.
Consider this XML document:
<?xml version="1.0"?> <run xmlns="http://example.com/run"> <description date="2004-02-07"> Morning run, west side </description> <data mark="1600" top="-31" length="3"> <sample x="123.456" y="107.89" flow="76.414" count="5662"/> <sample x="127.564" y="102.91" flow="72.952" count="4828"/> <sample x="128.654" y="101.32" flow="76.385" count="5503"/> </data> </run>
|
And a corresponding schema:
<?xml version="1.0"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:bx="http://example.org/BX"> <bx:binary> <bx:required bx:encoding="BX-1.0-DSNV"/> </bx:binary> <xsd:element name="run"> <xsd:sequence>
<xsd:element name="description"> <xsd:attribute name="date" type="xsd:date" use="required"/> <xsd:simpleType> <xsd:restriction base="string"> <xsd:length value="40"/> </xsd:restriction> </xsd:simpleType> </xsd:element>
<xsd:element name="data"> <xsd:attribute name="mark" use="required"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="65536"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute> <xsd:attribute name="top" use="required"> <xsd:simpleType> <xsd:restriction base="xsd:integer"> <xsd:maxExclusive value="32768"/> <xsd:minInclusive value="-32768"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute> <xsd:attribute name="length" use="required"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="4194304"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute> <xsd:sequence> <xsd:element name="sample" minOccurs="0" maxOccurs="unbounded"/> <xsd:attribute name="x" type="xsd:float" use="required"/> <xsd:attribute name="x" type="xsd:float" use="required"/> <xsd:attribute name="flow" type="xsd:float" use="required"/> <xsd:attribute name="count" use="required"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="1000000"/> </xsd:restriction> </xsd:simpleType> </xsd:attribute> </xsd:element> </xsd:sequence> </xsd:element> </xsd:sequence> </xsd:element> </xsd:schema>
|
The schema tells exactly which transforms to select. It defines a fixed structure, with the only variable being the number of sample elements, and since these come at the end the number can be easily deduced from the file size of the binary document. It also defines a fixed length or value range for every type.
Everything is sufficiently constrained that a parser would need no hints about element or attribute order or data size. Thus, the name and value transformations would mark all tokens of width zero, so that they don't appear.
Every document serialized using this schema would map to the following C structure:
typedef struct { char ch[3]; } unicode;
struct { char declaration[6]; unsigned long schemaLength; unsigned long schemaParty; unsigned long schemaDocument; unsigned long date; unicode description[40]; unsigned short mark; signed short top; unsigned long length; struct { float x; float y; float flow; unsigned long count; } sample[*]; };
|
Suppose that one wanted to build a data logger that would take samples in this format, then use off-the-shelf data reduction software to process them. The code for the embedded microprocessor in the data logger could be written in C using the structure above, where it records samples directly into memory and then writes the data file directly out from memory. The data reduction software from vendor A would use an existing XML library that had been augmented by its vendor B with a full implementation of the binary format. On being presented with the schema, the data reduction tool would display the available fields to the user who would instruct it how to deal with the data. The tool could then read any file produced by the data logger.
The proposal as written is particularly good for numerical data, since that easily conforms to fixed width fields. It might be interesting to explore the possibility of an additional transformation that would map strings to a separate, efficiently stored string table at one end or the other of the document or in each block, replacing each instance of a string value with a token that is an index into the string table.
The proposed binary format has all of the properties listed as requirements in the XML Binary Characterization report. It meets several of the requirements at the top of the list spectacularly well, even though some are contradictory. At the opposite end is Implementation Cost, which is met with qualifications as discussed in that section above.