RE: XML Core WG review of Efficient XML Interchange (EXI) Format 1.0, draft of 2007-07-16 from Vogelheim, Daniel on 2008-01-29 (public-exi@w3.org from January 2008)

From: Vogelheim, Daniel <daniel.vogelheim@siemens.com>
Date: Tue, 29 Jan 2008 21:49:13 +0100
To: "John Cowan" <cowan@ccil.org>, <public-xml-core-wg@w3.org>
Cc: <public-exi@w3.org>
Message-ID: <2F9CF2D6D27D0C47A1C40B393F034E3D031DDE@MCHP7I7A.ww902.siemens.net>

Hello,

> This is the XML Core WG's review of EXI WD1 (2007-07-16).  [...]

Many thanks for your review! We greatly appreciate it.


It will take us a while to fully take in your feedback and reflect it
in our documents. We'd like to offer right here some brief comments that
reflect our current thinking on the issues you raised, without meaning
to preempt the eventual resolutions by the WG. In terms of time frame,
we just published a 2nd draft of our format specification [1]. This
draft mainly finishes existing content and only partially addresses the
comments and issues you raised, or other valuable feedback from the
recent TPAC meeting. The subsequent draft should be more accommodating.

Additionally, we have published initial drafts of two supplementary
documents, the EXI Primer [2] and the EXI Best Practices [3]. We hope
these will help to better explain our format and its use to the wider
XML community.


So here are our comments:

> 0) The Core XML WG remains concerned about the whole concept of 
> EXI as an alternative representation of XML infosets, but does 
> not have consensus about whether it is a Good Thing, a Bad Thing, 
> or a Neutral Thing. Further comment on this fundamental point may 
> be forthcoming later.

Thank you for your consideration. Judging from the feedback we have
received - particularly at the recent TPAC - we do indeed believe that
this is a central point of any discussion of EXI (and related
technologies).

In this mail, we'd like to call particular attention to only the
following aspect: 

At its core, concerns about the concept of EXI often seem centered
around the perceived benefit/cost ratio of an alternative Infoset
encoding. The benefits are covered elsewhere; for the cost side, please
observe that EXI is expressly intended to be used as an
"opt-in" technology through content negotiation or similar techniques.
For the popular use-case of XML over HTTP transmission using built-in,
standard HTTP content negotiation, this would allow seamless deployment
of EXI among EXI-capable participants or EXI-capable proxy servers,
without any change requirements for those participants that do not wish
or need to implement EXI. Such an approach should guarantee zero cost to
the audience that sees no benefit in EXI for their own purposes, and
should dramatically change the benefit/cost ratio in EXI's favor.
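
As a rough illustration of the opt-in idea (a sketch only, not taken
from any EXI draft): a server could serve EXI only to clients whose
Accept header lists an EXI media type and fall back to plain XML
otherwise. The media type name "application/exi" and the function names
are assumptions made for this example, and q-values are handled only
far enough to honour an explicit q=0.

```python
XML = "application/xml"
EXI = "application/exi"  # assumed media type name, for illustration only


def accepts(accept_header: str, media_type: str) -> bool:
    """Return True if media_type is listed in the header with q > 0."""
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if fields[0].lower() != media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        return quality > 0
    return False


def choose_representation(accept_header: str) -> str:
    """Serve EXI only to clients that asked for it; everyone else gets XML."""
    return EXI if accepts(accept_header, EXI) else XML


print(choose_representation("application/xml, application/exi"))
print(choose_representation("application/xml"))
```

A client that never mentions the EXI media type is served exactly as
before, which is the "zero cost" property described above.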

We would be very interested in discussing this subject further with XML
Core, the TAG, and the general public. We are looking forward to your
further comments.

(The EXI WG is well aware that some proponents of similar technologies
have styled their respective offerings as XML replacements, which is not
and has not been the standpoint of the EXI WG. We hope that the W3C WG's
work will be judged on its own words and merits.)


> 1) We find the draft somewhat hard to follow; in particular,
> the unusual and non-standard grammar notation is not easy to 
> grasp at a glance; the explanation of compression should be 
> postponed to after the grammars section; the explanation of 
> event codes is very hard to follow.

In the most recent draft, we have begun to rework some potentially
confusing parts, for example by improving the grammar notation. The WG
will look at other opportunities to improve the readability of the
specifications.

Additionally, we'd like to draw your attention to the EXI Primer. This
is a supplementary document, which provides a more gentle introduction
to the EXI format. Presumably, most interested parties will look at the
Primer first. With this background, the EXI specification should
become a lot more approachable.


> 2) We believe it is essential to provide (as called out in 
> an editorial note) a better magic number for EXI.  The current 
> magic number is only 2 bits long, and serves to discriminate 
> between EXI and XML, but not between EXI and other formats.
> This should be fixed by using a 3-4 byte magic number.

This is in principle agreed within the WG, but the mechanism(s) are still
under discussion. There are presently several proposals under
consideration that would rectify this. The proposals mainly differ in
the allowed 'magic' identifier(s), and in whether and which of these
identifiers would be mandatory or optional.


> 3) We believe that an XML document containing xsi:type 
> attributes should be treated as a schema-informed document 
> rather than a schemaless document.  This allows processes 
> that create a single XML document to decorate it with 
> xsi:type attributes and then get good compression from
> an EXI encoder following in the pipeline.

There is one easy work-around with the present specification: If the
encoder is informed by an empty XML Schema, it will know about all
built-in XML Schema types and will behave as you suggest.

The WG has not found such documents to be common, and thus we are
unsure of whether such a feature will find widespread use.

The WG intends to discuss this proposal in the more general context
of allowing typed encoding for schema-less documents. The 
evaluation will presumably depend on the expected uptake of such a
feature in the community vs. the complexity it will add to the
specification.


> 8) We are strongly concerned about the concept of pluggable 
> codecs as a barrier to interoperability, and believe that the 
> draft should contain a strong health warning about the use of 
> these: they should be used only in cases where there is explicit 
> agreement between the communicating parties, and never for 
> documents intended for consumption by a general audience.

The EXI WG agrees on this and has added a clarification to the latest
draft.



The comments 4)-7) all concern the representation of simple type
content. These particular items can generally be evaluated by comparing
the performance of a sample implementation over our test suite, which
will be the main criterion for selecting among alternatives. We have
scheduled all of the following for discussion, with 5) and 7) already
being under discussion within the WG.

Again, without preempting any such future evaluation or discussion, here
is a list of comments on and/or reasons for the current representations:


> 4) Reversing the digits when representing decimal fractions 
> (and fractions of seconds in the date-time datatypes) is 
> very unnatural. We think it is better to use a (total digits, 
> scale factor) pair. Thus instead of representing 12.345 as 
> (12,543) it would be (12345,3). This is one byte longer, 
> but much easier to decode properly.

The WG finds it hard to quantify "very unnatural" and "much easier". Our
intention is to compare performance of either method, and presumably
select the simpler one when there is little difference.
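
To make the two candidates concrete, here is a minimal sketch of both
decompositions for non-negative decimals given as text. It is an
illustration of the idea only, not the draft's actual bit layout; the
function names are made up for this example, and signs and trailing
fraction zeros are not handled.

```python
def encode_reversed(text: str) -> tuple[int, int]:
    """Current style: (integral part, fraction digits reversed)."""
    integral, _, fraction = text.partition(".")
    return int(integral), int(fraction[::-1] or "0")


def decode_reversed(integral: int, reversed_fraction: int) -> str:
    """Reversing the digits again restores leading zeros of the fraction."""
    return f"{integral}.{str(reversed_fraction)[::-1]}"


def encode_scaled(text: str) -> tuple[int, int]:
    """Proposed style: (all digits, number of fraction digits)."""
    integral, _, fraction = text.partition(".")
    return int(integral + fraction), len(fraction)


def decode_scaled(digits: int, scale: int) -> str:
    """Re-insert the decimal point 'scale' digits from the right."""
    if scale == 0:
        return str(digits)
    padded = str(digits).rjust(scale + 1, "0")
    return padded[:-scale] + "." + padded[-scale:]


for value in ("12.345", "12.045", "0.5"):
    print(value, encode_reversed(value), encode_scaled(value))
```

Both forms round-trip values such as 12.045, where the fraction has a
leading zero: the reversed form keeps it as a trailing zero of 540, the
scaled form keeps it through the explicit scale factor.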


> 5) IEEE float representation is better on all counts than 
> the EXI-specific representation.  It's true that some hardwares 
> can't process it directly, but *no* hardware can process 
> the current EXI representation.

The IEEE float representation tends to be larger than the current
variable length representation. xsd:float is often used to represent
non-scientific data (e.g. a person's age), where the size difference
matters significantly. So at least in that respect an IEEE 754
representation will be at a significant disadvantage.
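
The size argument can be illustrated with a simplified stand-in for a
variable-length representation: a decimal mantissa and exponent, each
stored as an unsigned integer with 7 bits per byte. This is an
assumption made for the sketch and not the exact EXI bit layout; signs
are ignored, only finite values are handled, and the function names are
illustrative.

```python
import struct
from decimal import Decimal


def varint_len(value: int) -> int:
    """Bytes needed for an unsigned integer at 7 payload bits per byte."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def pair_size(lexical: str) -> int:
    """Size of a (mantissa, exponent) pair, signs ignored, finite values only."""
    _sign, digits, exponent = Decimal(lexical).as_tuple()
    mantissa = int("".join(str(d) for d in digits))
    return varint_len(mantissa) + varint_len(abs(exponent))


def ieee_single_size(lexical: str) -> int:
    """Size of the same value as an IEEE 754 single-precision float."""
    return len(struct.pack(">f", float(lexical)))


for value in ("42", "3.14159", "6.02214076e23"):
    print(value, pair_size(value), ieee_single_size(value))
```

Under these assumptions a small value such as 42 needs 2 bytes against
a fixed 4 for IEEE single precision, while a value with many significant
digits can end up larger than 4 bytes.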

On the plus side, several WG members have significant interest and
experience in using EXI for scientific data transmission and intend to
look very closely at direct IEEE 754 encoding and to evaluate the
corresponding issues.


> 6) The current date-time representation expresses a date as
> ((years-2000), (month*31+day), hour*1440+minute*60+seconds, 
> reversed fractional second).  However, logically years and 
> months can be reduced to months, and days can be reduced to 
> seconds, since leap seconds are ignored. We therefore propose 
> the following triple: ((year-2000)*12+month, 
> day*86400+hour*1440+minute*60+seconds scaled, scale factor).  If
> fraction scaling is rejected, this would become ((year-2000)*12+month,
> day*86400+hour*1440+minute*60+seconds, reversed fractional second).

The current representation was modeled after the various XML Schema date
or time related simple types, and tries to encompass all of them.
Merging several fields usually works well for a type that includes both,
but not so much for one that includes only one. An example would be
xsd:gMonthDay, which fits quite naturally into the EXI representation
but not into the proposed one. Other adversely affected types would
include xsd:gYear and xsd:gDay. One type for which the proposed scheme
may work well is xsd:duration.

An initial analysis suggests that the differences between the two
methods would mostly be pretty small, except maybe in the cases listed
above. If time permits, we'll use the data found in the test suite to
more accurately assess the two.


> 7) We believe that the current representation of strings has no
> material advantage over UTF-8, since although it uses at most 3 bytes
> per character, 4-byte UTF characters are very rare except in documents
> written in obsolete scripts.

The UTF-8 design incorporates a number of features that are not of much
interest in the case of EXI, such as the ability to discern whether any
byte marks the beginning of a character. While for the popular ASCII
characters the compactness is the same, that is not the case for other
character ranges. Note that the EXI design will always do at least as
well as UTF-8.

E.g., there is a range of code points where EXI uses 2 bytes, versus 3
for UTF-8. Any content in such scripts would therefore be 50% larger in
UTF-8 than in current EXI. This would include the Devanagari script
(used in several Indic languages, including Hindi), Thai, Hangul Jamo
(Korean; but not Hangul syllables), and Hiragana and Katakana (Japanese;
but not Kanji/CJK unified ideographs). The EXI WG can't endorse the
rarity claim, as these scripts appear to be in daily use by easily over
one billion people, with little observable tendency to abandon any of
them.
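
The byte counts can be checked with a short sketch. It assumes that each
code point is written as a variable-length unsigned integer carrying 7
bits per byte, which is consistent with the "at most 3 bytes per
character" figure quoted above; the function names are illustrative
only.

```python
def exi_style_len(code_point: int) -> int:
    """Bytes for one code point at 7 payload bits per byte (assumed scheme)."""
    length = 1
    while code_point >= 0x80:
        code_point >>= 7
        length += 1
    return length


def compare(text: str) -> tuple[int, int]:
    """Return (assumed EXI-style size, UTF-8 size) in bytes for text."""
    exi_size = sum(exi_style_len(ord(ch)) for ch in text)
    utf8_size = len(text.encode("utf-8"))
    return exi_size, utf8_size


# ASCII, Devanagari, Hiragana, Hangul syllables
for sample in ("hello", "नमस्ते", "ひらがな", "한글"):
    print(sample, compare(sample))
```

Under this assumption, code points from U+0800 to U+3FFF take 2 bytes
instead of UTF-8's 3, so the six-code-point Devanagari sample comes to
12 bytes versus 18, while ASCII and Hangul syllables come out the same
size in both schemes.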



Again, we'd like to thank you for your thorough review. Due to timing
constraints, the recently released draft will unfortunately not yet
reflect many of your recommendations; please bear with us. We sincerely
hope your attention and criticism will accompany us on our way towards
a Recommendation.



[1] EXI Format Specification, 2nd PWD:
http://www.w3.org/TR/2007/WD-exi-20071219/
[2] EXI Primer, 1st PWD:
http://www.w3.org/TR/2007/WD-exi-primer-20071219/
[3] EXI Best Practices, 1st PWD:
http://www.w3.org/TR/2007/WD-exi-best-practices-20071219/



Yours Truly,
The EXI WG
Received on Tuesday, 29 January 2008 20:55:25 UTC