RE: Support of IEEE float; Canonical XML from Cokus, Michael S. on 2010-08-20 (public-exi-comments@w3.org from August 2010)

From: Cokus, Michael S. <msc@mitre.org>
Date: Fri, 20 Aug 2010 10:45:39 -0400
To: "public-exi-comments@w3.org" <public-exi-comments@w3.org>, "prp@teleport.com" <prp@teleport.com>
Message-ID: <D93B26E07F2DD147A3E17BA1602444B809C98B6D3B@IMCMBX2.MITRE.ORG>
Paul,

I apologize for the delay in responding to your latest email.  Before getting into our point-by-point responses, we wanted to provide an overview of the EXI working group's perspective.  

EXI is a general purpose format.  It is designed to address a wide range of use cases.  The EXI working group has made a sincere effort to fulfill the requirements of the EXI WG Charter.  One of these requirements is to provide a combination of features in EXI which best addresses the XBC use cases as a whole.  If we had developed EXI to address a particular use case, we might have made some design decisions differently.  Concerning floating point representation, we believe we have made choices that provide the best overall benefit.

We want to note that we have sincerely considered your point of view.  Please remember that the EXI working group has had to consider many other points of view when approaching design decisions for the EXI format.  

Please find our point-by-point responses below (inline).

Thanks,

--mike

Mike Cokus (for the EXI WG)


>John Schneider wrote, in part -
>> ...
>> With that background, here is a run-down of the primary drivers that
>> motivate the default EXI floating point representation.
>>
>> 1. Compactness
>>
>> EXI Float is often more compact than IEEE. As part of our analysis, the EXI
>> working group tested the compactness of EXI Float vs. IEEE across our test
>> suite to get an idea which representation provided the best compactness for
>> more EXI use cases.
>
>I suspected the test cases do not present floating point numbers to full
>precision
>(see http://lists.w3.org/Archives/Public/public-exi-comments/2009Jul/0004.html )
>so I did a quick analysis of the number of significant digits used in
>the test cases. The only two sets of interesting cases are AVCL and
>HepRep, these were also the ones showing the greatest advantage to the
>EXI representation. In both sets the floating point datatype is
>xsd:double, but the data is presented to an average of only 6.4 digits
>precision. Given that IEEE double carries more than 16 decimal digits of
>precision, it's clear these sample data sets are not good representatives
>for the kind of massive machine-to-machine data transmission and storage
>for which EXI should be well suited.
>
>The real problem, of course, is that there are only two sets of sample
>data that stress floating point at all. This doesn't provide enough
>information to support a decision one way or the other.
>

In accordance with the EXI charter, our goal was to address the use cases from the XBC working group.  Our aim was not to stress any particular floating point format, but rather to address a collection of use cases.  To this end, the public EXI test suite was constructed to provide a representative sample of applicable instances.


>>
>> 2. Parsing Speed
>>
>> ...  However, if the floating point data is ever displayed to a human,
>> input by a human, converted to XML for interoperability, converted to other
>> text-based protocols (JSON, FIX, SWIFT, EDI, etc.), routed through any of
>> the standard XML APIs (SAX, DOM, StAX, etc.), etc., it must be converted to
>> text and you must incur the associated cost. In addition, if the data is
>> ever validated using XML Schema, transformed using XSLT, secured using XML
>> Security, etc. it must also be converted to text. As such, we expect quite a
>> few use cases would incur the cost above when using IEEE.
>
>The whole point of EXI is to handle data efficiently, which means that,
>if successful, there will be a lot of data.
>
>I think the 80/20 (or 90/10 or ...) rule applies here, especially since
>EXI is supposed to break new ground outside the middle ground covered by
>XML. 80% of documents are small, but carry only 20% of the total data;
>20% of documents are huge and carry 80% of data. It's a rule of thumb
>that often applies to e.g. message traffic. The majority of data will be
>carried in huge documents that by their nature must never and will never
>be translated to text. So, except for debugging small test cases, none
>of the translations above apply except for the last three which are
>important parts of XML automation. For huge documents, XML Schema
>validation and XSLT transformation must be done using carefully crafted
>processors in any case or they will take forever. Because of the nature
>of EXI, XML Schema validation should be easy and quick without
>translating anything into text. XML Security we have already discussed,
>best practice means securing the binary representation instead of
>translating it into text, so that's not a problem.

This is a reasonable rule of thumb for some environments.  However, our goal is to address the XBC use cases, which requires us to consider a wide range of environments.  The rule of thumb you describe is not generally applicable across the use cases under our consideration.

Additionally, we don't agree with the conclusion that large documents would never require translation.  Any document serialized or parsed through one of the standard XML APIs will incur this penalty. And we don't see any evidence which suggests that everyone with large documents will stop using the standard XML APIs.


>
>On the other hand, the existing default EXI floating point
>representation means that there is a huge performance penalty associated
>with massive quantities of floating point data. Because the EXI
>representation is decimal based, every machine generated number has to
>go through the expensive translation step in both directions before it
>gets into a machine that wants to read it. This is a recognized concern
>(see http://lists.w3.org/Archives/Public/public-exi-comments/2009Oct/0001.html ).
>Because most of the floating point data encoded with EXI will be carried
>in huge documents where none of the "small devices" concerns below
>apply, the default representation should be IEEE binary to avoid having
>to specify that explicitly.
>

The working group disagrees that most floating point data will be carried in large documents. For example, most of the floating point data in sensor network and geo-location applications is carried in high-volume streams of tiny messages. We also disagree that none of the small device concerns apply in the case of large documents.


>>
>> 3. Small Devices
>>
>> One of the primary motivations for EXI is to expand the use of XML
>> technologies to a broader range of use cases and platforms, including
>> devices with limited processing power and computing resources.
>
>Three points:
> - Devices small enough that they don't like IEEE floating point are
>unlikely to like floating point at all.
> - I suspect that in the vast majority of these use cases, the numbers
>are geometrically "near" 1.0 in magnitude. They don't need (or want, see
>point 1) exponents and should be type xsd:decimal or equivalent, not
>xsd:float.
> - These are new applications so usually will not have to worry about
>legacy schemas. Since the schemas are new, they can specify the datatype
>correctly to begin with and won't use floating point types except by
>mistake.
>
>These points do not mean that the EXI floating point representation is
>not desirable. There will be some legacy schemas required, and there
>will be some mistakes in creating schemas, and there might even be cases
>where crude floating point is actually required, so there could be a
>representation to help in those cases. But since it will be used rarely
>and the amount of data encoded by it will be small, it should be the
>exception rather than the default. Moreover I think it's likely simply
>coercing floating point to the XML character representation would be
>simpler and perfectly adequate, especially for use cases where the only
>need for floating point support is to display it in human readable form.
>

Members of our group with extensive experience in the small/mobile devices realm do not agree with this assessment.  It is envisioned that EXI will enable connections between small/mobile devices and the rest of the web to be seamless with regard to data exchange.  In other words, it cannot be assumed that XML/EXI sent to small/mobile devices would be designed/intended for them exclusively.  


>> 4. Scalable Size
>>
>> EXI Float is a scalable representation that requires fewer bits for numbers
>> with less precision.
>
>I suspect this capability is really only needed for small devices, so
>the points above apply.
>

Scalable size is an effective approach to achieving compactness, and compactness is a "must have" property for the majority of use cases identified by the XBC working group.  A scalable floating point format is applicable to a wide range of use cases.
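
As a rough illustration of the scaling behavior, the Python sketch below estimates the encoded size of a finite decimal literal when it is stored as a mantissa and a base-10 exponent, each written as a sign bit followed by a variable-length unsigned integer with 7 value bits per octet. It assumes bit-packed alignment, and the function names are hypothetical helpers for this illustration, not part of any EXI implementation. Special values (INF, NaN) are not handled.

```python
from decimal import Decimal

IEEE_DOUBLE_BITS = 64  # fixed size of an IEEE 754 binary64 value


def unsigned_octets(n: int) -> int:
    """Octets needed for a variable-length unsigned integer (7 value bits per octet)."""
    return max(1, -(-n.bit_length() // 7))  # ceiling division, at least one octet


def integer_bits(n: int) -> int:
    """One sign bit plus the magnitude as a variable-length unsigned integer.

    Assumption: negative values store (|n| - 1) as the magnitude.
    """
    magnitude = n if n >= 0 else -n - 1
    return 1 + 8 * unsigned_octets(magnitude)


def exi_float_bits(lexical: str) -> int:
    """Estimated bits for mantissa + base-10 exponent of a finite decimal literal."""
    sign, digits, exponent = Decimal(lexical).normalize().as_tuple()
    mantissa = int("".join(map(str, digits)))
    if sign:
        mantissa = -mantissa
    return integer_bits(mantissa) + integer_bits(exponent)


if __name__ == "__main__":
    for literal in ("1", "3.14159", "3.1415926535897931"):
        print(literal, exi_float_bits(literal), "bits vs", IEEE_DOUBLE_BITS)
```

Under these assumptions a six-digit value such as 3.14159 comes to 34 bits, while a 17-digit value such as 3.1415926535897931 comes to 74 bits, against a fixed 64 bits for an IEEE double. The size tracks the precision actually present in the data.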


>> 5. Rounding issues
>
>There is a great deal to discuss here, some we have already covered in
>this thread and perhaps some new points that should be considered more
>carefully. But I think the efficiency issue is far more important and
>should be settled first.
>

The salient point is that rounding issues exist.  These would adversely affect interoperability with existing XML specifications and technologies.  Maintaining existing XML interoperability is a chartered goal for the EXI WG.
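
One kind of rounding issue can be sketched in Python. This is offered as an illustration only, not as the specific cases discussed earlier in this thread. A decimal literal that is valid in an XML document, such as 0.1, generally has no exact binary representation, so converting it to an IEEE single changes its value, whereas a mantissa and base-10 exponent pair carries it unchanged. The function names are hypothetical.

```python
import struct
from decimal import Decimal


def via_ieee_single(lexical: str) -> Decimal:
    """Exact decimal value of the binary32 obtained by storing the literal."""
    (stored,) = struct.unpack("<f", struct.pack("<f", float(lexical)))
    return Decimal(stored)


def via_mantissa_exponent(lexical: str) -> Decimal:
    """Value recovered from an integer mantissa and a base-10 exponent."""
    sign, digits, exponent = Decimal(lexical).as_tuple()
    mantissa = int("".join(map(str, digits))) * (-1 if sign else 1)
    return Decimal(mantissa).scaleb(exponent)


if __name__ == "__main__":
    literal = "0.1"
    print(via_ieee_single(literal) == Decimal(literal))        # False
    print(via_mantissa_exponent(literal) == Decimal(literal))  # True
```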


>To summarize, I think the task force has not so far had the benefit of
>sufficient sample data to accurately predict the performance of EXI with
>floating point data. I think that the use cases that need both floating
>point and the efficiency of EXI will be primarily machine to machine,
>and will ultimately produce by far the bulk of floating point data in
>EXI encoded documents. I think it's clear the overall processing speed
>for such use cases will be much better with an IEEE representation (and
>also clearly there can be no rounding problems), and I think it has not
>been demonstrated (and is unlikely that) either representation has any
>significant advantage in compactness for full precision data.
>
>In addition, I think that use cases (for small devices) that would
>require the current EXI representation are unusual, undesirable and
>mostly avoidable.
>
>Based on this, I still think the default representation of floating
>point data should be IEEE binary.
>
>Paul Pierce


Mike Cokus
The MITRE Corporation
757-896-8553; 757-826-8316 (fax)
903 Enterprise Parkway, Hampton, VA
Received on Friday, 20 August 2010 14:46:14 UTC