"RE: Support of IEEE float; Canonical XML"

 John Schneider wrote, in part -
> ...
> With that background, here is a run-down of the primary drivers that
> motivate the default EXI floating point representation.
> 
> 1. Compactness
> 
> EXI Float is often more compact than IEEE. As part of our analysis, the EXI
> working group tested the compactness of EXI Float vs. IEEE across our test
> suite to get an idea which representation provided the best compactness for
> more EXI use cases.

I suspected that the test cases do not present floating point numbers at full precision
(see http://lists.w3.org/Archives/Public/public-exi-comments/2009Jul/0004.html )
so I did a quick analysis of the number of significant digits used in the test cases. The only two interesting sets are AVCL and HepRep, which were also the ones showing the greatest advantage for the EXI representation. In both sets the floating point datatype is xsd:double, but the data is presented to an average of only 6.4 digits of precision. Given that IEEE double carries roughly 16 decimal digits of precision, it's clear these sample data sets are not good representatives of the kind of massive machine-to-machine data transmission and storage for which EXI should be well suited.
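
A digit count of this kind can be sketched as follows. This is only an illustration, not the script behind the 6.4 figure above; the function names are hypothetical, and it assumes that leading and trailing zeros are not significant and that INF and NaN values are skipped.

```python
from decimal import Decimal


def significant_digits(lexical: str) -> int:
    """Count significant decimal digits in one xsd:double lexical value.

    Assumption: trailing zeros are not significant, so "1.50" counts as 2.
    Non-finite values (INF, -INF, NaN) count as 0.
    """
    value = Decimal(lexical.strip())
    if not value.is_finite():
        return 0
    # normalize() drops trailing zeros; Decimal never stores leading zeros.
    return len(value.normalize().as_tuple().digits)


def mean_significant_digits(lexicals: list[str]) -> float:
    """Average digit count over the finite values in a sample."""
    counts = [significant_digits(s) for s in lexicals]
    finite = [c for c in counts if c > 0]
    return sum(finite) / len(finite) if finite else 0.0
```

Running mean_significant_digits over every xsd:double value extracted from a test document gives a figure that can be compared against the roughly 16 digits an IEEE double can carry.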

The real problem, of course, is that there are only two sets of sample data that stress floating point at all. This doesn't provide enough information to support a decision one way or the other.

> 
> 2. Parsing Speed 
> 
> ...  However, if the floating point data is ever displayed to a human,
> input by a human, converted to XML for interoperability, converted to other
> text-based protocols (JSON, FIX, SWIFT, EDI, etc.), routed through any of
> the standard XML APIs (SAX, DOM, StAX, etc.), etc., it must be converted to
> text and you must incur the associated cost. In addition, if the data is
> ever validated using XML Schema, transformed using XSLT, secured using XML
> Security, etc. it must also be converted to text. As such, we expect quite a
> few use cases would incur the cost above when using IEEE.

The whole point of EXI is to handle data efficiently, which means that, if successful, there will be a lot of data.

I think the 80/20 (or 90/10 or ...) rule applies here, especially since EXI is supposed to break new ground outside the middle ground covered by XML. 80% of documents are small, but carry only 20% of the total data; 20% of documents are huge and carry 80% of the data. It's a rule of thumb that often applies to, e.g., message traffic. The majority of data will be carried in huge documents that by their nature must never and will never be translated to text. So, except for debugging small test cases, none of the translations above apply except the last three, which are important parts of XML automation. For huge documents, XML Schema validation and XSLT transformation must be done using carefully crafted processors in any case or they will take forever. Because of the nature of EXI, XML Schema validation should be easy and quick without translating anything into text. We have already discussed XML Security: best practice means securing the binary representation instead of translating it into text, so that's not a problem.

On the other hand, the existing default EXI floating point representation means that there is a huge performance penalty associated with massive quantities of floating point data. Because the EXI representation is decimal based, every machine-generated number has to go through an expensive translation step in both directions, binary to decimal on encoding and decimal to binary on decoding, before it gets into a machine that wants to read it. This is a recognized concern
(see http://lists.w3.org/Archives/Public/public-exi-comments/2009Oct/0001.html .)
Because most of the floating point data encoded with EXI will be carried in huge documents where none of the "small devices" concerns below apply, the default representation should be IEEE binary to avoid having to specify that explicitly.
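
The difference between the two paths can be sketched in Python. This is a simplified illustration under stated assumptions, not an EXI codec: the decimal side is modelled as a plain (mantissa, base-10 exponent) integer pair, all function names are hypothetical, and only finite values are handled (the sign of -0.0 is lost).

```python
import struct
from decimal import Decimal


def encode_binary(x: float) -> bytes:
    """IEEE path: copy the 8 bytes of the double, big-endian."""
    return struct.pack(">d", x)


def decode_binary(data: bytes) -> float:
    """IEEE path: reinterpret 8 bytes as a double, no arithmetic needed."""
    return struct.unpack(">d", data)[0]


def to_decimal_pair(x: float) -> tuple[int, int]:
    """Decimal path: find the shortest decimal that round-trips x.

    repr() performs the binary-to-decimal conversion; the result is
    returned as (mantissa, exponent) with x == mantissa * 10**exponent.
    """
    sign, digits, exponent = Decimal(repr(x)).normalize().as_tuple()
    mantissa = int("".join(map(str, digits)))
    return (-mantissa if sign else mantissa, exponent)


def from_decimal_pair(mantissa: int, exponent: int) -> float:
    """Decimal path: correctly rounded decimal-to-binary conversion."""
    return float(f"{mantissa}e{exponent}")
```

Both paths return the original value for finite doubles, but the binary path is a byte copy while the decimal path runs a full base conversion in each direction. Timing the two round trips on representative data (for example with timeit) would quantify the gap; no measurements are claimed here.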

> 
> 3. Small Devices
> 
> One of the primary motivations for EXI is to expand the use of XML
> technologies to a broader range of use cases and platforms, including
> devices with limited processing power and computing resources.

Three points:
 - Devices small enough that they don't like IEEE floating point are unlikely to like floating point at all.
 - I suspect that in the vast majority of these use cases, the numbers are geometrically "near" 1.0 in magnitude. They don't need (or want, see point 1) exponents and should be type xsd:decimal or equivalent, not xsd:float.
 - These are new applications, so they usually will not have to worry about legacy schemas. Since the schemas are new, they can specify the datatype correctly to begin with and won't use floating point types except by mistake.

These points do not mean that the EXI floating point representation is not desirable. There will be some legacy schemas required, and there will be some mistakes in creating schemas, and there might even be cases where crude floating point is actually required, so there could be a representation to help in those cases. But since it will be used rarely and the amount of data encoded by it will be small, it should be the exception rather than the default. Moreover, I think it's likely that simply coercing floating point to the XML character representation would be simpler and perfectly adequate, especially for use cases where the only need for floating point support is to display it in human-readable form.

> 4. Scalable Size
> 
> EXI Float is a scalable representation that requires fewer bits for numbers
> with less precision.

I suspect this capability is really only needed for small devices, so the points above apply.
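
How the size of a decimal representation scales with precision can be estimated roughly as below. This is a sketch under stated assumptions: it counts only the raw sign and magnitude bits of a (mantissa, base-10 exponent) integer pair and ignores whatever framing the actual EXI integer encoding adds. The function name is hypothetical and only finite values are handled.

```python
from decimal import Decimal


def decimal_pair_bits(lexical: str) -> int:
    """Rough bit count for a finite value as (mantissa, base-10 exponent).

    Counts one sign bit plus the magnitude bits for each integer;
    any per-integer framing overhead is deliberately left out.
    """
    _sign, digits, exponent = Decimal(lexical).normalize().as_tuple()
    mantissa = int("".join(map(str, digits)))
    mantissa_bits = 1 + max(mantissa.bit_length(), 1)
    exponent_bits = 1 + max(abs(exponent).bit_length(), 1)
    return mantissa_bits + exponent_bits
```

Under these assumptions a short value such as "1.5" needs only a handful of bits, while a full precision value such as "0.3333333333333333" needs nearly as many bits as the 64 of an IEEE double before any framing is added.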

> 5. Rounding issues 

There is a great deal to discuss here: some points we have already covered in this thread, and perhaps some new ones that should be considered more carefully. But I think the efficiency issue is far more important and should be settled first.


To summarize, I think the task force has not so far had the benefit of sufficient sample data to accurately predict the performance of EXI with floating point data. I think that the use cases that need both floating point and the efficiency of EXI will be primarily machine-to-machine, and will ultimately produce by far the bulk of floating point data in EXI encoded documents. I think it's clear the overall processing speed for such use cases will be much better with an IEEE representation (and clearly there can be no rounding problems either), and I think it has not been demonstrated, and is unlikely, that either representation has any significant advantage in compactness for full precision data.

In addition, I think that use cases (for small devices) that would require the current EXI representation are unusual, undesirable and mostly avoidable.

Based on this, I still think the default representation of floating point data should be IEEE binary.

Paul Pierce

Received on Friday, 29 January 2010 19:02:51 UTC