RE: Support of IEEE float; Canonical XML" from John Schneider on 2009-10-10 (public-exi-comments@w3.org from October 2009)

From: John Schneider <john.schneider@agiledelta.com>
Date: Fri, 9 Oct 2009 17:17:08 -0700
To: "'Paul Pierce'" <prp@teleport.com>, "'Taki Kamiya'" <tkamiya@us.fujitsu.com>, "'EXI Comments'" <public-exi-comments@w3.org>
Message-ID: <E94226837AB642618E42AA23FFC093E4@jcsdell8600>
Paul,

The EXI working group would like to take a bit more time to share some of
the rationale and test results that motivate the default EXI floating point
representation (i.e., EXI Float). I hope this information is helpful in
explaining the group's current position on this topic. 

I believe you understand this, but for the broader audience I'd like to make
it clear that the working group's position does not affect whether EXI
supports the full range of floating point numbers defined by XML Schema. EXI
does. It also does not affect whether you can use the IEEE floating point
format to represent values in an EXI stream. You can. This decision is about
which floating point representation is the best fit for the broadest set of
EXI use cases and is thus the best default representation for EXI.

With that background, here is a run-down of the primary drivers that
motivate the default EXI floating point representation.

1. Compactness

EXI Float is often more compact than IEEE. As part of our analysis, the EXI
working group tested the compactness of EXI Float vs. IEEE across our test
suite to see which representation provided better compactness for more EXI
use cases.

Several of the use cases don't use xsd:float or xsd:double values, so of the
94 test cases run, there were 18 cases where we saw a difference in
compactness. Below are the results for those 18 cases. The first column in
the table below identifies the test case. The second column shows the size
of the EXI stream in bytes when using the scalable EXI Float representation.
The third column shows the size of the EXI stream in bytes when using the
IEEE representation. The fourth column shows the difference in size between
the two (IEEE size - EXI size). A positive number in this column indicates
IEEE was larger. A negative number in this column indicates EXI was larger.

Test Case                                    EXI-Float    IEEE-Float    IEEE - EXI
-----------------------------------------    ---------    ----------    ----------
AVCL/telemCompTest10M.xml                    1,577,755     3,400,125     1,822,370
AVCL/telemCompTest1M.xml                       170,985       320,317       149,332
OpenOffice/OpenDocument-v1.0-os/content.xml  1,258,634     1,259,697         1,063
OpenOffice/OpenDocument-v1.0-os/styles.xml      10,228        10,231             3
SeismicData/seis.xml                         6,952,815     6,952,818             3
Google/google00.xml                              3,147         3,151             4
Google/google07.xml                              2,435         2,439             4
Google/google15.xml                              2,942         2,946             4
Google/google30.xml                              2,942         2,945             3
HepRep/HEP_G4Data0.heprep                       16,658        31,966        15,308
HepRep/HEP_G4Data1.heprep                       18,981        34,903        15,922
HepRep/HEP_G4Data2.heprep                       32,170        51,624        19,454
HepRep/HEP_G4Data3.heprep                      203,733       271,246        67,513
HepRep/HEP_G4Data4.heprep                    1,956,907     2,506,521       549,614
LocationSightings/castaway.xml                      17            19             2
LocationSightings/libby.xml                         17            16            -1
LocationSightings/robin.xml                         17            16            -1
LocationSightings/ruud.xml                          14            15             1

As shown in the table above, the EXI representation was more compact for 16
of the 18 test cases above. In the cases that made the most extensive use of
floating point numbers, the EXI representation was dramatically smaller
(10,000+, 100,000+ and even 1,000,000+ bytes smaller). In contrast, there
were only two cases where IEEE was smaller and in those cases it was only 1
byte smaller. 

Compactness is one of the most often cited requirements for EXI use cases
[1]. It improves bandwidth utilization, storage space and transfer speeds.
It is one of the most significant factors involved in improving mobile
device battery life [2][3]. And as Moore's law continues to outpace
bandwidth expansion, it is increasingly one of the more significant factors
involved in improving total system performance.  

2. Parsing Speed 

Parsing speed is another of the most cited requirements for EXI use cases.
The EXI working group tested parsing performance of EXI Float vs. IEEE
across its test suite to determine which one was faster for more use cases.
The results for the 18 test cases listed above are shown below measured in
transactions-per-second (TPS). The first column of the table below lists the
test case. The second column shows the parsing speed using EXI Float. The
third column shows the parsing speed using IEEE Float. The fourth column
shows the ratio of the EXI Float speed to IEEE float speed as a percentage.
A number greater than 100% in this column indicates a case where EXI was
faster. A number less than 100% indicates a case where IEEE was faster. 

Test Case                                    EXI-Float    IEEE-Float    EXI / IEEE (%)
-----------------------------------------    ---------    ----------    --------------
AVCL/telemCompTest10M.xml                         6.95          1.46              477%
AVCL/telemCompTest1M.xml                         65.07         12.35              527%
OpenOffice/OpenDocument-v1.0-os/content.xml      19.40         19.29              101%
OpenOffice/OpenDocument-v1.0-os/styles.xml    1,236.46      1,216.86              102%
SeismicData/seis.xml                              2.38          2.34              102%
Google/google00.xml                           7,337.40      6,963.24              105%
Google/google07.xml                           7,775.18      7,486.80              104%
Google/google15.xml                           7,605.39      7,173.54              106%
Google/google30.xml                           7,025.08      6,809.04              103%
HepRep/HEP_G4Data0.heprep                       790.59        127.73              619%
HepRep/HEP_G4Data1.heprep                       700.30        120.21              583%
HepRep/HEP_G4Data2.heprep                       452.17         89.45              505%
HepRep/HEP_G4Data3.heprep                        79.76         20.35              392%
HepRep/HEP_G4Data4.heprep                         8.35          2.37              352%
LocationSightings/castaway.xml               89,312.52     61,752.18              145%
LocationSightings/libby.xml                  76,178.17     56,826.29              134%
LocationSightings/robin.xml                  76,979.24     57,158.74              135%
LocationSightings/ruud.xml                   77,099.90     57,224.57              135%

As shown by the table above, the EXI representation was faster for every
test case. In the cases that made the most extensive use of floating point
numbers, EXI was roughly 3.5 to 6 times faster than IEEE.

This result was quite surprising to everyone at first, so we decided to dig
deeper. On further investigation, we found that faithfully translating the
base-2 IEEE format to a base-10 text format is very complicated and
time-consuming, especially for modern algorithms that take rounding accuracy,
appropriate precision, special cases, etc. into consideration.

In our tests, the base-2 to base-10 conversion was triggered by the use of
the SAX parser interface, which is a text-based interface. This cost can be
avoided for use cases that never require floating point numbers to be
converted to / from strings. However, if the floating point data is ever
displayed to a human, input by a human, converted to XML for
interoperability, converted to other text-based protocols (JSON, FIX, SWIFT,
EDI, etc.) or routed through any of the standard XML APIs (SAX, DOM, StAX,
etc.), it must be converted to text and you must incur the associated cost.
In addition, if the data is ever validated using XML Schema, transformed
using XSLT, secured using XML Security, etc., it must also be converted to
text. As such, we expect quite a few use cases would incur the cost above
when using IEEE.

3. Small Devices

One of the primary motivations for EXI is to expand the use of XML
technologies to a broader range of use cases and platforms, including
devices with limited processing power and computing resources. Many small
devices don't have built-in IEEE floating point support. Of course, they
don't have support for the EXI floating point representation either.
However, IEEE is a very complex format and implementing it on small devices
could easily exceed the code footprint a device can budget for EXI. 

In contrast, EXI Float is intentionally simple and requires very little
code footprint to implement on small devices. It was specifically designed
for this purpose and minimizes code footprint requirements by maximizing
reuse of other built-in EXI datatype representations.
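
As a rough illustration of why a decimal-based representation needs so little
code, here is a minimal Python sketch. It assumes (this message does not spell
it out) that an EXI Float value is carried as a pair of integers, a mantissa
and a base-10 exponent; the function name to_mantissa_exponent is
hypothetical and special values such as NaN and INF are not handled.

```python
from decimal import Decimal

def to_mantissa_exponent(text):
    """Split a decimal literal into (mantissa, base-10 exponent) integers.

    Hypothetical helper for illustration only; NaN and INF are not handled.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    mantissa = int("".join(map(str, digits)))
    if sign:
        mantissa = -mantissa
    # Drop trailing zeros so the mantissa is as small as possible.
    while mantissa != 0 and mantissa % 10 == 0:
        mantissa //= 10
        exponent += 1
    return mantissa, exponent

print(to_mantissa_exponent("0.1"))     # (1, -1)
print(to_mantissa_exponent("-12.50"))  # (-125, -1)
print(to_mantissa_exponent("1.5E3"))   # (15, 2)
```

Under that assumption, everything beyond the split is ordinary integer
handling, which an EXI processor needs anyway.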

4. Scalable Size

EXI Float is a scalable representation that requires fewer bits for numbers
with less precision. As such, users can adjust precision to achieve higher
levels of compactness where beneficial. For example, when serving an SVG
document to a workstation with a large display and a lot of bandwidth, the
server might use coordinates with a lot of precision. However, when serving
the same SVG document to a mobile device with limited bandwidth and a
smaller display that doesn't require this level of precision, the server
might reduce coordinate precision to achieve better compactness and save
more bandwidth.

IEEE is a fixed width floating point representation and does not provide
this level of flexibility.
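
To give a feel for how size scales with precision, here is a hedged Python
sketch. It assumes each value is stored as a mantissa and a base-10 exponent,
and that each of those integers costs one sign bit plus one octet per 7 bits
of magnitude. That cost model is a simplification chosen for illustration, and
the names integer_bits and float_bits are hypothetical.

```python
from decimal import Decimal

def integer_bits(n):
    """Approximate cost: 1 sign bit + 8 bits per 7-bit group of magnitude."""
    groups = max(1, (abs(n).bit_length() + 6) // 7)
    return 1 + 8 * groups

def float_bits(text):
    """Approximate size in bits of a decimal literal stored as two integers."""
    # normalize() strips trailing zeros (e.g. 12.50 becomes 12.5).
    _sign, digits, exponent = Decimal(text).normalize().as_tuple()
    mantissa = int("".join(map(str, digits)))
    return integer_bits(mantissa) + integer_bits(exponent)

print(float_bits("12.3"))        # 18
print(float_bits("12.3456789"))  # 42
```

Under this simplified model the low-precision coordinate fits in 18 bits,
while a fixed-width 32-bit or 64-bit value costs the same regardless of how
many digits are significant.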

5. Rounding Issues

Whenever IEEE floating point numbers are converted to / from text so they
can be displayed, input, converted to XML, routed through standard XML APIs,
etc., rounding issues can occur. If the floating point data comes from an
XML document, a human input device or a text-based API, it is not in general
possible for the IEEE format to represent or preserve the original data
accurately. Some of the EXI working group members have implementation
experience with previous binary XML formats that used the IEEE format and
have reported that IEEE rounding issues were often problematic for their
users. For example, when the IEEE format was converted back to text XML for
interoperability, it sometimes resulted in an explosion of decimal digits,
making the XML documents very large and unwieldy. 

As a simple example, the decimal number "0.1" cannot be represented
accurately in base 2. In base 2 it is the repeating fraction
1.100110011... * 2^-4. In the IEEE 754 32-bit binary format, the mantissa is
rounded to fit in 23 bits and stored as 10011001100110011001101. When
converted back to a decimal number, this becomes
"1.00000001490116119384765625E-1". Fortunately, the IEEE-to-String
conversion routines built into modern programming languages like Java and C#
are pretty good at ironing out these little wrinkles (with some
computational cost). However, this is not always the case. For example in
Java, Double.toString(2e23) returns "1.9999999999999998E23". And in
programming languages like C that don't use modern conversion routines, even
minor rounding errors are a common problem.
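
The "0.1" case can be checked with a short Python sketch that uses only the
standard library. It rounds 0.1 to IEEE 754 single precision, then prints the
23 stored fraction bits, the binary exponent and the exact decimal value of
the stored number.

```python
import struct
from decimal import Decimal

# Round 0.1 to the nearest IEEE 754 single-precision value and get its raw bits.
raw = struct.pack(">f", 0.1)
bits = struct.unpack(">I", raw)[0]

fraction = format(bits & 0x7FFFFF, "023b")  # the 23 stored fraction bits
exponent = ((bits >> 23) & 0xFF) - 127      # unbiased binary exponent

# Decimal(float) converts exactly, so this is the true value that was stored.
exact = Decimal(struct.unpack(">f", raw)[0])

print(fraction)  # 10011001100110011001101
print(exponent)  # -4
print(exact)     # 0.100000001490116119384765625
```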

The EXI floating point representation does not have this problem. It can
accurately represent and preserve any floating point number, regardless of
whether it comes from a base-10 text-based source or a base-2 typed IEEE
floating point source. 

Summary

So to summarize, the working group's testing and analysis have led us to
conclude that using IEEE floating point as the default representation in EXI
would negatively impact compactness for many/most EXI use cases and
negatively impact processing efficiency for many others. In addition, making
IEEE the default would make EXI impractical for many small devices and
introduce undesirable rounding effects for some use cases. 

That said, we do believe there will be EXI use cases that prefer the IEEE
format. For example, there will be use cases that have no requirements for
backward compatibility with XML, make no use of text-based APIs, perform no
text-based input/output of floating point numbers, prefer processing
efficiency over compactness and do not involve limited devices that don't
support IEEE. If use cases that meet these criteria also end up spending a
significant amount of their overall processing time parsing EXI floating
point data, they may want to replace the default EXI floating point
representation with the IEEE representation. EXI will allow them to do this.
However, we don't expect the majority of EXI use cases to fit into this
category.

I hope the data and analysis I've shared helps to clarify the EXI working
group's direction on this issue. I do understand that you probably have
specific use cases you care deeply about that fall into the category
described above. EXI's ability to use IEEE representation where needed
should make it a good solution for these use cases. However, making IEEE the
default for everyone would compromise our ability to ensure EXI works well
for the rest of the EXI use cases. At the end of the day, we need to balance
our desire to provide the best solution for your use cases with our mission
of creating a data format that works well for all the EXI use cases. 

	Very Respectfully,

	John 

[1] http://www.w3.org/TR/xbc-characterization/#N10105
[2] http://groups.csail.mit.edu/cag/scale/papers/compression-mobisys2003.pdf
[3] http://www.hiit.fi/files/fi/fc/papers/icws06-binary-security.pdf
  

> -----Original Message-----
> From: John Schneider [mailto:john.schneider@agiledelta.com] 
> Sent: Friday, July 24, 2009 10:54 AM
> To: 'Paul Pierce'; 'Taki Kamiya'; 'EXI Comments'
> Subject: RE: Support of IEEE float; Canonical XML"
> 
> Paul,
> 
> This is a personal response and doesn't represent the 
> position of the EXI working group. My company created the 
> Efficient XML technology selected as the basis of the EXI 
> standard, so I assume we are the "implementers" you are 
> referring to below. I know exactly what you're talking about 
> when you say "implementers" will often argue for the status 
> quo. I've seen this myself. More often than not, it is done 
> by a 900 pound gorilla that can throw its weight around to 
> get what it wants. For what it's worth, we are one of the 
> smallest companies in the working group and throwing our 
> weight around wouldn't get us very far. We might just 
> generate enough momentum to knock over a teacup. :-)
> 
> Your speculation that the implementer is arguing for the 
> status quo, while rational and informed, is incorrect in this 
> case. We have had Efficient XML implementations both with and 
> without support for IEEE floating point representation for 
> several years now. So, from an implementation standpoint, we 
> have no motivation to go one way or another on this issue. In 
> addition, if you look at the changes made to the EXI 
> specification since we first submitted it, you'll find far 
> more substantial changes than IEEE. Notable examples are 
> self-contained sub-trees, bounded string tables, byte-aligned 
> mode, strict mode and many more. This is clearly not the status quo.
> 
> In my experience, the EXI working group and the XBC working 
> group before it are motivated primarily by technical 
> arguments backed by concrete test results. They benchmark and 
> test everything before making decisions. They've run 
> benchmarks to test the impact of bounded integers, restricted 
> charsets, bounded-string table algorithms, a simplified all 
> group, etc. And they've run several tests and had lengthy 
> discussions about IEEE. Before the group ran tests and got 
> into the details, I think more than half of the working group 
> members favored IEEE. However, as the group began reviewing 
> the test results, they came to the consensus that the EXI 
> scalable floating point representation was a better fit for 
> most of the EXI use cases than the IEEE representation. 
> 
> To be perfectly clear, there are definitely some use cases 
> that will prefer IEEE floating point representation and EXI 
> will support the use of the IEEE representation for these 
> cases. The question is not whether you will be able to use 
> the IEEE floating point representation with EXI. You will. 
> The question is whether IEEE should be the default for all 
> use cases. W3C tests have shown that making IEEE the default 
> representation will negatively impact compactness for 
> many/most use cases and negatively impact processing 
> performance for many others. In addition, it would make it 
> very difficult for many small devices that don't have 
> built-in IEEE support to process EXI documents that used this 
> default. Implementing IEEE support on such devices would 
> require more code footprint than they can generally spare for EXI.
> 
> The EXI working group has taken a very deep dive on this 
> topic and has a very informed viewpoint on it. They have been 
> looking at it, testing it and analyzing it since before the 
> first draft of the EXI spec was published. They take your 
> comments and feedback very seriously and have discussed and 
> debated each one at length.  
> 
> I do understand that you were not part of the W3C analysis of 
> this topic and do not have the benefit of the associated 
> technical discussions and test results. So, it's not 
> completely fair to expect you to be in the same place as the 
> working group on this topic. I'll recommend the working group 
> share some of our test data on this topic so you can better 
> see where they are coming from. 
> 
> 	All the best,
> 
> 	John
> 
> 
> > -----Original Message-----
> > From: public-exi-comments-request@w3.org
> > [mailto:public-exi-comments-request@w3.org] On Behalf Of Paul Pierce
> > Sent: Wednesday, July 22, 2009 10:22 PM
> > To: Taki Kamiya; EXI Comments
> > Subject: "RE: Support of IEEE float; Canonical XML"
> > 
> > All,
> > 
> > The WG seems determined to take what I and apparently many others
> > feel is obviously the wrong direction on this issue.
> > This is always a danger when trying to make a standard based on an
> > existing implementation, since the implementers must properly be
> > closely involved but will usually advocate strongly for the status
> > quo. This happened in the first standards committee I was on long
> > ago; the implementers mostly had their way and, I think partly
> > because of that, the standard ultimately failed. I don't know if
> > that is what's happening here but the effect is the same.
> > 
> > I will try to summarize my arguments for IEEE floating point here
> > for reference. I would urge anyone else who agrees to add their
> > view at this time.
> > 
> > If the proposed recommendation is indeed in "last call", presumably
> > it will eventually come up for a vote. In the meantime, W3C rules
> > require that all comments be addressed, so to keep things going
> > despite the WG (and probably everyone else) being tired of the
> > matter, I'm also asking for documentation of the WG claims and
> > evaluation.
> > 
> > If it does come up for a vote, I must reluctantly urge everyone to 
> > vote against the recommendation in its current form.
> > 
> > 
> > 
> > From our discussion so far, changing to IEEE floating point
> > format would basically require two changes. First, the normal
> > representation of data identified as XML Schema float or double
> > would be IEEE 754 32-bit or 64-bit binary, respectively, in the
> > same bit order as n-bit integer. Second, when the
> > preserve-lexical-values option is set the data would be
> > represented in character form as in XML.
> > 
> > > Paul,
> > > 
> > > The WG has taken a comprehensive look at this issue.
> > > 
> > > EXI is a format that is for XML infoset, informed by schemas
> > > when they are available, as opposed to being schema-bound.
> > 
> > The particular case we are discussing here occurs specifically when 
> > EXI encoding is informed by XML Schema.
> > There is no impact on uninformed encoding.
> > 
> > > The goal is to serve as an efficient alternative encoding of an
> > > infoset, for users exchanging infosets.
> > 
> > EXI can be so much more than a mere efficient encoding of XML
> > text-based documents. That's why it's a good thing it's based on
> > encoding the infoset and not just gzipping the XML characters. This
> > means it's possible (when the preserve lexical values option is not
> > set) to focus on data values rather than their character encodings.
> > In the case of encoding data typed as XML Schema float or double,
> > the infoset data is specified (on purpose) such that data values
> > can be uniquely represented in IEEE 754 format.
> > 
> > > 
> > > Given that APIs that are in use today are based on text, and we
> > > do not expect the landscape to significantly change because of
> > > EXI, we believe it is logical for us to keep the schema-informed
> > > float value representation akin to text so that EXI float-to-text
> > > conversion requires minimal processing overhead.
> > 
> > I disagree that the landscape will not change because of EXI; that
> > would be a clear indication that EXI has failed. But more
> > important, there are significant APIs in use today that have binary
> > interfaces. Two classes of such APIs are web services (e.g. the new
> > SOAP-JMS binding) and the many object binding systems for XML (e.g.
> > XMLBeans).
> > 
> > > 
> > > With that in mind and also considering the better compactness in
> > > size and amenability to round-trip with XML that it provides, it
> > > is the consensus of the WG that it should benefit the majority of
> > > users and therefore is the best way to go.
> > 
> > Is the supposed size advantage tested and documented? I find it
> > difficult to believe that there would be a size advantage except
> > for human-generated values and sparse data. For machine-generated
> > values expressed at full precision (always the likely default),
> > where EXI would be most useful because of the large quantity of
> > data, it seems unlikely that conversion from binary to a
> > decimal-based representation could be anything but wasteful. For
> > sparse data containing large quantities of simple values such as
> > 0.0, the proposed format might be more efficient but the
> > compression option will eliminate most of the difference.
> > 
> > There is little if any net advantage with the current proposal in
> > round-trip to XML; in fact, it will likely turn out to have a net
> > disadvantage compared to IEEE floating point. Both the current spec
> > and the IEEE option will convert accurately to and from character
> > data for XML, given a high quality implementation. But for IEEE
> > floating point, quality implementations already exist and can be
> > leveraged from existing native language libraries. These are
> > already tuned for performance, which cannot be expected of EXI
> > implementations that must be created specifically to conform to a
> > representation that is used nowhere else, even if it is simpler.
> > Also, because EXI will be used where XML is too inefficient, the
> > round-trip case will be used only in special situations such as
> > debugging, so only accuracy matters - not performance.
> > 
> > What will best serve the majority of users should be indicated from 
> > evaluating the alternatives against the use cases for quantifiable 
> > differences, and against known best practices for intangibles.
> > 
> > The purported advantages of the current proposal:
> > 
> > 1. More compact
> > 
> > 2. Faster round-trip to XML
> > 
> > 3. Otherwise better in round-trip to XML (accuracy is the only
> > aspect we've discussed.)
> > 
> > 4. Faster generation/parsing with text-based APIs.
> > 
> > 5. Otherwise better with text-based APIs.
> > 
> > I would argue that an IEEE 754 representation would have at least 
> > these advantages, in addition to coopting all the above:
> > 
> > 5. It's a standard.
> > 
> > 6. It's the native representation on almost all computers.
> > 
> > 7. Faster with binary APIs.
> > 
> > 8. Otherwise better with binary APIs.
> > 
> > Items 1, 2, 4 and 7 are quantitative and should have been measured. 
> > Items 2 and 4 and items 3 and 5 are mechanically the same but would 
> > have different weights in the final evaluation.
> > 
> > In combining these items to reach a final conclusion, I give much
> > higher weight to 7 and 8 (binary APIs) than 4 and 5 (text APIs),
> > and no weight to 2 (XML), because EXI is needed where efficiency
> > matters and, if successful, will be used most heavily where the
> > entire path is most efficient.
> > 
> > 
> > > 
> > > We thank you for your insight on this issue, and express our
> > > appreciation for your verve in the whole discussion on this and
> > > related topics.
> > > 
> > > Regards,
> > > 
> > > Taki Kamiya (for the EXI Working Group)
> > 
> > I'm grateful for the opportunity to participate and hope to make
> > what little contribution I can to the ultimate success of EXI.
> > 
> > Paul Pierce
> > 
>
Received on Saturday, 10 October 2009 00:17:54 UTC