- From: John Schneider <john.schneider@agiledelta.com>
- Date: Fri, 9 Oct 2009 17:17:08 -0700
- To: "'Paul Pierce'" <prp@teleport.com>, "'Taki Kamiya'" <tkamiya@us.fujitsu.com>, "'EXI Comments'" <public-exi-comments@w3.org>
Paul,

The EXI working group would like to take a bit more time to share some of the rationale and test results that motivate the default EXI floating point representation (i.e., EXI Float). I hope this information is helpful in explaining the group's current position on this topic.

I believe you understand this, but for the broader audience I'd like to make it clear that the working group's position does not affect whether EXI supports the full range of floating point numbers defined by XML Schema. EXI does. It also does not affect whether you can use the IEEE floating point format to represent values in an EXI stream. You can. This decision is about which floating point representation is the best fit for the broadest set of EXI use cases and thus is the best default representation for EXI.

With that background, here is a run-down of the primary drivers that motivate the default EXI floating point representation.

1. Compactness

EXI Float is often more compact than IEEE. As part of our analysis, the EXI working group tested the compactness of EXI Float vs. IEEE across our test suite to get an idea of which representation provided the best compactness for more EXI use cases. Several of the use cases don't use xsd:float or xsd:double values, so of the 94 test cases run, there were 18 cases where we saw a difference in compactness. Below are the results for those 18 cases.

The first column in the table below identifies the test case. The second column shows the size of the EXI stream in bytes when using the scalable EXI Float representation. The third column shows the size of the EXI stream in bytes when using the IEEE representation. The fourth column shows the difference in size between the two (IEEE size - EXI size). A positive number in this column indicates IEEE was larger. A negative number in this column indicates EXI was larger.

Test Case                                    EXI-Float  IEEE-Float  IEEE - EXI
-------------------------------------------  ---------  ----------  ----------
AVCL/telemCompTest10M.xml                    1,577,755   3,400,125   1,822,370
AVCL/telemCompTest1M.xml                       170,985     320,317     149,332
OpenOffice/OpenDocument-v1.0-os/content.xml  1,258,634   1,259,697       1,063
OpenOffice/OpenDocument-v1.0-os/styles.xml      10,228      10,231           3
SeismicData/seis.xml                         6,952,815   6,952,818           3
Google/google00.xml                              3,147       3,151           4
Google/google07.xml                              2,435       2,439           4
Google/google15.xml                              2,942       2,946           4
Google/google30.xml                              2,942       2,945           3
HepRep/HEP_G4Data0.heprep                       16,658      31,966      15,308
HepRep/HEP_G4Data1.heprep                       18,981      34,903      15,922
HepRep/HEP_G4Data2.heprep                       32,170      51,624      19,454
HepRep/HEP_G4Data3.heprep                      203,733     271,246      67,513
HepRep/HEP_G4Data4.heprep                    1,956,907   2,506,521     549,614
LocationSightings/castaway.xml                      17          19           2
LocationSightings/libby.xml                         17          16          -1
LocationSightings/robin.xml                         17          16          -1
LocationSightings/ruud.xml                          14          15           1

As shown in the table above, the EXI representation was more compact for 16 of the 18 test cases above. In the cases that made the most extensive use of floating point numbers, the EXI representation was dramatically smaller (10,000+, 100,000+ and even 1,000,000+ bytes smaller). In contrast, there were only two cases where IEEE was smaller and in those cases it was only 1 byte smaller.

Compactness is one of the most often cited requirements for EXI use cases [1]. It improves bandwidth utilization, storage space and transfer speeds. It is one of the most significant factors involved in improving mobile device battery life [2][3]. And as Moore's law continues to outpace bandwidth expansion, it is increasingly one of the more significant factors involved in improving total system performance.

2. Parsing Speed

Parsing speed is another of the most cited requirements for EXI use cases. The EXI working group tested parsing performance of EXI Float vs. IEEE across its test suite to determine which one was faster for more use cases.

The results for the 18 test cases listed above are shown below, measured in transactions per second (TPS). The first column of the table below lists the test case. The second column shows the parsing speed using EXI Float. The third column shows the parsing speed using IEEE Float. The fourth column shows the ratio of the EXI Float speed to IEEE Float speed as a percentage. A number greater than 100% in this column indicates a case where EXI was faster. A number less than 100% indicates a case where IEEE was faster.

Test Case                                    EXI-Float  IEEE-Float  EXI / IEEE (%)
-------------------------------------------  ---------  ----------  --------------
AVCL/telemCompTest10M.xml                         6.95        1.46            477%
AVCL/telemCompTest1M.xml                         65.07       12.35            527%
OpenOffice/OpenDocument-v1.0-os/content.xml      19.40       19.29            101%
OpenOffice/OpenDocument-v1.0-os/styles.xml    1,236.46    1,216.86            102%
SeismicData/seis.xml                              2.38        2.34            102%
Google/google00.xml                           7,337.40    6,963.24            105%
Google/google07.xml                           7,775.18    7,486.80            104%
Google/google15.xml                           7,605.39    7,173.54            106%
Google/google30.xml                           7,025.08    6,809.04            103%
HepRep/HEP_G4Data0.heprep                       790.59      127.73            619%
HepRep/HEP_G4Data1.heprep                       700.30      120.21            583%
HepRep/HEP_G4Data2.heprep                       452.17       89.45            505%
HepRep/HEP_G4Data3.heprep                        79.76       20.35            392%
HepRep/HEP_G4Data4.heprep                         8.35        2.37            352%
LocationSightings/castaway.xml               89,312.52   61,752.18            145%
LocationSightings/libby.xml                  76,178.17   56,826.29            134%
LocationSightings/robin.xml                  76,979.24   57,158.74            135%
LocationSightings/ruud.xml                   77,099.90   57,224.57            135%

As shown by the table above, the EXI representation was faster for every test case. In the cases that made the most extensive use of floating point numbers, EXI was 3 to 6 times faster than IEEE. This result was quite surprising to everyone at first so we decided to dig deeper.

On further investigation, we found that faithfully translating the base-2 IEEE format to a base-10 text format is very complicated and time consuming, especially for modern algorithms that take rounding accuracy, appropriate precision, special cases, etc. into consideration. The base-2 to base-10 conversion was caused by the use of the SAX parser interface, which is a text-based interface.

This cost can be avoided for use cases that never require floating point numbers to be converted to / from strings. However, if the floating point data is ever displayed to a human, input by a human, converted to XML for interoperability, converted to other text-based protocols (JSON, FIX, SWIFT, EDI, etc.), routed through any of the standard XML APIs (SAX, DOM, StAX, etc.), etc., it must be converted to text and you must incur the associated cost. In addition, if the data is ever validated using XML Schema, transformed using XSLT, secured using XML Security, etc. it must also be converted to text. As such, we expect quite a few use cases would incur the cost above when using IEEE.

3. Small Devices

One of the primary motivations for EXI is to expand the use of XML technologies to a broader range of use cases and platforms, including devices with limited processing power and computing resources. Many small devices don't have built-in IEEE floating point support. Of course, they don't have support for the EXI floating point representation either. However, IEEE is a very complex format and implementing it on small devices could easily exceed the code footprint a device can budget for EXI. In contrast, the EXI float is intentionally simple and requires very little code-footprint to implement on small devices. It was specifically designed for this purpose and minimizes code footprint requirements by maximizing reuse of other built-in EXI datatype representations.

4. Scalable Size

EXI Float is a scalable representation that requires fewer bits for numbers with less precision.
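
To make the scaling claim concrete, here is a minimal Python sketch. It rests on assumptions not stated in this message: that a float is carried as an integer mantissa plus a base-10 exponent (as the EXI specification describes), and that each integer is written in 7-bit groups. The helper names and the size estimate are illustrative only; the estimate ignores sign bits, special values (INF, NaN) and bit alignment, so it is not the exact EXI layout.

```python
from decimal import Decimal

def to_mantissa_exponent(text):
    """Split a finite decimal literal into (integer mantissa, base-10 exponent)."""
    d = Decimal(text).normalize()            # drop trailing zeros: 12.50 -> 12.5
    sign, digits, exponent = d.as_tuple()
    mantissa = int("".join(map(str, digits)))
    return (-mantissa if sign else mantissa), exponent

def octets_7bit(n):
    """Octets needed for abs(n) when each octet carries 7 payload bits."""
    return max(1, -(-abs(n).bit_length() // 7))

def rough_size(text):
    """Rough payload estimate in octets (assumption: no sign bits, no alignment)."""
    mantissa, exponent = to_mantissa_exponent(text)
    return octets_7bit(mantissa) + octets_7bit(exponent)

for literal in ("0.1", "12.5", "3.14159"):
    print(literal, to_mantissa_exponent(literal), rough_size(literal))
```

Under these assumptions the estimate grows with the number of significant digits, whereas an IEEE single or double always occupies 4 or 8 bytes.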
As such, users can adjust precision to achieve higher levels of compactness where beneficial. For example, when serving an SVG document to a workstation with a large display and a lot of bandwidth, the server might use coordinates with a lot of precision. However, when serving the same SVG document to a mobile device with limited bandwidth and a smaller display that doesn't require this level of precision, the server might reduce coordinate precision to achieve better compactness and save more bandwidth. IEEE is a fixed width floating point representation and does not provide this level of flexibility.

5. Rounding Issues

Whenever IEEE floating point numbers are converted to / from text so they can be displayed, input, converted to XML, routed through standard XML APIs, etc., rounding issues can occur. If the floating point data comes from an XML document, a human input device or a text-based API, it is not in general possible for the IEEE format to represent or preserve the original data accurately.

Some of the EXI working group members have implementation experience with previous binary XML formats that used the IEEE format and have reported that IEEE rounding issues were often problematic for their users. For example, when the IEEE format was converted back to text XML for interoperability, it sometimes resulted in an explosion of decimal digits, making the XML documents very large and unwieldy.

As a simple example, the decimal number "0.1" cannot be represented accurately in base 2. It is represented as 1.100110011... * 2^-4 in base-2. In the IEEE 754 32-bit binary format, the mantissa gets rounded and stored in 23 bits as 10011001100110011001101. When converted back to a decimal number, this becomes "1.00000001490116119384765625E-1". Fortunately, the IEEE-to-String conversion routines built into modern programming languages like Java and C# are pretty good at ironing out these little wrinkles (with some computational cost). However, this is not always the case.
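
The 0.1 example can be checked with a short Python sketch. The function names are hypothetical; the sketch uses only the standard struct and decimal modules to round 0.1 to IEEE 754 binary32, pull out the stored fields, and print the exact decimal value of what was stored.

```python
import struct
from decimal import Decimal

def float32_fields(value):
    """Return (sign, biased exponent, 23-bit fraction string) of value as binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF           # biased by 127
    fraction = format(bits & 0x7FFFFF, "023b")
    return sign, exponent, fraction

def float32_exact_decimal(value):
    """Exact decimal expansion of the binary32 number nearest to value."""
    (rounded,) = struct.unpack(">f", struct.pack(">f", value))
    return Decimal(rounded)                  # Decimal(float) is exact, no rounding

print(float32_fields(0.1))
print(float32_exact_decimal(0.1))
```

The second line printed is 0.100000001490116119384765625, the same digits as the value quoted above; the biased exponent 123 corresponds to 2^-4.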
For example, in Java, Double.toString(2e23) returns "1.9999999999999998E23". And in programming languages like C that don't use modern conversion routines, even minor rounding errors are a common problem.

The EXI floating point representation does not have this problem. It can accurately represent and preserve any floating point number, regardless of whether it comes from a base-10 text-based source or a base-2 typed IEEE floating point source.

Summary

So to summarize, the working group's testing and analysis have led us to conclude that using IEEE floating point as the default representation in EXI would negatively impact compactness for many/most EXI use cases and negatively impact processing efficiency for many others. In addition, making IEEE the default would make EXI impractical for many small devices and introduce undesirable rounding effects for some use cases.

That said, we do believe there will be EXI use cases that prefer the IEEE format. For example, there will be use cases that have no requirements for backward compatibility with XML, make no use of text-based APIs, perform no text-based input/output of floating point numbers, prefer processing efficiency over compactness and do not involve limited devices that don't support IEEE. If use cases that meet these criteria also end up spending a significant amount of their overall processing time parsing EXI floating point data, they may want to replace the default EXI floating point representation with the IEEE representation. EXI will allow them to do this. However, we don't expect the majority of EXI use cases to fit into this category.

I hope the data and analysis I've shared helps to clarify the EXI working group's direction on this issue. I do understand that you probably have specific use cases you care deeply about that fall into the category described above. EXI's ability to use IEEE representation where needed should make it a good solution for these use cases.
However, making IEEE the default for everyone would compromise our ability to ensure EXI works well for the rest of the EXI use cases. At the end of the day, we need to balance our desire to provide the best solution for your use cases with our mission of creating a data format that works well for all the EXI use cases.

Very Respectfully,

John

[1] http://www.w3.org/TR/xbc-characterization/#N10105
[2] http://groups.csail.mit.edu/cag/scale/papers/compression-mobisys2003.pdf
[3] http://www.hiit.fi/files/fi/fc/papers/icws06-binary-security.pdf

> -----Original Message-----
> From: John Schneider [mailto:john.schneider@agiledelta.com]
> Sent: Friday, July 24, 2009 10:54 AM
> To: 'Paul Pierce'; 'Taki Kamiya'; 'EXI Comments'
> Subject: RE: Support of IEEE float; Canonical XML"
>
> Paul,
>
> This is a personal response and doesn't represent the position of the
> EXI working group. My company created the Efficient XML technology
> selected as the basis of the EXI standard, so I assume we are the
> "implementers" you are referring to below. I know exactly what you're
> talking about when you say "implementers" will often argue for the
> status quo. I've seen this myself. More often than not, it is done by
> a 900 pound gorilla that can throw its weight around to get what it
> wants. For what it's worth, we are one of the smallest companies in
> the working group and throwing our weight around wouldn't get us very
> far. We might just generate enough momentum to knock over a teacup. :-)
>
> Your speculation that the implementer is arguing for the status quo,
> while rational and informed, is incorrect in this case. We have had
> Efficient XML implementations both with and without support for IEEE
> floating point representation for several years now. So, from an
> implementation standpoint, we have no motivation to go one way or
> another on this issue.
> In addition, if you look at the changes made to the EXI specification
> since we first submitted it, you'll find far more substantial changes
> than IEEE. Notable examples are self-contained sub-trees, bounded
> string tables, byte-aligned mode, strict mode and many more. This is
> clearly not the status quo.
>
> In my experience, the EXI working group and the XBC working group
> before it are motivated primarily by technical arguments backed by
> concrete test results. They benchmark and test everything before
> making decisions. They've run benchmarks to test the impact of bounded
> integers, restricted charsets, bounded-string table algorithms, a
> simplified all group, etc. And they've run several tests and had
> lengthy discussions about IEEE. Before the group ran tests and got
> into the details, I think more than half of the working group members
> favored IEEE. However, as the group began reviewing the test results,
> they came to the consensus that the EXI scalable floating point
> representation was a better fit for most of the EXI use cases than the
> IEEE representation.
>
> To be perfectly clear, there are definitely some use cases that will
> prefer IEEE floating point representation and EXI will support the use
> of the IEEE representation for these cases. The question is not
> whether you will be able to use the IEEE floating point representation
> with EXI. You will. The question is whether IEEE should be the default
> for all use cases. W3C tests have shown that making IEEE the default
> representation will negatively impact compactness for many/most use
> cases and negatively impact processing performance for many others. In
> addition, it would make it very difficult for many small devices that
> don't have built-in IEEE support to process EXI documents that used
> this default. Implementing IEEE support on such devices would require
> more code footprint than they can generally spare for EXI.
>
> The EXI working group has taken a very deep dive on this topic and has
> a very informed viewpoint on it. They have been looking at it, testing
> it and analyzing it since before the first draft of the EXI spec was
> published. They take your comments and feedback very seriously and
> have discussed and debated each one at length.
>
> I do understand that you were not part of the W3C analysis of this
> topic and do not have the benefit of the associated technical
> discussions and test results. So, it's not completely fair to expect
> you to be in the same place as the working group on this topic. I'll
> recommend the working group share some of our test data on this topic
> so you can better see where they are coming from.
>
> All the best,
>
> John
>
> > -----Original Message-----
> > From: public-exi-comments-request@w3.org
> > [mailto:public-exi-comments-request@w3.org] On Behalf Of Paul Pierce
> > Sent: Wednesday, July 22, 2009 10:22 PM
> > To: Taki Kamiya; EXI Comments
> > Subject: "RE: Support of IEEE float; Canonical XML"
> >
> > All,
> >
> > The WG seems determined to take what I and apparently many others
> > feel is obviously the wrong direction on this issue. This is always
> > a danger when trying to make a standard based on an existing
> > implementation, since the implementors must properly be closely
> > involved but will usually advocate strongly for the status quo. This
> > happened in the first standards committee I was on long ago, the
> > implementers mostly had their way and, I think partly because of
> > that, the standard ultimately failed. I don't know if that is what's
> > happening here but the effect is the same.
> >
> > I will try to summarize my arguments for IEEE floating point here
> > for reference. I would urge anyone else who agrees to add their view
> > at this time.
> >
> > If the proposed recommendation is indeed in "last call", presumably
> > it will eventually come up for a vote.
> > In the meantime, W3C rules require that all comments be addressed so
> > to keep things going despite the WG (and probably everyone else)
> > being tired of the matter I'm also asking for documentation of the
> > WG claims and evaluation.
> >
> > If it does come up for a vote, I must reluctantly urge everyone to
> > vote against the recommendation in its current form.
> >
> > From our discussion so far, changing to IEEE floating point format
> > would basically require two changes. First, the normal
> > representation of data identified as XML Schema float or double
> > would be IEEE 754 32-bit or 64-bit binary, respectively, in the same
> > bit order as n-bit integer. Second, when the preserve-lexical-values
> > option is set the data would be represented in character form as in
> > XML.
> >
> > > Paul,
> > >
> > > The WG has taken a comprehensive look at this issue.
> > >
> > > EXI is a format that is for XML infoset, informed by schemas when
> > > they are available, as opposed to being schema-bound.
> >
> > The particular case we are discussing here occurs specifically when
> > EXI encoding is informed by XML Schema. There is no impact on
> > uninformed encoding.
> >
> > > The goal is to serve as an efficient alternative encoding of an
> > > infoset, for users exchanging infosets.
> >
> > EXI can be so much more than a mere efficient encoding of XML
> > text-based documents. That's why it's a good thing it's based on
> > encoding the infoset and not just gzipping the XML characters. This
> > means it's possible (when the preserve lexical values option is not
> > set) to focus on data values rather than their character encodings.
> > In the case of encoding data typed as XML Schema float or double,
> > the infoset data is specified (on purpose) such that data values can
> > be uniquely represented in IEEE 754 format.
> >
> > > Given that APIs that are in use today are based on text, and we do
> > > not expect the landscape to significantly change because of EXI,
> > > we believe it is logical for us to keep the schema-informed float
> > > value representation akin to text so that EXI float-to-text
> > > conversion requires minimal processing overhead.
> >
> > I disagree that the landscape will not change because of EXI; that
> > would be a clear indication that EXI has failed. But more important,
> > there are significant APIs in use today that have binary interfaces.
> > Two classes of such APIs are web services (e.g. the new SOAP-JMS
> > binding) and the many object binding systems for XML (e.g.
> > XMLBeans).
> >
> > > With that in mind and also considering the better compactness in
> > > size and amenability to round-trip with XML that it provides, it
> > > is the consensus of the WG that it should benefit the majority of
> > > users and therefore is the best way to go with.
> >
> > Is the supposed size advantage tested and documented? I find it
> > difficult to believe that there would be a size advantage except for
> > human-generated values and sparse data. For machine-generated values
> > expressed at full precision (always the likely default), where EXI
> > would be most useful because of the large quantity of data, it seems
> > unlikely that conversion from binary to a decimal-based
> > representation could be anything but wasteful. For sparse data
> > containing large quantities of simple values such as 0.0, the
> > proposed format might be more efficient but the compression option
> > will eliminate most of the difference.
> >
> > There is little if any net advantage with the current proposal in
> > round-trip to XML, in fact, it will likely turn out to have a net
> > disadvantage compared to IEEE floating point.
> > Both the current spec and the IEEE option will convert accurately to
> > and from character data for XML, given a high quality
> > implementation. But for IEEE floating point, quality implementations
> > already exist and can be leveraged from existing native language
> > libraries. These are already tuned for performance, which cannot be
> > expected of EXI implementations that must be created specifically to
> > conform to a representation that is used nowhere else, even if it is
> > simpler. Also, because EXI will be used where XML is too
> > inefficient, the round-trip case will be used only in special
> > situations such as debugging, so only accuracy matters - not
> > performance.
> >
> > What will best serve the majority of users should be indicated from
> > evaluating the alternatives against the use cases for quantifiable
> > differences, and against known best practices for intangibles.
> >
> > The purported advantages of the current proposal:
> >
> > 1. More compact
> >
> > 2. Faster round-trip to XML
> >
> > 3. Otherwise better in round-trip to XML (accuracy is the only
> >    aspect we've discussed.)
> >
> > 4. Faster generation/parsing with text-based APIs.
> >
> > 5. Otherwise better with text-based APIs.
> >
> > I would argue that an IEEE 754 representation would have at least
> > these advantages, in addition to coopting all the above:
> >
> > 5. It's a standard.
> >
> > 6. It's the native representation on almost all computers.
> >
> > 7. Faster with binary APIs.
> >
> > 8. Otherwise better with binary APIs.
> >
> > Items 1, 2, 4 and 7 are quantitative and should have been measured.
> > Items 2 and 4 and items 3 and 5 are mechanically the same but would
> > have different weights in the final evaluation.
> >
> > In combining these items to reach a final conclusion, I give much
> > higher weight to 7 and 8 (binary APIs) than 4 and 5 (text APIs), and
> > no weight to 2 (XML), because EXI is needed where efficiency matters
> > and, if successful, will be used most heavily where the entire path
> > is most efficient.
> >
> > > We thank you for your insight on this issue, and express our
> > > appreciation for your verve in the whole discussion on this and
> > > related topics.
> > >
> > > Regards,
> > >
> > > Taki Kamiya (for the EXI Working Group)
> >
> > I'm grateful for the opportunity to participate and hope to make
> > what little contribution I can to the ultimate success of EXI.
> >
> > Paul Pierce
Received on Saturday, 10 October 2009 00:17:54 UTC