- From: Paul Pierce <prp@teleport.com>
- Date: 11 Oct 2007 06:20:25
- To: "W3C EXI Public" <public-exi@w3.org>
I would like to suggest taking a different approach to the representation of the decimal built-in datatype. The method specified in the draft is clever and compact for a specific interesting case, but here (as elsewhere in the spec) there is an unwarranted dependency on base 10 representation. Since most EXI-formatted data will be produced and consumed in binary and never be translated to/from XML, I think it would be better to consider a pure binary representation, perhaps a fraction using a pair of integers as numerator and denominator.

The draft method can represent exact numeric values that were originally specified in decimal form very compactly. However, it is inefficient and fragile in requiring the encoder and decoder to translate the fraction to and from decimal, and to reverse and trim digits. Representation of originally binary fractional data may be very inefficient.

A binary fraction method can represent the same exact numeric values and more. It will often require one or two more octets, but with compression repeated denominator values will tend to disappear. In addition, it naturally carries precision as well as numeric value. The contrast between the two methods is particularly evident in the very common case of binary scientific or multimedia data prescaled to the range (0, 1] or [-1, 1].

As background, consider that the XML specification does not mention any representation for numeric values. All values in XML are character strings, in keeping with the goal that XML be human readable. Data types for numeric values appear in schema specifications, and most of the EXI datatypes are (quite properly) motivated by those in XML Schema. Since XML Schema specifies datatype representation in XML, and therefore as character strings that are supposed to be human readable, it is natural that numeric values are specified to be in decimal notation. My point here is that the XML Schema datatype decimal and its derivatives (integer etc.) are intended to represent the rational numbers. The spec says, "a subset of the real numbers, which can be represented by decimal numerals", but the restriction to decimal is an artifact of the necessity of using a human-readable character representation. I wouldn't argue that it's necessary to include those rational numbers that can't be represented in decimal, but just that it's not necessary to cater to XML Schema's restrictions.

Most of the use cases in the Binary Characterization reports tend to either use integers or express values in floating point. I think EXI will open the door to many other uses that now have ad-hoc formats. Still, there are two use cases in the reports that are good examples of the need for rational number values. The financial data use case requires a way to express exact decimal fractional values of currency. The sensor data use case might need a simple way to express prescaled binary data.

The following table compares the compactness of the draft method, which uses a decimal-based encoding of the fraction, with a binary method that uses the same sequence of EXI Boolean, Integer and Unsigned Integer, but where the Integer is the numerator and the Unsigned Integer is the denominator; dividing one by the other yields the rational number value.

The finance test datasets consist of numbers between 1 and 1000000, with more smaller numbers than larger, to a precision of 2 or 3 places after the decimal point. The binary test datasets consist of binary values of 8, 12 or 16 bits with the binary point at the far left, so the values fall in the range (0, 1]. There are two cases: fixed precision, where all bits are retained, and "adaptive" precision, where trailing zeroes are trimmed. (The finance datasets and draft method are both always adaptive.) The first four columns of results give the average number of bits required to express a number in the dataset, with or without compression.
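To make the two representations being compared concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the draft's normative algorithm: the helper names `decimal_fields` and `binary_fields` are hypothetical, the sign and the variable-length integer encoding are left out, and the decimal side simply follows the "reverse and trim digits" description above.

```python
from fractions import Fraction


def decimal_fields(numerator: int, denominator: int) -> tuple[int, int]:
    """Return (integral part, fraction digits in reverse order) for a
    non-negative rational with a finite decimal expansion.

    Hypothetical helper: it mirrors the "reverse and trim digits" step
    described in the message, not the draft's normative encoding.
    """
    if numerator < 0 or denominator <= 0:
        raise ValueError("sketch handles non-negative values only")
    value = Fraction(numerator, denominator)

    # A finite decimal expansion exists only if the reduced denominator
    # has no prime factors other than 2 and 5.
    reduced = value.denominator
    for prime in (2, 5):
        while reduced % prime == 0:
            reduced //= prime
    if reduced != 1:
        raise ValueError("value has no finite decimal expansion")

    integral = value.numerator // value.denominator
    frac = value - integral
    digits = []
    while frac:  # peel off one decimal digit per iteration
        frac *= 10
        digit = frac.numerator // frac.denominator
        digits.append(str(digit))
        frac -= digit

    # Reversing keeps the fraction's leading zeros (0.05 -> "05" -> 50).
    reversed_digits = int("".join(reversed(digits))) if digits else 0
    return integral, reversed_digits


def binary_fields(numerator: int, denominator: int) -> tuple[int, int]:
    """Binary fraction alternative: keep the pair as-is, no base conversion."""
    return numerator, denominator


if __name__ == "__main__":
    for num, den in [(1999, 100), (255, 256), (1, 256)]:
        print(f"{num}/{den}: decimal fields {decimal_fields(num, den)}, "
              f"binary fields {binary_fields(num, den)}")
```

For 255/256 (0.99609375) the decimal fraction field comes out as 57390699, while the binary pair is just (255, 256). In general an n-bit binary fraction with an odd numerator k needs n decimal fraction digits, because k/2^n = k*5^n/10^n.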
The last two columns show the percentage advantage of the more compact representation, where parentheses indicate the advantage goes to the EXI draft method.

                              Encoded          Compressed       Encoded   Compressed
Dataset                   Decimal  Binary   Decimal  Binary   Advantage   Advantage
Finance Cents                21.8    28.8      22.6    23.2      (32.1)       (2.7)
Finance Mils                 28.8    39.4      27.3    27.5      (36.8)       (0.6)
Binary 8-bits fixed          38.4    29.0      14.3    12.9       32.4        11.0
Binary 12-bits fixed         54.3    32.7      21.0    17.1       66.1        22.5
Binary 16-bits fixed         69.3    47.0      37.0    23.1       47.4        60.5
Binary 8-bits adaptive       38.4    25.0      14.3    12.7       53.6        12.1
Binary 12-bits adaptive      54.4    32.0      21.0    17.4       70.0        20.7
Binary 16-bits adaptive      69.3    43.9      37.0    23.4       57.9        57.9

Source material for this test is at http://www.bx-lib.org

The results clearly show that each method works best with its own kind of data. But the draft method does much worse on binary data than the binary method does on decimal data. And with compression the binary method never performs poorly.

The binary method is simple and requires no more bits in the processing than are in the data. The draft method is more complex, and can only process up to 9-bit binary data in 32-bit arithmetic with straightforward algorithms. This shows that in addition to being slightly more complex, it is also fragile. Implementations using the obvious algorithms are likely to fail on certain kinds of data.

Based on these arguments and examples I feel the working group should reconsider the representation and maybe the name of the EXI Decimal datatype. I would propose a binary fraction as an alternative representation, as I haven't been able to think of anything better in terms of simplicity and efficiency.

Paul Pierce
Received on Friday, 12 October 2007 20:35:07 UTC