"Number representation" from Paul Pierce on 2007-10-11 (public-exi@w3.org from October 2007)

From: Paul Pierce <prp@teleport.com>
Date: 11 Oct 2007 06:20:25
To: "W3C EXI Public" <public-exi@w3.org>
Message-ID: <20071009T191510Z.5.prp@teleport.com>
I would like to suggest taking a different approach for representation of the
built-in decimal datatype. The method specified in the draft is clever and
compact for a specific interesting case, but here (as elsewhere in the spec)
there is an unwarranted dependency on base 10 representation. Since most
EXI-formatted data will be produced and consumed in binary and never be
translated to/from XML, I think it would be better to consider a pure binary
representation, perhaps a fraction using a pair of integers as numerator and
denominator.

The draft method can represent exact numeric values that were originally
specified in decimal form very compactly. However, it is inefficient and
fragile in requiring the encoder and decoder to translate the fraction to
and from decimal, and to reverse and trim digits. Representation of
originally binary fractional data may be very inefficient.
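
To make the digit handling concrete, here is a minimal Python sketch of the
kind of work described above: split off the fraction, trim trailing zeros,
and reverse the remaining digits so that leading zeros survive as an integer.
It is an illustration only. The function names are hypothetical and the exact
field layout of the draft is not reproduced here.

```python
from decimal import Decimal

def draft_style_parts(text):
    """Split a decimal string into (negative, integral, reversed_fraction)."""
    value = Decimal(text)
    negative = value < 0
    digits = format(abs(value), "f")            # plain notation, no exponent
    integral, _, fraction = digits.partition(".")
    fraction = fraction.rstrip("0")             # trim trailing zeros
    # Reverse the digits so leading zeros of the fraction are not lost.
    reversed_fraction = int(fraction[::-1]) if fraction else 0
    return negative, int(integral), reversed_fraction

def draft_style_value(negative, integral, reversed_fraction):
    """Rebuild the decimal value from the three parts."""
    fraction = str(reversed_fraction)[::-1] if reversed_fraction else "0"
    value = Decimal(f"{integral}.{fraction}")
    return -value if negative else value

print(draft_style_parts("12.050"))
print(draft_style_value(False, 12, 50))
```

Both directions go through a decimal digit string, which is the translation
cost the paragraph above refers to.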

A binary fraction method can represent the same exact numeric values and more.
It will often require one or two more octets, but with compression, repetitious
denominator values will tend to disappear. In addition, it naturally carries
precision as well as numeric value.

The contrast between the two methods is particularly evident in the very
common case of binary scientific or multimedia data prescaled to the range
(0, 1] or [-1, 1].
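
A small Python sketch of that contrast, assuming an unsigned sample of a given
bit width with the binary point at the far left. The helper names are
hypothetical. The point is only that a k-bit binary fraction generally needs
k decimal digits after the point, while the numerator/denominator pair is the
sample itself plus a power of two.

```python
from fractions import Fraction

def binary_fraction(sample, bits):
    """Numerator/denominator pair for sample / 2**bits."""
    return sample, 1 << bits

def decimal_text(sample, bits):
    """Exact decimal expansion of sample / 2**bits.

    Uses sample / 2**bits == sample * 5**bits / 10**bits.
    """
    whole, frac = divmod(sample * 5**bits, 10**bits)
    return f"{whole}.{str(frac).zfill(bits)}"

print(binary_fraction(1, 8))   # the pair for the smallest 8-bit sample
print(decimal_text(1, 8))      # its exact decimal expansion
```

The two forms denote the same rational number; only the amount of digit
manipulation differs.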


As background, consider that the XML specification does not mention any
representation for numeric values. All values in XML are character strings, in
keeping with the goal that XML be human readable. Data types for numeric
values appear in schema specifications, and most of the EXI datatypes are
(quite properly) motivated by those in XML Schema. Since XML Schema specifies
datatype representation in XML and therefore character strings that are
supposed to be human readable, it is natural that numeric values are specified
to be in decimal notation.

My point here is that the XML Schema datatype decimal and its derivatives
(integer etc.) are intended to represent the rational numbers. The spec says,
"a subset of the real numbers, which can be represented by decimal numerals",
but the restriction to decimal is an artifact of the necessity of using a
human readable character representation. I wouldn't argue that it's necessary
to include those rational numbers that can't be represented in decimal, but
just that it's not necessary to cater to XML Schema's restrictions.


Most of the use cases in the Binary Characterization reports tend to either
use integers or express values in floating point. I think EXI will open the
door to many other uses that now have ad-hoc formats. Still, there are two
use cases in the reports that are good examples of the need for rational
number values. The financial data use case requires a way to express exact
decimal fractional values of currency. The sensor data use case might have
need of a simple way to express prescaled binary data.

The following table shows a comparison of compactness between the draft
method using decimal based encoding of the fraction and a binary method
using the same sequence of EXI Boolean, Integer and Unsigned Integer but
where the Integer is the numerator and the Unsigned Integer is the
denominator which, when divided, yield the rational number value.

The finance test datasets consist of numbers between 1 and 1000000, with
more smaller numbers than larger, to a precision of 2 or 3 places after
the decimal point.

The binary test datasets consist of binary values of 8, 12 or 16 bits with
the binary point at the far left so the values fall in the range (0, 1].
There are two cases, fixed precision where all bits are retained and
"adaptive" precision where trailing zeroes are trimmed. (The finance datasets
and draft method are both always adaptive.)
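
The difference between the fixed and "adaptive" cases can be sketched in
Python as follows. This is an assumed reading of "trailing zeroes are
trimmed": common factors of two are removed from the numerator and the
power-of-two denominator. The function name is hypothetical.

```python
def trim_trailing_zero_bits(numerator, denominator):
    """Drop trailing zero bits shared by numerator and a power-of-two denominator."""
    while denominator > 1 and numerator % 2 == 0:
        numerator //= 2
        denominator //= 2
    return numerator, denominator

fixed = (128, 256)                       # 8-bit sample, all bits retained
adaptive = trim_trailing_zero_bits(*fixed)
print(fixed, adaptive)
```

In the fixed case the denominator records that the sample had 8 bits of
precision; in the adaptive case the same value is carried by smaller integers.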

The first four columns of results give the average number of bits required
to express a number in the dataset, with or without compression. The last
two columns show the percentage advantage of the more compact representation,
where parentheses indicate the advantage goes to the EXI draft method.
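
The message does not state the formula for the advantage columns. The Python
sketch below assumes it is the ratio of the larger average to the smaller one,
minus one, as a percentage. Applied to the rounded averages in the table, this
assumption matches the Encoded Advantage column; the Compressed Advantage
column can differ in the last digit, presumably because the published averages
are rounded.

```python
def advantage_percent(decimal_bits, binary_bits):
    """Return (winner, percent) under the assumed ratio-based formula."""
    larger = max(decimal_bits, binary_bits)
    smaller = min(decimal_bits, binary_bits)
    winner = "decimal" if decimal_bits < binary_bits else "binary"
    return winner, round((larger / smaller - 1) * 100, 1)

print(advantage_percent(21.8, 28.8))   # Finance Cents, encoded
print(advantage_percent(38.4, 29.0))   # Binary 8-bits fixed, encoded
```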

                           Encoded (bits)   Compressed (bits)      Advantage (%)
Dataset                   Decimal   Binary   Decimal  Binary    Encoded  Compressed
Finance Cents               21.8     28.8     22.6     23.2      (32.1)     (2.7)
Finance Mils                28.8     39.4     27.3     27.5      (36.8)     (0.6)
Binary  8-bits fixed        38.4     29.0     14.3     12.9       32.4      11.0 
Binary 12-bits fixed        54.3     32.7     21.0     17.1       66.1      22.5 
Binary 16-bits fixed        69.3     47.0     37.0     23.1       47.4      60.5 
Binary  8-bits adaptive     38.4     25.0     14.3     12.7       53.6      12.1 
Binary 12-bits adaptive     54.4     32.0     21.0     17.4       70.0      20.7 
Binary 16-bits adaptive     69.3     43.9     37.0     23.4       57.9      57.9 

  Source material for this test is at http://www.bx-lib.org

The results clearly show that each method works best with its own kind of data. 
But the draft method does much worse on binary data than the binary method does
on decimal data. And with compression the binary method never performs poorly.

The binary method is simple and requires no more bits in the processing than
are in the data. The draft method is more complex, and can only process up to
9-bit binary data in 32-bit arithmetic with straightforward algorithms. This
shows that in addition to being slightly more complex, it is also fragile.
Implementations using the obvious algorithms are likely to fail on certain
kinds of data.
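
One plausible reading of the 9-bit limit, sketched in Python: a k-bit binary
fraction expands to k decimal digits, and holding those digits (reversed) in a
single unsigned 32-bit integer works only while every such value stays below
2**32. The helper names are hypothetical and the check is an illustration, not
the test code behind the table.

```python
UINT32_MAX = 2**32 - 1

def reversed_fraction(sample, bits):
    """Reversed decimal fraction digits of sample / 2**bits, as an integer."""
    digits = str(sample * 5**bits).zfill(bits).rstrip("0")
    return int(digits[::-1]) if digits else 0

def fits_uint32(bits):
    """True if every sample below 1.0 at this width fits in 32 bits."""
    return all(reversed_fraction(s, bits) <= UINT32_MAX
               for s in range(1, 2**bits))

for bits in (8, 9, 10):
    print(bits, fits_uint32(bits))
```

At 10 bits every odd sample has a 10-digit fraction ending in 5, so its
reversed form starts with 5 and exceeds the 32-bit range.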


Based on these arguments and examples I feel the working group should reconsider
the representation and maybe the name of the EXI Decimal datatype. I would
propose a binary fraction as an alternative representation, as I haven't been
able to think of anything better in terms of simplicity and efficiency.

Paul Pierce
Received on Friday, 12 October 2007 20:35:07 UTC