Binary module - number endianness

*Warning - heated arguments may ensue...*

In the first draft of the Binary Module <http://expath.org/spec/binary>, 
conversion between binary and numeric (integer, float, double) forms 
was proposed to default to 'little-endian', with an extended form of 
the conversion functions (an additional argument) supporting big-endian 
storage in the binary form (e.g. |bin:unpack-float($bin)| is 
'little-endian', |bin:unpack-float($bin,true())| is 'big-endian'). As far 
as I can find, no comments have been raised about this. Given that 
the choice of default influences 'code length' quite considerably (unless 
we define suitably named wrappers), there should be some 
discussion/consensus on which default should be chosen.
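
To make the effect of the default concrete, here is a minimal sketch in Python (not part of the draft). The function |unpack_float| is a hypothetical stand-in that only mimics the proposed signature, with the optional flag selecting big-endian; it is an illustration, not the module's implementation.

```python
import struct

def unpack_float(data: bytes, big_endian: bool = False) -> float:
    """Hypothetical stand-in for the draft's bin:unpack-float.

    Little-endian by default, as in the first draft; the optional
    second argument selects big-endian.
    """
    fmt = ">f" if big_endian else "<f"
    return struct.unpack(fmt, data)[0]

# The IEEE 754 single-precision bit pattern for 1.0, written as hex.
raw = bytes.fromhex("3F800000")

big = unpack_float(raw, True)  # bytes read in the order they are written
little = unpack_float(raw)     # same bytes, least significant byte first

print(big, little)
```

The same four bytes give 1.0 under one reading and a tiny denormal number under the other, so whichever form is not the default will carry the extra argument at every call site.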

One way to think about it is to ask where the 'number' that has to be 
packed or unpacked comes from, or is going to. One can argue that in the 
environment of XSLT/XQuery/XPath execution, any endianness in the machine 
numbers defined as constants or computed as |xs:number| types is 
irrelevant, as they have no sense of 'address' accessible through the X* 
execution model.  Whilst the x86 architecture is little-endian, and ARM 
is now bi-endian, does this have any significant impact on performance at 
the higher levels of the execution models we anticipate for this module?

If the binary data is being consumed from some outside source or written 
to the same, then we have a variety of different contexts:

  * Numbers in network data (RFC 1700
    <http://www.ietf.org/rfc/rfc1700.txt>) are usually expected to be
    big-endian.
  * Image formats vary: JPEG and PNG
    <http://www.w3.org/TR/2003/REC-PNG-20031110/#7Integers-and-byte-order>
    are big-endian, BMP and GIF are little-endian, TIFF can be either (and
    indicates which with a specific palindromic marker, 'II' or 'MM').
  * Audio and video formats vary considerably.
  * Some formats that appear binary(ish) don't have endian issues: e.g.
    Postscript and PDF have (uncompressed) numbers encoded as ASCII
    (decimal) strings.
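
As a small illustration of the first two cases, the following Python sketch serialises one 32-bit unsigned integer in both byte orders. The value 640 is an arbitrary example of mine (think of an image width), not taken from any particular file.

```python
import struct

width = 640  # arbitrary example value, 0x00000280

png_style = struct.pack(">I", width)  # big-endian, as PNG stores integers
bmp_style = struct.pack("<I", width)  # little-endian, as BMP stores integers
network = struct.pack("!I", width)    # '!' is struct's network byte order

print(png_style.hex(), bmp_style.hex(), network.hex())
```

Network order and the PNG form produce identical bytes; the BMP form is the same bytes reversed.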


      Other notes:

In cases where endianness can vary between data instances, such as TIFF, 
some global or tunnelled variable (XSLT) could be set and referenced, e.g.:

    <xsl:variable name="BIG"
    select="bin:subsequence($tiff, location, 2) = bin:hex('4D4D')"/>
    ...
    bin:unpack-unsigned-integer($tiff, $loc, $len, $BIG)......

or, in XPath 3.0, partially applied ('curried') functions could be used:

    ...
    <xsl:variable name="bin:unpack-uint"
    select="bin:unpack-unsigned-integer(?,?,?,$BIG)"/>
    ...
    $bin:unpack-uint($tiff, $loc, $len)......
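
For readers who want to try the idea outside X*, the same pattern can be sketched in Python with |functools.partial|. The helper |unpack_unsigned_integer| is hypothetical and only mirrors the argument order used above; the 8-byte header is a hand-made big-endian TIFF header (marker 'MM', the number 42, then a first-IFD offset that I have arbitrarily set to 8).

```python
from functools import partial

def unpack_unsigned_integer(data: bytes, offset: int, length: int,
                            big_endian: bool) -> int:
    """Hypothetical helper mirroring bin:unpack-unsigned-integer."""
    order = "big" if big_endian else "little"
    return int.from_bytes(data[offset:offset + length], order)

# Hand-made big-endian TIFF header: 'MM', 42, first-IFD offset 8.
tiff = bytes.fromhex("4D4D002A00000008")

# Decide the byte order once, from the two-byte marker at the start.
BIG = tiff[0:2] == bytes.fromhex("4D4D")

# Fix the byte-order argument, leaving the other three open.
unpack_uint = partial(unpack_unsigned_integer, big_endian=BIG)

magic = unpack_uint(tiff, 2, 2)
ifd_offset = unpack_uint(tiff, 4, 4)
print(BIG, magic, ifd_offset)
```

Once the byte order is bound, the remaining calls no longer mention it, which is the 'code length' saving that a default would otherwise have to provide.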

I assume byte-order-mark (BOM) labelling of encoded XML is not 
relevant to this issue, as such XML won't be generated or consumed in a 
binary manner. Unless of course a multi-encoding data source is 
encountered, when binary pre-splitting may be required.


      Question:

So the question is: what are the use cases that would make /extensive/ use 
of numeric packing and unpacking into binary file forms, presumably for 
interfacing with other (non-network?) applications? And what are the 
endianness defaults for such applications?

[My preference would be big-endian, if only for network-order 
compatibility and also given that the proposed 'string-constant' 
functions |bin:hex('FACE9D78')|, etc. will treat their numbers as 
big-endian...]
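
The point about hex constants can be checked with a few lines of Python (an illustration only; it says nothing about how |bin:hex| is implemented). Reading the bytes of 'FACE9D78' in written order reproduces the number as written only under a big-endian interpretation.

```python
raw = bytes.fromhex("FACE9D78")

as_big = int.from_bytes(raw, "big")        # bytes taken in written order
as_little = int.from_bytes(raw, "little")  # bytes taken in reverse order

print(hex(as_big), hex(as_little))
```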

*John Lumley* MA PhD CEng FIEE
john@saxonica.com <mailto:john@saxonica.com>
on behalf of Saxonica Ltd

Received on Monday, 15 July 2013 16:19:36 UTC