Re: Draft of Binary module from Adam Retter on 2013-03-13 (public-expath@w3.org from March 2013)

From: Adam Retter <adam@exist-db.org>
Date: Wed, 13 Mar 2013 17:20:34 +0000
To: Michael Kay <mike@saxonica.com>
Cc: public-expath@w3.org
Message-ID: <CAJKLP9bAMu4LwtKH+OLJyYNuN-m=jzm=tdGtbmuQb95P2VRKcQ@mail.gmail.com>

A second stream of consciousness:

I feel very uneasy about representing octets using integers. So I ask
myself why? Is it history? That is, eXist and Zorba (maybe others)
represent octets as octal strings and allow for the conversion of
octal to/from decimal with functions.

An octet is implicitly in Base 8, so why would I then want to
manipulate it as though it were Base 10 (but without converting it to
Base 10)? This just doesn't make sense to me.

If I understand correctly -

bin:binary-to-octets(xs:hexBinary("FFFF"))  would give me (255, 255)

The problem is that I now have two Base 8 values in a Base 10 data
type, and all of the operators I have to work with this stuff
(xs:integer) operate on Base 10 and so will not understand what I am
trying to do with my Base 2 stuff.

Rather, if you want to go to Base 10, then
bin:binary-to-radix(xs:hexBinary("FFFF"), 10) could give me 65535.
At least when it's in Base 10 I can do math with it using the standard
XQuery operators and then convert it back to xs:hexBinary if I wish
to. I do not like values in disguise!
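
To make the two readings above concrete, here is a rough sketch. It is
in Python rather than XQuery only so that it can be run as-is; the
helper names are hypothetical and are not part of the draft module. It
assumes the binary value is read as an unsigned, big-endian number.

```python
from typing import List


def binary_to_octets(hex_text: str) -> List[int]:
    """One integer per byte (the per-octet reading)."""
    return list(bytes.fromhex(hex_text))


def binary_to_integer(hex_text: str) -> int:
    """The whole value as one unsigned big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_text), "big")


def integer_to_hex(value: int, length: int) -> str:
    """Back to hex text, padded to `length` bytes."""
    return value.to_bytes(length, "big").hex().upper()


octets = binary_to_octets("FFFF")          # [255, 255]
whole = binary_to_integer("FFFF")          # 65535
round_trip = integer_to_hex(whole - 1, 2)  # "FFFE"
```

With the whole-value form, ordinary arithmetic (here, subtracting 1)
works and the result converts straight back to hex; with the per-octet
list, the caller has to handle carries between the two 255s by hand.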

All of the bitwise operators seem to operate on xs:hexBinary, so how
would I even work with these "octets" as integers? Perhaps we don't
even need a binary-to-octets function? Is there a use-case?

If we need it, perhaps a better representation of 'octets' is to use a
representation which is in the correct base i.e. base 8, and so which
should use an octal string (which could start with a 0 or similar)
e.g. -
bin:binary-to-octets(xs:hexBinary("FFFF")) would give me "0177777"
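
A minimal sketch of that octal-string form, again in Python and with
hypothetical helper names. It assumes the whole value is read as one
unsigned big-endian number, which is what "0177777" implies, rather
than one octal string per byte.

```python
def binary_to_octal_string(hex_text: str) -> str:
    """Render the whole binary value as octal digits with a leading 0."""
    value = int.from_bytes(bytes.fromhex(hex_text), "big")
    return "0" + format(value, "o")


def octal_string_to_hex(octal_text: str, length: int) -> str:
    """Parse an octal string and return hex text padded to `length` bytes."""
    return int(octal_text, 8).to_bytes(length, "big").hex().upper()


as_octal = binary_to_octal_string("FFFF")     # "0177777"
back_to_hex = octal_string_to_hex(as_octal, 2)  # "FFFF"
```

Note that the byte length has to be carried separately on the way
back, since the octal string alone does not say how many leading zero
bytes the original value had.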

I recognise that the underlying problem is that XQuery does not have a
'byte' type, however using xs:integer for this feels wrong to me
somehow (at least if you don't convert to the correct base). Also from
a selfish implementation point of view - using an xs:integer for
representing a single byte is very wasteful in terms of memory.
xs:integer is unbounded in scope, and will typically be at least 32
bits, whereas a byte is just 8 bits and a character in a string can be
just 8 or 16 bits.

On 13 March 2013 14:29, Adam Retter <adam@exist-db.org> wrote:
>>> Wow that's quite comprehensive :-)
>>> I will need to digest it fully yet, but I have a few initial questions -
>>>
>>> 1) Why the use of xs:hexBinary when most other EXPath function
>>> libraries (and in fact most 3rd party XQuery functions) I have seen
>>> use xs:base64Binary? Converting from one to the other is something
>>> that you *really* don't want to have to do, especially for large files!
>>
>> I think converting between hexBinary and base64Binary should be pretty much
>> a no-op for most processors: the internal representation of the value is
>> likely to be an immutable byte array, and conversion just means creating a
>> new wrapper around the byte array. But it's a user inconvenience. Actually
>> for input parameters, I don't see why we shouldn't accept either form.
>
> Hehe well of course it depends on implementation. I would advise from
> experience that you should not keep the data in RAM as an immutable
> byte array, certainly in eXist this is what we used to do, as soon as
> you have a few large files you will quickly run out of memory and
> crash your processor. So what we rather do now, is in fact use an
> InputStream with some clever stuff to make it re-readable at any time,
> keeping a minimal amount in RAM and the rest either on disk or
> elsewhere - this is all open source of course, so if you're interested I
> can point you at the code which should be re-useable outside eXist
> too.
>
> My concern regarding the conversion was rather that if a user puts an
> intermediate step between xs:hexBinary and xs:base64Binary, perhaps
> xs:string - it may not be clear to them but they will pay a heavy
> price to encode the raw binary, to hexBinary, decode it and then
> re-encode it as base64Binary. I really wish we could just settle on
> one binary type in XQuery. I would be interested to know a little
> history about why we have two of them Mike?
>
>>> I'm just reading through the rest now, my main concern is that these
>>> operations can be done efficiently. I have been re-working the
>>> implementation of the common Java code for the EXPath http module to
>>> support streaming of large binary values and large string values. We
>>> have customers that want to work with binary and text documents that
>>> are several gigabytes each from XQuery.
>>>
>>>
>> Interesting question. I don't know how efficient direct access to binary
>> files is; if it's OK, then one could easily have an internal implementation
>> of a base64Binary value that's mapped directly to a file rather than to
>> memory, and perform all the operations directly on the file.
>
> Yup. See above.
>
>> But if
>> efficiency means maintaining a current position in the file and reading
>> what's at the current position, then that complicates the interface
>> considerably. It could be done using higher-order functions, but would be a
>> bit mind-blowing. Although we've got functions with side-effects in the File
>> module, they are external side-effects, and I'd be reluctant to design
>> anything with internal side-effects, e.g. on the current position of a file
>> handle.
>
> Well with the EXPath HTTP Client module, I did not need to re-design
> the function signatures, merely I adapted the implementation to do
> streaming. So my comment was more, that I would have a look and see if
> there was anything we could not do using the proposed binary function
> signatures in a streaming manner...
>
>> Michael Kay
>> Saxonica
>>
>>


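For the re-readable stream idea mentioned in the quoted text above
(keep a small amount in RAM, spill the rest to disk), here is a rough
sketch. It is not the eXist code; it is a Python stand-in that leans
on the standard library's SpooledTemporaryFile, and the function name
and default sizes are made up for illustration.

```python
import io
import tempfile


def cache_stream(source, max_in_memory=1024 * 1024, chunk_size=64 * 1024):
    """Copy a one-shot binary stream into a seekable, re-readable cache.

    Up to max_in_memory bytes stay in RAM; beyond that the cache rolls
    over to a temporary file on disk.
    """
    cache = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        cache.write(chunk)
    cache.seek(0)  # rewind so the caller can start reading immediately
    return cache


# Usage: a tiny in-memory limit forces the rollover to disk.
cached = cache_stream(io.BytesIO(b"\xff" * 10), max_in_memory=4, chunk_size=3)
first_read = cached.read()
cached.seek(0)
second_read = cached.read()
```

Both reads return the same ten bytes, so the value can be consumed any
number of times without holding all of it in memory.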

--
Adam Retter

eXist Developer
{ United Kingdom }
adam@exist-db.org
irc://irc.freenode.net/existdb
Received on Wednesday, 13 March 2013 17:21:05 UTC