Re: Objection to hexBinary and base64Binary from C. M. Sperberg-McQueen on 2001-04-25 (www-xml-schema-comments@w3.org from April to June 2001)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Wed, 25 Apr 2001 15:26:44 -0600
To: Noah_Mendelsohn@lotus.com
Cc: cbf@isovia.com, www-xml-schema-comments@w3.org
Message-Id: <4.3.2.7.1.20010425144832.02790f10@espanola.com>

At 2001-04-19 17:10, Noah_Mendelsohn@lotus.com wrote:
>Michael Sperberg-McQueen writes:
>
> >> 2 schema authors can define a 'binary' type as a union
> >>      of the hex and base64 types, so they can in fact
> >>     just say 'binary' if they wish
>
>Dangerous, I think.  Doesn't  7A8B mean different things in the two
>encodings?  The union is unlikely to do what you expect.

Yes, it does mean different things: 0111 1010 1000 1011 in hex, 111011
000000 111100 000001 in base 64.  In production systems, I would
expect users normally to be instructed always to specify which
encoding is used (using xsi:type) in order to ensure that any
ambiguity is resolved correctly.
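
The two readings of "7A8B" can be checked mechanically. The following
is a minimal sketch using only the Python standard library; the helper
name to_bits is invented here for illustration and is not part of any
schema tooling.

```python
import base64

lexical = "7A8B"


def to_bits(data: bytes) -> str:
    """Render each octet as eight binary digits (illustrative helper)."""
    return " ".join(f"{octet:08b}" for octet in data)


# Read as hex: two characters per octet, so four characters give 2 octets.
hex_value = bytes.fromhex(lexical)

# Read as base 64: six bits per character, so four characters give 3 octets.
b64_value = base64.b64decode(lexical, validate=True)

print(len(hex_value), to_bits(hex_value))
print(len(b64_value), to_bits(b64_value))
```

The same four characters yield a two-octet value under hex and a
three-octet value under base 64, so the two interpretations differ in
length as well as in content.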

In actual use, however, the ambiguity does not seem likely to be as
big a problem as one might fear.  It is statistically unlikely that a
binary value of any substantial length would be ambiguous in practice,
because the probabilities are against (for longer values, very
strongly against) any base64-encoded binary value also being a legal
hex-encoded value.

Assume that a binary value has random length (measured in octets).
There is then at least a 2/3 chance that its base-64 encoding is not a
legal hex encoding, because for n octets, if (n mod 3) is 1 or 2 the
base-64 value will end in "=".  For octet strings with (length mod 3)
= 0 and random bit patterns, each base-64 character has a 1/4 chance
of being a hex digit (16 of the 64 characters, counting uppercase A-F
only), so the chance that the string's base-64 encoding will also be a
legal hex encoding is 1/4 to the power (n * 4 / 3).  For a binary
value of, say, fifteen octets (representing, let us say, a character
image for an old PC- or terminal-based font) the chance that a random
value will have an ambiguous base-64 encoding is 1 / 4 ** 20, or
9.094947017729282e-13.
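
The arithmetic can be reproduced with a short sketch.  The function
name ambiguity_probability is hypothetical, and the sketch assumes, as
the 1/4 figure does, that only the sixteen characters 0-9 and A-F
count as hex digits; if lowercase a-f were accepted as well, the
per-character chance would be 22/64 instead.

```python
from fractions import Fraction


def ambiguity_probability(n_octets: int) -> Fraction:
    """Chance that the base-64 encoding of n random octets is also legal hex.

    Assumption: only 0-9 and A-F count as hex digits (16 of 64 characters).
    """
    if n_octets % 3 != 0:
        # The encoding ends in "=" padding, which is never a hex digit.
        return Fraction(0)
    n_chars = 4 * n_octets // 3  # four base-64 characters per three octets
    return Fraction(16, 64) ** n_chars


print(ambiguity_probability(3))           # the 24-bit case
print(float(ambiguity_probability(15)))   # the fifteen-octet case
```

Because 4n/3 is always even when n is a multiple of 3, the requirement
that a hex value have an even number of digits adds no further
constraint here.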

So on the whole, I think a union type with hex first and base 64
second is safer than it looks.  With short strings like the one you
specify, the ambiguity between hex and base 64 is more serious (there
is roughly a 0.4 percent chance, i.e. 1 in 256, that a random 24-bit
string will have an ambiguous base-64 encoding), but with longer
strings I think the union type hypothesized in my note to Charles has
a reasonably good chance of correct handling even if a user
accidentally forgets to specify xsi:type.
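
As a rough illustration of how such a union would behave without
xsi:type, the sketch below tries the hex interpretation first and
falls back to base 64.  It is not a schema processor: the function
name parse_binary_union and the returned labels are invented for this
example, and it simplifies by rejecting embedded whitespace in base 64.

```python
import base64
import binascii
import re

# An even number of hex digits; both cases accepted in this sketch.
HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def parse_binary_union(lexical: str):
    """Try hex first, then base 64, mimicking member order in a union."""
    if HEX_RE.fullmatch(lexical):
        return "hex", bytes.fromhex(lexical)
    try:
        # validate=True rejects characters outside the base-64 alphabet,
        # including whitespace (a simplification made for this sketch).
        return "base64", base64.b64decode(lexical, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not hex or base 64: {lexical!r}") from exc


print(parse_binary_union("7A8B"))   # legal in both; hex wins by order
print(parse_binary_union("aGk="))   # padded, so only base 64 applies
```

Any value that is legal in both encodings is silently read as hex,
which is exactly the case the probabilities above suggest is rare for
longer values.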

>Also:  I noticed while checking this that neither form of binary has a
>canonical form.  Doesn't hex allow a choice of upper/lowercase for the
>alphabetics, and what about embedded whitespace in the base 64 (the RFC
>seems to allow embedded white space...I presume we do too?)  If I am
>right, I think these should be added to the list of issues for the
>erratum.

The WG agreed the other day to make uppercase the canonical form for
hex encoded binary.  I agree that defining a canonical form for base 64
would be a desirable thing for the errata list.

-CMSMcQ

Received on Wednesday, 25 April 2001 17:42:50 UTC