- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Wed, 25 Apr 2001 15:26:44 -0600
- To: Noah_Mendelsohn@lotus.com
- Cc: cbf@isovia.com, www-xml-schema-comments@w3.org
At 2001-04-19 17:10, Noah_Mendelsohn@lotus.com wrote:

>Michael Sperberg-McQueen writes:
>
> >> 2 schema authors can define a 'binary' type as a union
> >> of the hex and base64 types, so they can in fact
> >> just say 'binary' if they wish
>
>Dangerous, I think. Doesn't 7A8B mean different things in the two
>encodings? The union is unlikely to do what you expect.

Yes, it does mean different things:

    0111 1010 1000 1011 in hex,
    111011 000000 111100 000001 in base 64.

In production systems, I would expect users normally to be instructed
always to specify which encoding is used (using xsi:type) in order to
ensure that any ambiguity is resolved correctly.

In actual use, however, the ambiguity does not seem likely to be as
big a problem as one might fear. It is statistically unlikely that
any binary value of any length would be ambiguous in practice, because
the probabilities are against (for longer values, very strongly
against) any base64-encoded binary value being a legal hex-encoding
value.

Assume that a binary value has random length (measured in octets).
There is then at least a 2/3 chance that the value is not a legal hex
encoding, because for n octets, if (n mod 3) is 1 or 2 the base-64
value will end in "=".

For octet strings with (length mod 3) = 0, and random bit patterns,
the chances that the string's legal base-64 encoding will also be a
legal hex encoding are 1/4 to the power (n * 4 / 3). For a binary
value of, say, fifteen octets (representing, let us say, a character
image for an old PC- or terminal-based font) the chances that a random
value will have an ambiguous base-64 encoding are 1 / 4 ** 20, or
9.094947017729282e-013.

So on the whole, I think a union type with hex first and base 64
second is safer than it looks.
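
The arithmetic above can be checked with a short sketch. The helper
names (bits, p_ambiguous) and the hex_chars parameter are assumptions
made for illustration only; they are not part of any schema processor.
The default of 16 reproduces the 1/4 per-character figure (16 of the
64 base-64 characters); other counts, for instance 22 if both upper-
and lowercase a-f are counted as hex digits, can be tried the same way.

```python
import base64
from fractions import Fraction


def bits(data: bytes, group: int) -> str:
    """Render bytes as a bit string split into groups of `group` bits."""
    s = "".join(f"{b:08b}" for b in data)
    return " ".join(s[i:i + group] for i in range(0, len(s), group))


# The same lexical form read under the two encodings.
lexical = "7A8B"
as_hex = bytes.fromhex(lexical)       # two octets
as_b64 = base64.b64decode(lexical)    # three octets

print(bits(as_hex, 4))   # hex digits carry 4 bits each
print(bits(as_b64, 6))   # base-64 characters carry 6 bits each


def p_ambiguous(n_octets: int, hex_chars: int = 16) -> Fraction:
    """Chance that the base-64 form of a random n-octet value consists
    only of hex digits. hex_chars is an assumed count of base-64
    characters that are also legal hex digits (16 gives 1/4 each)."""
    if n_octets % 3:
        # Encoding is padded with "=", so it can never be legal hex.
        return Fraction(0)
    n_chars = n_octets * 4 // 3
    return Fraction(hex_chars, 64) ** n_chars


print(float(p_ambiguous(15)))   # fifteen octets, 20 characters
print(float(p_ambiguous(3)))    # a 24-bit value, 4 characters
```

Using Fraction keeps the probabilities exact until they are printed,
so the fifteen-octet case is exactly 1 / 4 ** 20.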

With short strings like the one you specify, the ambiguity between
hex and base 64 is more serious (there is almost one half of one
percent chance that a random 24-bit string will have an ambiguous
base-64 encoding), but with longer strings I think the union type
hypothesized in my note to Charles has a reasonably good chance of
being handled correctly even if a user accidentally forgets to specify
xsi:type.

>Also: I noticed while checking this that neither form of binary has a
>canonical form. Doesn't hex allow a choice of upper/lowercase for the
>alphabetics, and what about embedded whitespace in the base 64 (the RFC
>seems to allow embedded white space...I presume we do too?) If I am
>right, I think these should be added to the list of issues for the
>erratum.

The WG agreed the other day to make uppercase the canonical form for
hex-encoded binary. I agree that defining a canonical form for base 64
would be a desirable thing for the errata list.

-CMSMcQ
Received on Wednesday, 25 April 2001 17:42:50 UTC