Re: "http://www.w3.org/TR/REC-xml#sec-guessing"

From: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>

> There is necessarily guesswork involved -- not on the part of the
> processor following the algorithm described, but certainly on the
> part of the XML community, in taking as a premise the proposition
> that in practice, the only character encodings with which an XML processor
> will ever be confronted are those which the algorithm successfully
> identifies.
> 
> There is no logical necessity for coded character sets, or character
> encodings, to fall into the class of character sets for which the
> algorithm works.

I think Michael is putting the cart before the horse.  Character encodings
on which the algorithm fails *must* be excluded.  Excluding a category
which has no known members should be a no-brainer: there are no
IANA-registered encodings which keep the characters of
"<?xml version="1.0" encoding="
in their ASCII positions but swap around other ASCII character
positions.  
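For readers who have not looked at sec-guessing: the detection rests on the fact that the first bytes of a well-formed XML document (a byte-order mark, or the bytes spelling "<?xml") are distinctive in each supported encoding family.  A minimal sketch of that byte-pattern test, in Python (the function name and the simplified family labels are mine, not the spec's; the real Appendix F table covers more cases):

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document,
    following the spirit of Appendix F of the XML 1.0 Recommendation."""
    # With a byte-order mark: longer BOMs must be tested first.
    boms = {
        b"\x00\x00\xfe\xff": "UCS-4 (big-endian)",
        b"\xff\xfe\x00\x00": "UCS-4 (little-endian)",
        b"\xef\xbb\xbf":     "UTF-8",
        b"\xfe\xff":         "UTF-16 (big-endian)",
        b"\xff\xfe":         "UTF-16 (little-endian)",
    }
    for bom, name in boms.items():
        if data.startswith(bom):
            return name
    # Without a BOM: look for the bytes of "<?xml" in each candidate encoding.
    sigs = {
        b"\x00\x00\x00\x3c": "UCS-4 (big-endian)",
        b"\x3c\x00\x00\x00": "UCS-4 (little-endian)",
        b"\x00\x3c\x00\x3f": "UTF-16 (big-endian)",
        b"\x3c\x00\x3f\x00": "UTF-16 (little-endian)",
        b"\x3c\x3f\x78\x6d": "UTF-8 or other ASCII-family encoding",
        b"\x4c\x6f\xa7\x94": "EBCDIC family",
    }
    for sig, name in sigs.items():
        if data.startswith(sig):
            return name
    return "UTF-8 (default, no declaration)"
```

The point of the paragraph above is visible in the second table: the test only works because no registered encoding moves the bytes 0x3C 0x3F 0x78 0x6D ("<?xml") to mean something else.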

The only encoding issues I am aware of that come remotely close
to causing any funnies are UTF-5 (which might fail to be detected
but would not be incorrectly diagnosed) and the Japanese variant
character sets (e.g. those that put the Yen sign in the position
of backslash), which just need to be labelled correctly.

But no encodings are known in which the byte sequence of a
correct declaration for one encoding is identical to the byte
sequence of the declaration for another encoding. 

So there should be no guesswork, because any logically possible
shadowing encodings would be excluded.  And even then,
there would be no guesswork, because excluding mythical
things from consideration does not make a thing guesswork.
It is possible that there is a race of giants somewhere,
but none are known.  

Cheers
Rick Jelliffe

Received on Friday, 28 March 2003 00:57:52 UTC