
Re: Reads like ASCII (was Re: character sets ...)



>1) the first thing in the document before any non-ISO 646 characters is a
>PI with only ISO 646 characters that can say the encoding (if it is exotic
>or warranted). E.g.:
>	<?XML EUC-JP>
>2) the encoding used for the input stream must have ISO 646 characters
>in the same code numbers as ISO 646.  

This is a hack, and doesn't help with *initial* parsing of the
document.

Autodetection also fails very quickly when faced with a number of
multibyte encodings.
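
To make the ambiguity concrete, here is a minimal Python sketch. It
assumes Python's built-in codecs as stand-ins for a parser's decoders;
the candidate list and the helper name plausible_encodings are
hypothetical, chosen only for illustration. It encodes two hiragana
characters as EUC-JP and then reports every candidate encoding under
which the same bytes decode without error.

```python
# Candidate encodings a naive autodetector might try (illustrative list only).
CANDIDATES = ["euc_jp", "shift_jis", "euc_kr", "latin_1"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return every candidate under which `data` decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Two hiragana characters (U+3042, U+3044), encoded as EUC-JP.
sample = "\u3042\u3044".encode("euc_jp")

matches = plausible_encodings(sample)
for name in matches:
    # Each successful decode yields a different, equally "valid" string.
    print(name, ascii(sample.decode(name)))
```

More than one candidate accepts the bytes, and each produces different
text, so byte validity alone cannot tell a parser which reading was
intended.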

The only *correct* way to indicate the encoding (or BCTF) of a
document is to do so external to the document. To me, this means FSIs
*or* MIME labelling (the *.mim file format).

So far, all of the proposals I have seen could be easily handled by the
*.mim file format, in which case no parser trickery is needed: the
storage manager would always unambiguously know what the encoding is.
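
A rough sketch of what external labelling buys, in Python. The layout
of a *.mim file is not spelled out here, so this assumes the label is
simply a block of MIME headers held alongside the entity; the helper
names charset_from_label and read_entity, the us-ascii fallback, and
the text/sgml media type are assumptions for illustration. The
charset is read from the label with the standard email package, and
the entity bytes are decoded without ever being inspected.

```python
from email import message_from_string


def charset_from_label(label_text, default="us-ascii"):
    """Read the charset parameter from MIME headers held outside the document."""
    msg = message_from_string(label_text)
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def read_entity(raw_bytes, label_text):
    """Decode the entity using only the external label, never the content."""
    return raw_bytes.decode(charset_from_label(label_text))


# Hypothetical label content; the real *.mim layout may differ.
label = "Content-Type: text/sgml; charset=EUC-JP\n\n"
raw = "\u3042\u3044".encode("euc_jp")

text = read_entity(raw, label)
print(charset_from_label(label), ascii(text))
```

Because the encoding is known before the first byte of the document is
read, the parser itself needs no detection logic at all.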


