Re: Reads like ASCII (was Re: character sets ...) from Gavin Nicol on 1996-09-16 (w3c-sgml-wg@w3.org from September 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Mon, 16 Sep 1996 15:02:03 GMT
To: ricko@allette.com.au
CC: tbray@textuality.com, w3c-sgml-wg@w3.org
Message-Id: <199609161502.PAA12652@wiley.EBT.COM>

>1) the first thing in the document before any non-ISO 646 characters is a
>PI with only ISO 646 characters that can say the encoding (if it is exotic
>or warranted). E.g.:
>	<?XML EUC-JP>
>2) the encoding used for the input stream must have ISO 646 characters
>in the same code numbers as ISO 646.  

This is a hack, and doesn't help with *initial* parsing of the
document.

Autodetection also fails very quickly when faced with a number of
multibyte encodings.
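
One way to see the ambiguity, as a minimal sketch in modern Python (which
postdates this message): the same bytes can decode cleanly under several
encodings, so "it decodes without error" tells a detector very little. The
function name `plausible_decodings` and the candidate list are illustrative
assumptions, not part of any proposal discussed here.

```python
def plausible_decodings(data, candidates=("euc_jp", "shift_jis", "latin_1")):
    """Return {encoding: text} for every candidate that decodes data without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # This candidate is ruled out; the others remain indistinguishable.
            continue
    return results


# Four bytes that are the EUC-JP encoding of two hiragana characters.
sample = b"\xa4\xa2\xa4\xa4"
for encoding, text in plausible_decodings(sample).items():
    print(encoding, [hex(ord(ch)) for ch in text])
```

The sample bytes are EUC-JP for two hiragana, yet Shift_JIS reads them as
four half-width punctuation characters and Latin-1 as four symbols. Nothing
in the byte stream itself rules any of these out.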

The only *correct* way to indicate the encoding (or BCTF) of a
document is to do so external to the document. To me, this means FSIs
*or* MIME labelling (the *.mim file format).

So far, all of the proposals I have seen could easily be handled by the
*.mim file format, in which case no parser trickery is needed: the
storage manager would always know unambiguously what the encoding is.
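
A rough sketch of that division of labour, in modern Python: the storage
layer reads an external MIME-style label and hands already-decoded text to
the parser. The layout of a *.mim file is not spelled out in this message,
so the format assumed here (an ASCII header block, a blank line, then the
entity bytes) and the name `read_labelled_entity` are hypothetical.

```python
from email.parser import HeaderParser


def read_labelled_entity(raw, default="us-ascii"):
    """Split an ASCII header block from the body and decode the body as labelled.

    Assumed layout (hypothetical): MIME-style headers, one blank line, then the
    raw entity bytes in the encoding named by the charset parameter.
    """
    head, sep, body = raw.partition(b"\n\n")
    if not sep:
        raise ValueError("no header block found")
    headers = HeaderParser().parsestr(head.decode("ascii"))
    # get_content_charset() lower-cases the charset parameter of Content-Type.
    charset = headers.get_content_charset(default)
    return charset, body.decode(charset)


labelled = b"Content-Type: text/plain; charset=EUC-JP\n\n" + b"\xa4\xa2\xa4\xa4"
charset, text = read_labelled_entity(labelled)
print(charset, len(text))
```

With the label outside the entity, the parser never sees bytes at all, so
no in-document declaration or guessing is involved.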

Received on Monday, 16 September 1996 11:03:42 UTC