Encoding PIs recap. (Re: Comments on 31 March spec) from Rick Jelliffe on 1997-04-10 (w3c-sgml-wg@w3.org from April 1997)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Thu, 10 Apr 1997 20:46:42 +1000
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-ID: <334CC512.3E60@allette.com.au>

Michael Sperberg-McQueen wrote:
> 
> On Thu, 3 Apr 1997 08:25:59 -0500 Martin Bryan said:

> >For 4.3.3 shouldn't a statement be added that EncodingPI must be encoded
> >in UTF-8 0r be proceeded by #xFEFF if encoded in UCS-2? (Allowing it to
> >be encoded in any other way would give interoperability problems.)
>...
> The EncodingPI itself is *not* required to be encoded in UTF-8; that
> suggestion was made last fall, and failed to generate consensus.
> The EncodingPI is written in the same encoding as the rest of the file,
> because one of the main advantages of in-file headers of this sort
> is that they can be maintained directly by users, without reliance on
> anything more elaborate than an editor that understands the encoding
> in use in the file.  

It might be a good idea to put in the spec that the PI should only
contain characters that have ISO 10646 encodings of < 128 
(i.e. the ISO 646 characters, ASCII, as used in MIME headers and
SGML minimum literals).  I would ask the editorial board to consider it.

The PI rationale has been discussed before: if you require a 
particular encoding for the PI, you are creating a non-text header.
You need special applications. Gavin's .MIM format would be far 
superior than the Encoding PI for this.  (I hope that people 
writing storage/entity managers and file systems look at the 
.MIM format.)

The Encoding PI, if limited to ISO 646 repertoire, is editable in
a local vanilla text editor, as Michael has said. And (because all(?) 
national/WWW character sets have the ISO 646 characters in their ISO 646 
code positions) its Encoding PI could even be made out using a text 
editor with a different character set (or at worst by any utility that
can 
do an ASCII dump of a file, e.g. UNIX od utility, or XTREE).
The ASCII characters would typically either be consecutive bytes 
(e.g. ISO 8859-n, shift-JIS, EUC, UCS-8) or preceded by null bytes
(e.g. Unicode, EUC-fixed): the (dreaded) autodetection is trivial.

All applications that generate XML files should generate the Encoding
PI, especially if the character set is not ISO 646, ISO 8859-1 or 
ISO 10646 BMP, or if in a nation with multiple character sets 
(e.g. Japan, China).  All XML applications that process XML files should 
strip the Encoding PI on import, and regenerate it appropriately on
export. 

If an XML file with one encoding or character set is edited by an 
editor with a different encoding or character set, (e.g. an ISO 8859-1
character set file is edited by an ISO 8859-2 editor) the encoding PI 
should be stripped (that is, the encoding attribute) unless the 
application has transcoded on import. In other words, 
every effort should be made to make sure that the Encoding PI is
accurate: 
no PI is better than a wrong one, IMHO. 

-Rick Jelliffe

Received on Thursday, 10 April 1997 06:41:20 UTC