re: Determination of Encoding from Gavin Nicol on 1997-06-23 (w3c-sgml-wg@w3.org from June 1997)

From: Gavin Nicol <gtn@eps.inso.com>
Date: Mon, 23 Jun 1997 18:09:06 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <199706232209.SAA01696@nathaniel.eps.inso.com>

Michael Sperberg-McQueen:
>On Mon, 23 Jun 1997 14:13:44 -0400 (EDT) Gavin Nicol said:
>[quoting me]:
>>>The best that can be hoped for is to have some chance at noticing that
>>>there is a discrepancy -- particularly important given the frequency
>>>with which transcoders garble the data (at least ASCII/EBCDIC
>>>transcoders do -- perhaps the transcoders for CJK character encodings
>>>work flawlessly all the time).
>>>
>>>To do that, you need to have the PI retained.
>>
>>Most receiving systems will be able to parse the PI and detect the
>>difference, sure. The problem is that the transcoding *server* cannot
>>stop them from getting false negatives unless it rewrites the PI. The
>>probability of HTTP being changed to require this for XML is
>>vanishingly small. I believe it to also be vanishingly small for any
>>MIME based protocol (including email).
>
>As has been pointed out (by me, last fall), even rewriting the MIME
>headers is not always performed correctly -- especially in email.
>My incoming email, on this EBCDIC machine, is full of MIME-encoded mail
>claiming, in its MIME headers, to be in ASCII.

Are those systems conformant MIME implementations? Bad software is bad
software, and should be fixed.

>Under these circumstances, I don't see the point in *requiring* any
>processor to prefer the internal to the external label, or vice versa.
>Anyone who argues that one of these will always be right in cases of
>conflicts must be living in a world rather unlike mine.

>>Taking the failure cases and making them canonical doesn't remove the
>>problem: it just increases the number of failures.
>
>This is probably an argument against the proposal recently mooted, to
>declare inconsistency between the internal and external labels a
>well-formedness error.  I'd be inclined to continue to define it as
>an error, however -- but it should be an error from which it's possible
>to recover.

Making it a well-formedness error would result in people ignoring the
spec, I believe.
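
To make the "error, but recoverable" option concrete, here is a minimal
sketch in Python. The function name compare_labels and its return shape
are hypothetical, not taken from any draft. It reports a disagreement
between the external (MIME charset) label and the internal (PI) label
without preferring either, and leaves the recovery to the caller.

```python
from typing import List, Optional, Tuple


def compare_labels(external: Optional[str],
                   internal: Optional[str]) -> Tuple[List[str], bool]:
    """Return (candidate encodings, conflict flag).

    Hypothetical helper: charset names are compared case-insensitively,
    but aliases (e.g. two names for the same encoding) are NOT resolved.
    """
    labels: List[str] = []
    for label in (external, internal):
        if label:
            name = label.strip().lower()
            if name not in labels:
                labels.append(name)
    # Two distinct labels is an error, but a recoverable one: the caller
    # still receives both candidates and may try each in turn.
    return labels, len(labels) > 1
```

A receiver might try each candidate in order and keep the first that
decodes cleanly, logging the conflict rather than rejecting the entity.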

How about removing this section of the specification:

   An entity which begins with neither a Byte Order Mark nor an
   encoding declaration must be in the UTF-8 encoding. 

which effectively *requires* transcoding proxies to rewrite the PI or
use UTF-8 (neither of which is likely to happen). Without this, at
least there is some appearance of equality.
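
To illustrate what that rule asks of a receiver, here is a rough sketch
in Python. The helper name guess_entity_encoding, the regular
expression, and the exact spelling of the declaration are my
assumptions, not text from the draft; the declaration is matched
case-insensitively and only when it is readable as ASCII.

```python
import re

# Assumed shape of the encoding declaration; illustrative only.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
    re.IGNORECASE,
)


def guess_entity_encoding(data: bytes) -> str:
    """Guess an entity's encoding from its first bytes (hypothetical helper)."""
    # A Byte Order Mark (either byte order) signals UTF-16.
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        return "UTF-16"
    # An encoding declaration names the encoding directly.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Neither present: under the quoted rule the entity must be UTF-8.
    return "UTF-8"
```

An entity that a proxy has transcoded to, say, Shift_JIS without adding
or rewriting the declaration still falls through to the last branch, so
the receiver would treat it as UTF-8.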

I also dislike the specific recommendation of any value for the
encoding specification, and especially those related to Japanese. Can
we do without them, or simply state that they must be one of the IANA
names?

Received on Monday, 23 June 1997 18:09:47 UTC