Re: B.1 and B.2 results from Michael Sperberg-McQueen on 1996-10-22 (w3c-sgml-wg@w3.org from October 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Tue, 22 Oct 96 17:28:21 CDT
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199610222300.TAA07114@www10.w3.org>

On Tue, 22 Oct 1996 17:57:14 -0400 Gavin Nicol said:
>>You cannot recognize the PI, _without having a list of the magic
>>numbers for legal PI definitions_. If a user attempts to use a PI
>>that does not exactly match one of the "magic number formulas,"
>>then the processor may not even be able to recognize that a PI was
>>present. So the apparent _self-descriptive_ aspect of the data is
>>_not_ there.

>Thank you David. This is a point I have felt, but been unable to
>articulate.

I'm not sure what David means by 'magic numbers' here, but if he
means the IETF-defined values for the MIME charset field (or for the
XML encoding attribute), I don't think this is true at all.

Any XML processor will know what character sets (by which, for now, I
mean 'coded character sets and/or encodings thereof') it can handle.
When it encounters one it doesn't handle, I believe it's likely to
fall into a case like the following:

A.  The processor accepts ISO 8859, UTF-8, and UCS-2.  It gets a
Shift-JIS entity, and says "Sorry; this entity is in a character
encoding called 'Shift-JIS' which I don't handle."  It was able to
read and parse the PI, because in Shift-JIS all the characters in
<?XML encoding='Shift-JIS' ?> are bit-identical to ISO 8859-*.

B.  The processor accepts EBCDIC, UTF-8, and UCS-2.  It gets a
Shift-JIS entity, and says "Sorry; this entity is in a character
encoding called 'Shift-JIS' which I don't handle."  It was able to
read and parse the PI, because in Shift-JIS all the characters in
<?XML encoding='Shift-JIS' ?> are bit-identical to UTF-8.

C.  The processor accepts Shift-JIS, UTF-8, and UCS-2.  It gets an
EBCDIC entity, and says "Sorry; this entity is in a character
encoding which I don't handle.  (There is also a chance that the
entity has been trashed, or isn't in XML.)"  The salient fact about
the entity, which is that it's in an unknown character set or
otherwise unprocessable, can be reliably detected, although the
EBCDIC-encoded string 'ebcdic-cp37' cannot be deciphered.  N.B.
David is right to point out that labels can only be read by those
capable of reading them.  This is clearly a drawback, compared with a
system in which they are always readable, even by those not capable
of reading them.  But the key fact here seems to me very simple, and
accurately conveyed:  this-entity-not-readable.
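
Cases A through C can be checked with a minimal Python sketch. It is
an illustration only: the codec names ('shift_jis', 'latin-1',
'utf-8', 'cp037') are Python's, and I am assuming that 'latin-1'
stands in for ISO 8859-* and 'cp037' for the EBCDIC encoding labelled
'ebcdic-cp37' above.

```python
# The declaration from cases A and B, as a Python string.
PI = "<?XML encoding='Shift-JIS' ?>"

shift_jis = PI.encode("shift_jis")
latin1 = PI.encode("latin-1")   # assumed stand-in for ISO 8859-*
utf8 = PI.encode("utf-8")
ebcdic = PI.encode("cp037")     # assumed stand-in for 'ebcdic-cp37'

# Cases A and B: every character of the PI has the same byte value in
# Shift-JIS, ISO 8859-1 and UTF-8, so a processor that reads any one
# of these can read the label.
same_bytes = shift_jis == latin1 == utf8

# Case C: the EBCDIC bytes differ, so a processor without EBCDIC
# support cannot read the label, but it can see that the entity does
# not begin with a recognizable '<?'.
ebcdic_differs = ebcdic != utf8
recognizable_start = ebcdic.startswith(b"<?")

print(same_bytes, ebcdic_differs, recognizable_start)
```

The point of the sketch is only that "not readable" is itself
detectable from the first few bytes, whether or not the label can be
deciphered.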

Quick quiz:  out of the members of the WG currently reading this
(both of you!), how many might be able to tell their browser how to
take corrective action if they knew the unreadable material was in
something called 'ebcdic-cp37'?  How about 'JOHAB'?


>>This is true only for all the character sets that _we precode into
>>XML_. It does not work for any new character set names. The PI looks
>>like it has a parameter, but in fact the PI, and its parameter,
>>constitute a magic string of bytes with no internal structure. This
>>is a bit counterintuitive.

I hope the examples above make clear why I think the limits on a
processor's ability to identify the name of the encoding in use are a
function NOT of the character set names precoded into XML, but of (a)
the families of character sets the processor recognizes and (b) the
family of character sets to which the particular entity in question
actually belongs.
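
A rough Python sketch of that claim follows. Everything in it is
hypothetical: the function read_label, the FAMILY_SIGNATURES table,
and the choice of Python codecs ('ascii', 'utf-16-be' standing in for
big-endian UCS-2, 'cp037' for EBCDIC) are my assumptions, not
anything defined by XML. It shows that a label is readable whenever
the processor recognizes the entity's family, whether or not the
label's name was known in advance.

```python
import re

# Hypothetical table: how '<?' looks in each family, and a codec that
# can decode the declaration for that family.
FAMILY_SIGNATURES = {
    "ascii-compatible": (b"<?", "ascii"),  # ISO 8859-*, UTF-8, Shift-JIS, ...
    "ucs-2-big-endian": (b"\x00<\x00?", "utf-16-be"),
    "ebcdic": (b"\x4c\x6f", "cp037"),
}


def read_label(entity, families):
    """Return the declared encoding name, or None if the entity does
    not belong to any family in `families` (or declares no encoding)."""
    for family in families:
        signature, codec = FAMILY_SIGNATURES[family]
        if entity.startswith(signature):
            # Decode only the start of the entity; bytes outside the
            # codec's range are replaced rather than raising an error.
            head = entity[:100].decode(codec, errors="replace")
            match = re.search(r"encoding='([^']+)'", head)
            return match.group(1) if match else None
    return None


# A name nobody precoded is still readable in a recognized family.
johab = "<?XML encoding='JOHAB' ?>".encode("ascii")
label_a = read_label(johab, ["ascii-compatible"])

# An EBCDIC entity is unreadable to a processor without that family...
ebcdic = "<?XML encoding='ebcdic-cp37' ?>".encode("cp037")
label_b = read_label(ebcdic, ["ascii-compatible", "ucs-2-big-endian"])

# ...and readable to one that has it.
label_c = read_label(ebcdic, ["ebcdic"])
```

Here label_a is 'JOHAB', label_b is None, and label_c is
'ebcdic-cp37': the limit is set by the families passed in, not by any
list of names.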

>As is explaining to people that you can do:
>
>   <?XML-ENCODING "SHIFT-JIS">
>   .....
>
>but not
>
>  <?XML-ENCODING "SHIFT-JIS">
>  ....
>  <?XML-ENCODING "UCS2">
>  ....

Well, I may be excessively idealistic, but I had thought "you can't
change character encodings in the middle of a file" would do it for
most readers, with an occasional "Because the software can't handle it"
for the insistent few.

For those of us with jaded stylistic palates and too many technical
standards under our belts, it might be necessary to have a footnote
saying something like "That is, code extension functions for the ISO
2022 code extension techniques (such as designation escape sequence,
single shift and locking shift), and character-encoding labeling
functions as defined above, may not be used within the body of XML
entities."

>>I do not advocate losing the notion. But if it gets intolerable enough,
>>maybe we can do the right thing after all!

Judging by the response of the WG as a whole, they have already
decided the 'right thing' involves installing bozo filters with our
names on them.  We haven't had a new argument in this discussion for
some time, you're not persuading me, I'm not persuading you, and no
one in their right mind is listening.  Perhaps we should call it a
thread and stop.

Michael
Received on Tuesday, 22 October 1996 19:00:32 UTC