Re: B.1 and B.2 results from Gavin Nicol on 1996-10-22 (w3c-sgml-wg@w3.org from October 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Tue, 22 Oct 1996 17:34:59 -0400
To: U35395@UICVM.UIC.EDU
CC: w3c-sgml-wg@w3.org
Message-Id: <199610222134.RAA02832@nathaniel.ebt>

>The biggest drawback I see, however, is that defining XML entities as
>beginning with a MIME header means that no existing SGML parser can
>be used as is on XML documents.  Every parser will require either a
>prosthetic filter to strip the MIME header off, or a modification to
>make it understand and handle the MIME header as a packaging device.
>Every one.
>
>That, for me, is a show-stopper.

This depends on whether this will be a generally useful (i.e. widely
used) feature in the future.

>I also think it needs to be in a form that users can produce
>using their normal tools, without jumping through hoops; that seems
>to mean it needs to be in the same character set it's declaring.

Coded character set *and* encoding.

>Gavin, and now David, have repeatedly claimed that the PI label
>relies on a vicious circle:  you have to know what it says to read
>it. 

It's true. You have to sniff at the data, and the sniffing may not
always succeed. That's reason #1 for calling it a hack. A more
objectionable one is that you will *require* people to add to
their data (a header pretending to be data). This may seem pedantic,
but I find this *semantically* objectionable, or counterintuitive.

>Gavin and David have pointed out, correctly, that it is possible to
>construct a coded character set for which the PI label is not
>unambiguous.  This would involve an encoding for which some, but not
>all, of the characters A to Z and a to z would share positions with
>ASCII or EBCDIC or ISO 10646, while the rest would be rearranged so
>as to render it possible to misread an XML character-encoding
>declaration without detecting the misreading.  This strikes me as a
>low-probability development, given the importance of ASCII (er,
>I mean ISO 646!), but it is indubitably possible.

You miss one important case: the case where there is no ASCII
compatibility area in the lower 127 code points. This will also fail
in that you will be unable to parse it. I forget what exactly they
are, but there *are* such encodings in existence (Rick, do you
remember if JOHAB is one?)
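
To make the failure concrete, here is a minimal Python sketch. EBCDIC
(Python's cp037 codec) only stands in for "an encoding with no
ASCII-compatible low range"; the label text "<?XML" and the helper name
are assumptions for illustration, not anything taken from the draft.

```python
# Hypothetical label text; the exact PI syntax is an assumption here.
LABEL = "<?XML"


def label_bytes(encoding: str) -> bytes:
    """Return the bytes the label occupies when stored in `encoding`."""
    return LABEL.encode(encoding)


ascii_form = label_bytes("ascii")
ebcdic_form = label_bytes("cp037")  # stand-in for a non-ASCII-compatible encoding

# A document stored entirely in the stand-in encoding.
ebcdic_doc = ebcdic_form + " encoding label goes here".encode("cp037")

# A parser that scans only for the ASCII byte pattern never finds the label,
# so it cannot read the declaration that would have named the encoding.
found = ascii_form in ebcdic_doc
```

A parser can special-case the encodings it knows about, but for one it
has never seen there is no byte pattern to look for.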

Another case (also of low probability) is having a file that is
encoded in a manner that might confuse the sniffing logic (e.g. a
compressed file whose header looks like the signature for UCS-2).
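
A rough Python sketch of what such sniffing might look like, and how it
can be fooled. The signature table, the label text "<?XML", and the
function name are all hypothetical; this is not a list from any
specification, just enough to show the shape of the problem.

```python
from typing import Optional

# Byte prefixes a sniffer might test, in order. Illustrative assumptions only.
SIGNATURES = [
    (b"\xfe\xff", "UCS-2 big-endian signature"),
    (b"\xff\xfe", "UCS-2 little-endian signature"),
    ("<?XML".encode("ascii"), "ASCII-compatible label"),
    ("<?XML".encode("cp037"), "EBCDIC label"),
]


def sniff(data: bytes) -> Optional[str]:
    """Guess the encoding family from the leading bytes, or None if unknown."""
    for prefix, family in SIGNATURES:
        if data.startswith(prefix):
            return family
    return None


# Ordinary case: the label is readable as ASCII.
labelled = sniff(b"<?XML ...")

# False positive: arbitrary binary data that happens to begin with FF FE
# is reported as UCS-2 even though it is not text at all.
binary_blob = b"\xff\xfe\x8b\x08\x00\x13\x37"
misread = sniff(binary_blob)

# Failure: nothing in the table matches, so the sniffer cannot proceed.
unknown = sniff(b"\x01\x02\x03\x04")
```

Adding more signatures narrows the gap but never closes it, since the
sniffer is still guessing from the data itself.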

>Losing the entire notion of in-file labels would (a) expose XML
>processors to undetectable errors when external metadata is faulty or
>missing, (b) allow the use of arbitrary character encodings
>(implementor is responsible for getting it right, it's not our
>problem), (c) allow us to end this discussion before it crosses the
>boundary from the laughable to the intolerable.

If you replace "in-file" with "in-data", this is my preferred
method. Meta-data should live *beside* the data, not *inside* it.
Let a header be a header.

A PI by any other name would parse as well...
Received on Tuesday, 22 October 1996 17:36:38 UTC