Re: B.1 and B.2 results from Michael Sperberg-McQueen on 1996-10-22 (w3c-sgml-wg@w3.org from October 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Tue, 22 Oct 96 11:49:51 CDT
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199610221738.NAA05291@www10.w3.org>

On Tue, 22 Oct 1996 00:18:54 -0400 David G. Durand said:
>PS does anyone else have an opinion on this?

It seems to me that David and Lee have put the case for internal
headers fairly well.  And as they and Gavin have pointed out, there's
a strong case for MIME compatibility; at the very least, XML should
not require external metadata to be ignored.  (The decision as
announced never did require that, but perhaps the spec should say
explicitly that it's *allowed*, or even *recommended*, to use
standard metadata channels like MIME where applicable, in order to
avoid a repetition of the explosive misunderstandings demonstrated on
this list.)

The proposal for in-file MIME headers strikes me as having all the
problems David suggested, most significantly the incompatibility
between the ASCII of the MIME header and the character set which may
be used in the rest of the file.  This is not an issue for some coded
character sets, but certainly is for others, including the canonical
forms of ISO 10646.  I cannot conveniently, in a UCS-2 editor,
generate a file part in ISO 646 and part in UCS-2.
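
The same mismatch can be seen from the reading side.  What follows is
only a minimal sketch in Python, not part of any proposal: the
header text, the sample document, and the function names are all
hypothetical, and Python's utf-16-be codec stands in for UCS-2 (the
two agree for the characters used here).  It shows that no single
decoding pass recovers both the ISO 646 header and the UCS-2 body;
the reader has to know where to switch.

```python
# Hypothetical in-file header (ISO 646 / ASCII) followed by a UCS-2 body.
HEADER = "Content-Type: text/plain; charset=UCS-2\n\n".encode("ascii")
BODY_TEXT = "<doc>caf\u00e9</doc>"
BODY = BODY_TEXT.encode("utf-16-be")  # stand-in for big-endian UCS-2
MIXED = HEADER + BODY


def recovers_everything(data: bytes, codec: str) -> bool:
    """True if one decoding pass yields both the header name and the start tag."""
    text = data.decode(codec, errors="replace")
    return "Content-Type" in text and "<doc>" in text


def read_in_two_parts(data: bytes, header_len: int) -> tuple[str, str]:
    """Decode the header and body separately, switching codecs at the boundary."""
    return (
        data[:header_len].decode("ascii"),
        data[header_len:].decode("utf-16-be"),
    )


print(recovers_everything(MIXED, "ascii"))      # header is readable, body is not
print(recovers_everything(MIXED, "utf-16-be"))  # header turns into unrelated characters
print(read_in_two_parts(MIXED, len(HEADER))[1])
```

An editor that works in one encoding at a time faces the same
boundary when writing such a file.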

The biggest drawback I see, however, is that defining XML entities as
beginning with a MIME header means that no existing SGML parser can
be used as is on XML documents.  Every parser will require either a
prosthetic filter to strip the MIME header off, or a modification to
make it understand and handle the MIME header as a packaging device.
Every one.
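
For concreteness, a prosthetic filter of the kind meant here might
look roughly like the following Python sketch.  The function name
and the sample header are hypothetical, and it assumes the header is
an RFC 822-style block of "Name: value" lines ended by the first
blank line.  It is small, but every existing parser would need it
(or the equivalent change) in front of it.

```python
import re

# An RFC 822 field name: printable ASCII other than space and colon,
# followed by a colon.
HEADER_LINE = re.compile(rb"^[!-9;-~]+:")


def strip_mime_header(data: bytes) -> bytes:
    """Return data with a leading RFC 822-style header block removed.

    Input that does not begin with a header field is returned unchanged.
    """
    if not HEADER_LINE.match(data):
        return data
    # The header ends at the first blank line, with either line-end convention.
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = data.find(sep)
        if pos != -1:
            candidates.append((pos, pos + len(sep)))
    if not candidates:
        return data  # no blank line found: leave the input alone
    _, body_start = min(candidates)
    return data[body_start:]


sample = b"Content-Type: text/plain; charset=ISO-8859-1\r\n\r\n<doc>hi</doc>"
print(strip_mime_header(sample))
```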

That, for me, is a show-stopper.

If there is an in-file header, I think it needs to be in a format
SGML processors can now handle; hence the idea of using PI syntax for
it.  I also think it needs to be in a form that users can produce
using their normal tools, without jumping through hoops; that seems
to mean it needs to be in the same character set it's declaring.

The main arguments against the PI format appear to be (a) that, in
Gavin's words, "it is a hack", which I take to mean, in neutral
terms, that Gavin does not approve of it, and (b) that it cannot
be read successfully without external knowledge.  Against the first
argument, no rejoinder is possible.  Against the second, it may be
pointed out that the claim is false.

Gavin, and now David, have repeatedly claimed that the PI label
relies on a vicious circle:  you have to know what it says to read
it.  When I first described the PI-form internal label, I took
pedantic care to show that this is not true:  the PI label is
unambiguous for a variety of existing coded character sets (including
all the ones people had suggested for XML use, plus a few more
including EBCDIC).

Gavin and David have pointed out, correctly, that it is possible to
construct a coded character set for which the PI label is not
unambiguous.  This would involve an encoding for which some, but not
all, of the characters A to Z and a to z would share positions with
ASCII or EBCDIC or ISO 10646, while the rest would be rearranged so
as to render it possible to misread an XML character-encoding
declaration without detecting the misreading.  This strikes me as a
low-probability development, given the importance of ASCII (er,
I mean ISO 646!), but it is indubitably possible.

It seems to me that it's more useful to ask whether the internal PI
label will be ambiguous for any character set now in reasonably wide
use or likely to be developed by anyone not seeking specifically
to undermine the use of internal labels.

Gavin has suggested that it *is* ambiguous in this way, but has
not named any particular pair of encodings for which the PI label
does not work successfully.  When he first made this claim, I went
back to check the JIS X 0208, Shift-JIS, and EUC encodings, to see
whether they would work with the internal PI label, as well as the
ASCII, ISO 8859-*, EBCDIC, UCS-2, UCS-4, and UTF-8 encodings already
examined.  They do.  As Gavin has pointed out (in support of ASCII
MIME headers), *all* the major East Asian encodings will work, for
the same reason:  they all read and produce ISO 646 / ASCII text in
forms identical to ISO 646.


So it seems to me that in all foreseeable practical cases, an in-file
PI character set label is (a) parseable, (b) compatible with existing
SGML processors, and (c) not inherently incompatible with the use of
external metadata channels.  If the fact that it is not MIME is a
show-stopper for enough of us, then we can consider other
alternatives.

An in-file MIME header would be (a) parseable, (b) compatible with
external metadata, (c) incompatible with existing SGML processors,
and (d) in some cases hard or impossible to create using standard
text editing tools.

Losing the entire notion of in-file labels would (a) expose XML
processors to undetectable errors when external metadata is faulty or
missing, (b) allow the use of arbitrary character encodings
(the implementor is responsible for getting it right; it's not our
problem), and (c) allow us to end this discussion before it crosses the
boundary from the laughable to the intolerable.


-C. M. Sperberg-McQueen
Received on Tuesday, 22 October 1996 13:39:05 UTC