Re: B.1 and B.2 results from David G. Durand on 1996-10-22 (w3c-sgml-wg@w3.org from October 1996)

From: David G. Durand <dgd@cs.bu.edu>
Date: Tue, 22 Oct 1996 15:05:25 -0500
To: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <v02130501ae92cf1849c7@[128.148.157.46]>
At 11:49 10/22/96, Michael Sperberg-McQueen wrote:
>On Tue, 22 Oct 1996 00:18:54 -0400 David G. Durand said:
>>PS does anyone else have an opinion on this?
>The biggest drawback I see, however, is that defining XML entities as
>beginning with a MIME header means that no existing SGML parser can
>be used as is on XML documents.  Every parser will require either a
>prosthetic filter to strip the MIME header off, or a modification to
>make it understand and handle the MIME header as a packaging device.
>Every one.
>
>That, for me, is a show-stopper.

We need to remember that most of the individuals in the world are not using
SGML software, and that the processing required to strip the header is a
3-line Perl hack.
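
To give a sense of scale, here is a minimal sketch of such a filter, in
Python rather than Perl. It assumes the header follows the usual MIME
convention of ending at the first blank line; the function name
strip_mime_header is made up for illustration.

```python
import re


def strip_mime_header(entity: bytes) -> bytes:
    """Drop everything up to and including the first blank line."""
    # Accept either LF or CRLF line endings around the blank line.
    parts = re.split(rb"\r?\n\r?\n", entity, maxsplit=1)
    # With no blank line there is no header to strip; return the input as is.
    return parts[1] if len(parts) == 2 else entity
```

The output of the filter is what would be handed to an unmodified SGML
parser.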

Anyway, since SGML has the general notion of an entity manager, the notion
of an entity header on the storage object fits right into the SGML model.

>If there is an in-file header, I think it needs to be in a format
>SGML processors can now handle; hence the idea of using PI syntax for
>it.  I also think it needs to be in a form that users can produce
>using their normal tools, without jumping through hoops; that seems
>to mean it needs to be in the same character set it's declaring.

It needs to have, at least, the same encoded character length -- as I have
already argued.

>The main arguments against the PI format appear to be (a) that, in
>Gavin's words, "it is a hack", which I take to  mean, in neutral
>terms, that Gavin does not approve of it, and (b) that it cannot
>be read successfully without external knowledge.

A hack (among many other things) is something that is not dependable, or
that relies on tricky relationships between differing interpretations of
the same data or code. If you accept (b), as Gavin does, these two facts
alone imply (a). PIs are a hack that depends on epiphenomena of current
coding sets.

Here's a similar hack that I take as a canonical example of the "character
set hack" genre. You can change the case of letters pretty portably by
XOR-ing them with a space. This works in EBCDIC and ASCII.
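
A small sketch of the trick, using Python's "ascii" and "cp037" (one EBCDIC
code page) codecs as stand-ins for the two families. The helper name
toggle_case is hypothetical. It works only because the space character
happens to have the value of the bit that separates upper from lower case
in both codes (0x20 in ASCII, 0x40 in EBCDIC).

```python
def toggle_case(letter: str, codec: str) -> str:
    """Flip the case of one letter by XOR-ing its code with the space code."""
    (code,) = letter.encode(codec)   # single-byte code of the letter
    (space,) = " ".encode(codec)     # 0x20 in ASCII, 0x40 in cp037
    # Meaningful only for the letters A-Z and a-z; anything else is garbage.
    return bytes([code ^ space]).decode(codec)
```

Nothing in either standard promises this relationship, which is exactly
what makes it a hack.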

>Against the first argument, no rejoinder is possible.

It was so self-evidently a hack to me that I had trouble thinking how to
explain it. The partial attempt above shows at least two properties that I
deem undesirable, and contributory to its hack-nature.

>  Against the second, it may be pointed out that the claim is false.

The claim as you have repeated it is false. The claim that I have made is
not false (I can't speak for Gavin, but I suspect he agrees):

You cannot recognize the PI, _without having a list of the magic numbers
for legal PI definitions_. If a user attempts to use a PI that does not
exactly match one of the "magic number formulas," then the processor
may not even be able to recognize that a PI was present. So the apparent
_self-descriptive_ aspect of the data is _not_ there. I want internal
headers so I can tell what the data is -- if I can't dependably tell if there's
a new kind of header that I don't recognize, it's a much less useful
header. We should at least be able to have the equivalent of a tape
"standard label". Wasn't there a field in there to tell you if it was a
"weird" "ASCII" coded tape?

   Another of the factors that shows that the PI hack is a hack and not a
solution is that it _looks_ extensible, but extending it for a new encoding
will, in fact, break existing software so that it can't even use the header
to explain the problem.
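
The point can be sketched as a table-driven recognizer. Everything here is
an assumption made for illustration: the label is taken to begin with the
characters "<?XML", the list of known encodings is arbitrary, and the
function name sniff is made up.

```python
# Hypothetical start of the PI label; the exact syntax is an assumption.
LABEL_PREFIX = "<?XML"

# The encodings "wired into" this sketch of a parser (Python codec names).
KNOWN_CODECS = ["ascii", "cp037", "utf-16-be", "utf-16-le"]

# One magic byte string per known encoding.
SIGNATURES = {LABEL_PREFIX.encode(codec): codec for codec in KNOWN_CODECS}


def sniff(entity: bytes):
    """Return the codec whose magic bytes start the entity, or None."""
    for magic, codec in SIGNATURES.items():
        if entity.startswith(magic):
            return codec
    # Not in the table: the parser cannot even tell a label is present.
    return None
```

An entity in an encoding that is not in the table yields None, and the
label, which names the encoding perfectly well, is never read.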

>Gavin, and now David, have repeatedly claimed that the PI label
>relies on a vicious circle:  you have to know what it says to read
>it.  When I first described the PI-form internal label, I took
>pedantic care to show that this is not true:  the PI label is
>unambiguous for a variety of existing coded character sets (including
>all the ones people had suggested for XML use, plus a few more
>including EBCDIC).

This is true only for all the character sets that _we precode into XML_. It
does not work for any new character set names. The PI looks like it has a
parameter, but in fact the PI, and its parameter, constitute a magic string
of bytes with no internal structure. This is a bit counterintuitive.

>Gavin and David have pointed out, correctly, that it is possible to
>construct a coded character set for which the PI label is not
>unambiguous.  This would involve an encoding for which some, but not
>all, of the characters A to Z and a to z would share positions with
>ASCII or EBCDIC or ISO 10646, while the rest would be rearranged so
>as to render it possible to misread an XML character-encoding
>declaration without detecting the misreading.  This strikes me as a
>low-probability development, given the importance of ASCII (er,
>I mean ISO 646!), but it is indubitably possible.

This is part, but not all of the objection. See preceding.

>It seems to me that it's more useful to ask whether the internal PI
>label will be ambiguous for any character set now in reasonably wide
>use or likely to be developed by anyone not seeking specifically
>to undermine the use of internal labels.

It's a tempting item to devise, but _I_ would restrain myself.

>So it seems to me that in all foreseeable practical cases, an in-file
>PI character set label is (a) parseable, (b) compatible with existing
>SGML processors, and (c) not inherently incompatible with the use of
>external metadata channels.  If the fact that it is not MIME is a
>show-stopper for enough of us, then we can consider other
>alternatives.

And also:
    A reinvention of the wheel.
    Less flexible than MIME headers.
    Does not take advantage of the existing MIME header-parsing facilities
    already in every browser on the Web. (I guess this is a follow-on to
    reinventing the wheel. Now we'll need a new kind of axle to spin it on...)
    Is an unfamiliar syntax, compared to headers that everyone has been
    seeing on e-mail for the last 20 years (or whatever)...
    Also note that it is "(a) parseable" only if the character set is one
    of the ones wired into your parser.

>An in-file MIME header would be (a) parseable, (b) compatible with
>external metadata, (c) incompatible with existing SGML processors,
>and (d) in some cases hard or impossible to create using standard
>text editing tools.

(d) is not true. I explained how we could use the "multibyte-mode"
determination trick proposed for the PI to make the header readable on all
the existing systems. And the header would be enterable as a native sequence
of characters for any one- or two-byte character code. These characters
would be user-readable in the existing codings, though perhaps not in some
weird new one (its users are the only people for whom creating the header
might be "hard"). The header would be machine-readable regardless of encoding.
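
The earlier message describing the trick is not reproduced here, so what
follows is only one plausible reading of it, not the actual proposal. It
assumes the header characters all fall in the ASCII range and are stored at
the same width as the rest of the entity, so that the position of a zero
byte gives away two-byte code units and their byte order. The function name
read_label is hypothetical, and EBCDIC is not handled.

```python
from email.parser import HeaderParser


def read_label(entity: bytes):
    """Return the charset named in a leading MIME-style header, or None."""
    # Assumed width test: a zero byte among the first two bytes means
    # two-byte code units, and its position gives the byte order.
    if entity[:1] == b"\x00":
        text = entity.decode("utf-16-be", errors="replace")
    elif entity[1:2] == b"\x00":
        text = entity.decode("utf-16-le", errors="replace")
    else:
        text = entity.decode("latin-1")  # one byte per character, never fails
    # The header ends at the first blank line, as in e-mail.
    header = text.replace("\r\n", "\n").split("\n\n", 1)[0]
    return HeaderParser().parsestr(header).get_content_charset()
```

Once the width is known, the header itself is parsed by the standard
library's existing e-mail header parser, and an unfamiliar charset name
still comes back as a readable string.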

>Losing the entire notion of in-file labels would (a) expose XML
>processors to undetectable errors when external metadata is faulty or
>missing, (b) allow the use of arbitrary character encodings
>(implementor is responsible for getting it right, it's not our
>problem), (c) allow us to end this discussion before it crosses the
>boundary from the laughable to the intolerable.

I do not advocate losing the notion. But if it gets intolerable enough,
maybe we can do the right thing after all!

   -- David


RE delenda est.
I am not a number. I am an undefined character.
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
http://www.dynamicdiagrams.com/services_map_main.html
Received on Tuesday, 22 October 1996 15:05:39 UTC