- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Tue, 22 Oct 96 11:49:51 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
On Tue, 22 Oct 1996 00:18:54 -0400 David G. Durand said:
>PS does anyone else have an opinion on this?

It seems to me that David and Lee have put the case for internal headers fairly well. And as they and Gavin have pointed out, there's a strong case for MIME compatibility; at the very least, XML should not require external metadata to be ignored. (The decision as announced never did require that, but perhaps the spec should say explicitly that it's *allowed*, or even *recommended*, to use standard metadata channels like MIME where applicable, in order to avoid a repetition of the explosive misunderstandings demonstrated on this list.)

The proposal for in-file MIME headers strikes me as having all the problems David suggested, most significantly the incompatibility between the ASCII of the MIME header and the character set which may be used in the rest of the file. This is not an issue for some coded character sets, but certainly is for others, including the canonical forms of ISO 10646. I cannot conveniently, in a UCS-2 editor, generate a file part in ISO 646 and part in UCS-2.

The biggest drawback I see, however, is that defining XML entities as beginning with a MIME header means that no existing SGML parser can be used as is on XML documents. Every parser will require either a prosthetic filter to strip the MIME header off, or a modification to make it understand and handle the MIME header as a packaging device. Every one. That, for me, is a show-stopper.

If there is an in-file header, I think it needs to be in a format SGML processors can now handle; hence the idea of using PI syntax for it. I also think it needs to be in a form that users can produce using their normal tools, without jumping through hoops; that seems to mean it needs to be in the same character set it's declaring.

The main arguments against the PI format appear to be (a) that, in Gavin's words, "it is a hack", which I take to mean, in neutral terms, that Gavin does not approve of it, and (b) that it cannot be read successfully without external knowledge. Against the first argument, no rejoinder is possible. Against the second, it may be pointed out that the claim is false.

Gavin, and now David, have repeatedly claimed that the PI label relies on a vicious circle: you have to know what it says to read it. When I first described the PI-form internal label, I took pedantic care to show that this is not true: the PI label is unambiguous for a variety of existing coded character sets (including all the ones people had suggested for XML use, plus a few more including EBCDIC).

Gavin and David have pointed out, correctly, that it is possible to construct a coded character set for which the PI label is not unambiguous. This would involve an encoding for which some, but not all, of the characters A to Z and a to z would share positions with ASCII or EBCDIC or ISO 10646, while the rest would be rearranged so as to render it possible to misread an XML character-encoding declaration without detecting the misreading. This strikes me as a low-probability development, given the importance of ASCII (er, I mean ISO 646!), but it is indubitably possible.

It seems to me that it's more useful to ask whether the internal PI label will be ambiguous for any character set now in reasonably wide use or likely to be developed by anyone not seeking specifically to undermine the use of internal labels. Gavin has suggested that it *is* ambiguous in this way, but has not named any particular pair of encodings for which the PI label does not work successfully. When he first made this claim, I went back to check the JIS X 0208, Shift-JIS, and EUC encodings, to see whether they would work with the internal PI label, as well as the ASCII, ISO 8859-*, EBCDIC, UCS-2, UCS-4, and UTF-8 encodings already examined.
They do. As Gavin has pointed out (in support of ASCII MIME headers), *all* the major East Asian encodings will work, for the same reason: they all read and produce ISO 646 / ASCII text in forms identical to ISO 646.

So it seems to me that in all foreseeable practical cases, an in-file PI character set label is (a) parseable, (b) compatible with existing SGML processors, and (c) not inherently incompatible with the use of external metadata channels. If the fact that it is not MIME is a show-stopper for enough of us, then we can consider other alternatives.

An in-file MIME header would be (a) parseable, (b) compatible with external metadata, (c) incompatible with existing SGML processors, and (d) in some cases hard or impossible to create using standard text editing tools.

Losing the entire notion of in-file labels would (a) expose XML processors to undetectable errors when external metadata is faulty or missing, (b) allow the use of arbitrary character encodings (implementor is responsible for getting it right, it's not our problem), (c) allow us to end this discussion before it crosses the boundary from the laughable to the intolerable.

-C. M. Sperberg-McQueen
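
The claim that the PI label can be read without external knowledge can be illustrated with a small sketch. It assumes, purely for illustration, a label of the form `<?XML ENCODING="..."?>` at the very start of the entity; the function name `read_label`, the family names, and the choice of Python codecs (`utf-16-*` and `utf-32-*` standing in for UCS-2 and UCS-4, `cp037` standing in for EBCDIC, `latin-1` for anything that encodes ISO 646 characters as single ASCII bytes) are all assumptions of the sketch, not part of any proposal on this list. The idea is that the first bytes of `<?` identify the encoding family, which is enough to decode the label and read the exact encoding name.

```python
import re

# Leading bytes of "<?" (or of "<" alone for the 4-byte forms) per encoding family.
# Order matters: the 4-byte UCS-4 patterns are tested before the UCS-2 ones.
# Each entry: (byte prefix, family name, provisional codec used to read the label).
SIGNATURES = [
    (b"\x00\x00\x00\x3c", "ucs-4-be", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "ucs-4-le", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "ucs-2-be", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "ucs-2-le", "utf-16-le"),
    (b"\x3c\x3f", "ascii-compatible", "latin-1"),  # ASCII, ISO 8859-*, UTF-8, Shift-JIS, EUC
    (b"\x4c\x6f", "ebcdic", "cp037"),
]

# Hypothetical label syntax, assumed for this sketch only.
LABEL = re.compile(r'<\?XML\s+ENCODING\s*=\s*"([^"]+)"')


def read_label(data: bytes):
    """Return (family, declared_name); declared_name is None if no label is found."""
    for prefix, family, codec in SIGNATURES:
        if data.startswith(prefix):
            # Decode only the head of the entity with the provisional codec;
            # 256 bytes is a multiple of 4, so no code unit is split.
            head = data[:256].decode(codec, errors="replace")
            match = LABEL.search(head)
            return family, (match.group(1) if match else None)
    return "unknown", None
```

For example, `read_label('<?XML ENCODING="EUC-JP"?><doc/>'.encode("euc_jp"))` yields the `ascii-compatible` family and the declared name `EUC-JP`, because EUC encodes the ISO 646 characters of the label as plain ASCII bytes. An encoding constructed so that only some of the label's letters keep their usual positions would defeat this scheme, which is the low-probability case conceded above.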
Received on Tuesday, 22 October 1996 13:39:05 UTC