
Re: XML design errors?

From: C M Sperberg-McQueen <cmsmcq@uic.edu>
Date: Fri, 19 Jun 1998 17:55:34 -0500
Message-Id: <199806192255.RAA171732@tigger.cc.uic.edu>
To: Chris.Newman@innosoft.com
CC: connolly@w3.org, xml-editor@w3.org, cmsmcq@uic.edu
>Date: Fri, 19 Jun 1998 13:36:30 -0700 (PDT)
>From: Chris Newman <Chris.Newman@innosoft.com>
>> Can anyone give an example of an interoperability problem introduced
>> by the notion of processing instructions that could not occur without
>> them?
>
>Vendor A uses a PI which alters the processing of the document.  Vendor
>A's product generates documents relying on that PI.  Take documents from
>Vendor A to Vendor B (which ignores the PI), and the documents don't look
>the same.  Since both Vendor A and Vendor B are compliant, the result is
>a legal interoperability problem.

Sorry, but I don't follow.  In the case you describe, the products
from vendors A and B are not interoperating at all, successfully or
unsuccessfully; they are working, independently, on the same data.

Whenever two programs work on the same data, even if they are doing
'the same thing' (e.g. both are displaying the data), they may produce
different results; it is a fundamental assumption of W3C, as I 
understand it, that products should be allowed to differentiate
themselves by producing better results than their competition.
Dan Connolly can correct me if I am wrong in this.

In your example, the difference in results is not introduced by the
use of processing instructions; we can imagine very similar scenarios
with the same bottom line, in which processing instructions do not
appear at all.

  - Vendor A sells a browser that uses one algorithm for font fallbacks.
Vendor B uses another.  Take documents from one to the other;
the documents don't look the same.  
  - Vendor A sells a browser that understands the xml:lang attribute
and hyphenates the text accordingly, in order to get better justification
of lines.  Vendor B sells a monolingual program that assumes all text
is written in French.  Take documents from Vendor A to Vendor B; they
don't look the same.
  - Vendor A sells a browser with a built-in style sheet for HTML;
so does Vendor B.  Take documents from Vendor A to Vendor B; they 
don't look the same.

An information owner who wishes to ensure that all critical aspects of
document processing rely exclusively on element types, attribute
values, and position in the document tree can readily do so, and would
be wise to do so; the existence of processing instructions does not
bear on this fact.

>> >* "<![CDATA[" notation is cumbersome and creates new parser state and
>> >  alternate representations.
>> 
>> It's much less cumbersome than the alternative, which is to escape
>> each delimiter in the block individually.
>
>This is the same mistake which was made in html with the <PLAINTEXT>
>(I forget the exact name used) tag.  That tag was obsoleted in favor of
>the <pre> tag.  A similar mistake was made in an early draft of the
>text/enriched media type, but was corrected in the final version.
>
>It turns out it's easier and cleaner to have one parser state and quote
>characters appropriately than it is to have two parser states with
>different quoting conventions.  Especially if the second parser state is
>infrequently used, it causes no end of bugs, complications and problems.

I'm sorry, but this seems to assume development by programmers who
don't understand how to read formal grammars and don't test their code
well.  The experience of SGML users over the past decade or so has
been that CDATA marked sections are convenient in some applications; I
do not recall that CDATA sections have been any more bug-prone in SGML
systems than other parts of the spec; I can remember a number of
problems with various systems, but none of the bugs I've encountered
have ever involved CDATA sections.
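The tradeoff between the two quoting conventions is easy to see in a small example. The sketch below uses Python's standard-library parser purely for illustration; any conforming XML parser reports the same character data for both forms.

```python
import xml.etree.ElementTree as ET

# The same character data, once with each delimiter escaped
# individually, and once wrapped in a CDATA marked section.
escaped = '<code>if (a &lt; b &amp;&amp; c &gt; d) s = "&lt;tag&gt;";</code>'
cdata = '<code><![CDATA[if (a < b && c > d) s = "<tag>";]]></code>'

# A conforming parser yields identical character data for both forms;
# the CDATA section merely spares the author the per-delimiter escapes.
assert ET.fromstring(escaped).text == ET.fromstring(cdata).text
print(ET.fromstring(cdata).text)
```

The second parser state buys nothing in expressive power, but for data dense in markup delimiters it buys considerable convenience for the document author.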

There are a number of problems with the old HTML PLAINTEXT element, as
there are with the PRE element.  One of them is the implausible
assumption that the segment of the data stream for which special
parsing is required should always be coterminous with an SGML element.

>Because text/enriched went through a public review process in the IETF, 
>this problem was identified and eliminated before it was published.  Shame
>that XML lacked a similar public review process.

Drafts of the XML spec were available to the public from November 1996,
continuously through the date XML became a Recommendation.  Comments
were in fact received from the public, considered by the work group, 
and in some cases acted upon.

It's not your responsibility to know the details of XML's development
process, so I don't blame you for not knowing that.  But if you don't
know how XML was developed, then surely you ought to realize that you
don't know.  In which case, why are you making any claims at all about
XML's development process?

>> >* Version number text is broken -- likely to leave things stuck at
>> >  "1.0" just like MIME-Version.
>> 
>> How?   ...
>
>The XML spec says:
>   Processors may signal an error if they receive documents labeled with
>   versions they do not support.
>
>this is exactly what early MIME drafts said.  As soon as one company
>chose to check the version number and fail if it differed, the version
>number was effectively locked in, since any new version was automatically
>incompatible with some compliant implementations of the earlier version --
>even if it was only a minor revision.  To get the most out of the version
>number, you should have indicated that parsers must not signal an error if
>the minor version (the portion after the ".") mismatched.

Thank you for the clarification.  I respectfully submit that the key
ingredient in making the version number useful is the willingness to
allow old software to fail gracefully when it encounters version
numbers it was not written to handle.  If those responsible for MIME
chose not to allow this, they must have had other things on their
mind than making the version number useful.

The rule you suggest is a plausible one, and one I think some
developers are likely to use in deciding whether to try to parse a
document.  But including it in the specification would have involved
first a commitment to future version numbers of a particular form,
which the WG was not willing to make, and would have implied, second,
a commitment on the part of the W3C, or of the WG, to develop and
distribute future versions of XML.  The WG was not authorized, and the
W3C was not disposed, to make such a commitment.
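The rule under discussion (accept any document whose major version matches the one implemented, regardless of the minor revision after the dot) is simple enough to state in code. The sketch below is hypothetical, in Python, and assumes exactly the `major.minor` numbering scheme that the WG declined to commit to.

```python
import re

def accepts_version(declared: str, implemented_major: str = "1") -> bool:
    """Hypothetical check: agree to parse only documents whose major
    version matches ours, ignoring the minor revision after the dot."""
    m = re.fullmatch(r"(\d+)\.(\d+)", declared)
    return m is not None and m.group(1) == implemented_major

assert accepts_version("1.0")
assert accepts_version("1.1")      # minor revision: do not signal an error
assert not accepts_version("2.0")  # major revision: may signal an error
```

Note that this check presumes the dotted form; a spec mandating it would thereby constrain all future version labels, which is precisely the commitment described above.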

>> >* Reference to UCS-2 which doesn't really exist.
>> 
>> What does 'really exist' mean?  ...
>
>UCS-2 is a myth; as soon as a codepoint is assigned outside the BMP, there
>is no 16-bit character set.  I consider it a synonym for "UTF-16" which
>does exist and is the correct label.

>ISO definitions often don't match reality.

UCS-2 is not a synonym for UTF-16; the two encodings differ in some
crucial ways.  Please do not attempt to sell me any software which
relies on synonymy of this kind.
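The crucial difference is the surrogate mechanism: UTF-16 represents code points outside the BMP as a pair of 16-bit code units, while UCS-2 is a fixed-width 16-bit form with no such pairs. A short illustration in Python, using a code point that was later assigned outside the BMP:

```python
ch = "\U0001D11E"          # MUSICAL SYMBOL G CLEF, outside the BMP

# UTF-16 encodes it as a surrogate pair: two 16-bit code units.
utf16 = ch.encode("utf-16-be")
assert utf16 == b"\xd8\x34\xdd\x1e"

# A fixed-width UCS-2 reader would see two separate 16-bit values
# (the surrogates D834 and DD1E), not a single code point.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
assert units == [0xD834, 0xDD1E]
```

Software that treats the two labels as synonyms will mishandle exactly these cases.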

In character set matters, as in others, vendors often deviate from ISO
standards.  In character sets, the deviations of commercial vendors
from ISO uniformly make the ISO standards look better and better to me
as a user and as a computing professional responsible for supporting
end users.  If I have to make a choice between an ISO standard and a
specification issued by a commercial body with no commitment to open
processes, I will always choose the ISO standard, and so will any
users I have any influence on.  Fortunately, the alignment between
Unicode and ISO 10646 seems to be holding.

The XML spec cites both the Unicode specification and ISO 10646 as
equal authorities for character set issues.  This is the result of a
conscious decision reached after extensive discussion.  If you wish
the work group to remove the reference to ISO 10646 and to UCS-2, some
argument other than standard off-the-shelf sneers at international
standardization is going to be necessary.

>> >* Byte-order mark replicates TIFF problem.
>> 
>> Can someone explain this?
>
>TIFF files are permitted to be either big-endian or little-endian with a
>magic number at the beginning indicating which.  Sound familiar?
>
>Well look at what happened...  Some products supported both variations,
>some supported only one.  ...

>If you had just said "XML in UTF-16 is always stored and transmitted in
>network byte order (big-endian)", there would be no
>interoperability problems.  As it is, I predict exactly the same thing
>will happen to XML as happened to TIFF, for exactly the same reasons.

The Unicode specification already specifies fairly clearly what the
obligations of Unicode-supporting software are.  There is no reason
whatever for the XML work group to re-do the work of the Unicode
consortium or to second-guess its results in this question.  There is
a simple specification now; people should implement it.  Making XML
have rules different from Unicode rules for the BMP is a sure recipe
for real interoperability problems.
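The Unicode rule in question amounts to a two-byte check, sketched below in Python (the function name is illustrative): inspect the first two bytes for the byte-order mark U+FEFF, and in its absence assume big-endian (network byte order).

```python
def utf16_byte_order(data: bytes) -> str:
    # U+FEFF serialized big-endian is FE FF; the byte-swapped form
    # FF FE signals little-endian.  With no BOM, the Unicode
    # specification says to assume big-endian.
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "big-endian (no BOM; Unicode default)"

assert utf16_byte_order(b"\xfe\xff\x00<") == "big-endian"
assert utf16_byte_order(b"\xff\xfe<\x00") == "little-endian"
```

Since the specification already resolves the ambiguity, the TIFF comparison holds only for implementations that ignore it.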


-C. M. Sperberg-McQueen
 Senior Research Programmer, University of Illinois at Chicago
 Editor, ACH/ACL/ALLC Text Encoding Initiative
 Co-coordinator, Model Editions Partnership

 cmsmcq@uic.edu, tei@uic.edu
Received on Friday, 19 June 1998 18:56:55 GMT
