Re: XML design errors?

On Fri, 19 Jun 1998, C M Sperberg-McQueen wrote:
> >* Didn't define BNF notation.
> 
> Your copy is incomplete, then.

I'm looking at: <http://www.w3.org/TR/1998/REC-xml-19980210.html>

This was my mistake.  I missed the BNF definition section because it was
towards the end of the document, but not in an appendix.

> >* Processing Instructions introduce interoperability problems.  There
> >  is also no registry for PITargets.
> 
> I have always seen processing instructions as a fairly useful way
> to avoid and minimize interoperability problems:  using them the
> owner of the data can hide processing instructions intended for one
> family of processes from other processes -- or rather any process gets
> the hooks necessary to allow it to recognize and ignore instructions
> intended for other systems.
> 
> So the claim that PIs "introduce" interoperability problems takes me
> by surprise.  It seems flat wrong to me:  PIs allow us to deal with
> some of the interoperability problems that already exist.
> 
> Can anyone give an example of an interoperability problem introduced
> by the notion of processing instructions that could not occur without
> them?

Vendor A uses a PI which alters the processing of the document, and
Vendor A's product generates documents which rely on that PI.  Take
those documents from Vendor A to Vendor B (whose product ignores the
PI), and they don't look the same.  Since both Vendor A and Vendor B
are conformant, the result is an interoperability problem fully
sanctioned by the spec.
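For instance (the PI target here is made up for illustration):

   <?xml version="1.0"?>
   <?vendor-a-hyphenate mode="aggressive"?>
   <doc>A long paragraph of prose...</doc>

Vendor A's renderer obeys the PI and hyphenates; Vendor B's processor
ignores it, as it is entitled to.  Same document, two conformant
processors, two different results.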

> >* "<![CDATA[" notation is cumbersome and creates new parser state and
> >  alternate representations.
> 
> It's much less cumbersome than the alternative, which is to escape
> each delimiter in the block individually.

This is the same mistake which was made in HTML with the <PLAINTEXT>
tag (I forget the exact name used).  That tag was obsoleted in favor of
the <pre> tag.  A similar mistake was made in an early draft of the
text/enriched media type, but was corrected in the final version.

It turns out it's easier and cleaner to have one parser state and
quote characters appropriately than it is to have two parser states
with different quoting conventions.  And when the second parser state
is infrequently exercised, it causes no end of bugs, complications and
problems.
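Concretely, a parser must already treat these two spellings as the
identical character stream, with the second one exercising the extra
parser state:

   <code>if (a &lt; b &amp;&amp; b &lt; c) exit(1);</code>
   <code><![CDATA[if (a < b && b < c) exit(1);]]></code>

A bug in the CDATA state only shows up when someone happens to use the
second spelling.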

Because text/enriched went through a public review process in the IETF, 
this problem was identified and eliminated before it was published.  Shame
that XML lacked a similar public review process.

> It does create a new
> parser state and allow alternate representations of the same character
> stream; since providing only a single representation for a given
> character stream is not a goal of XML, I am not sure why this counts
> as a weakness.
> 
> If it were a goal, the use of any existing character set standard 
> would defeat it in short order.

One should always minimize alternate representations of data.  Every
alternate representation is another code path: a new chance for bugs
and one more case which has to be tested specially.

> >* Version number text is broken -- likely to leave things stuck at
> >  "1.0" just like MIME-Version.
> 
> How?  My understanding is that MIME built the version number into
> the grammar, so that conforming MIME parsers were required to 
> reject version numbers other than 1.0.  If the XML spec makes such
> a requirement, I don't see where.  The relevant sentence text
> says that it is an error to use the version number 1.0 if the
> document does not conform to version 1.0 of the spec; it does not
> say, in anything I see, that version 1.0 processors are required to
> signal an error if they see any other version number.  I'm not
> even sure they are even allowed to signal an error solely on the
> basis of the version number.

The XML spec says:
   Processors may signal an error if they receive documents labeled with
   versions they do not support.

This is exactly what early MIME drafts said.  As soon as one company
chose to check the version number and fail if it differed, the version
number was effectively locked in, since any new version was
automatically incompatible with some compliant implementations of the
earlier version -- even if it was only a minor revision.  To get the
most out of the version number, the spec should have said that parsers
must not signal an error when only the minor version (the portion
after the ".") differs.
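A sketch of the rule I have in mind (my suggestion, not what the spec
says), written as Python for concreteness:

   # Hypothetical check: a processor accepts any minor revision and
   # may reject only a change of major version.
   def xml_version_ok(version):
       # accept "1.0", "1.1", ..., reject "2.0" and anything else
       return version.split(".")[0] == "1"

Under that rule a future minor revision could deploy without breaking
a single conforming 1.0 processor.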

> >* Reference to UCS-2 which doesn't really exist.
> 
> What does 'really exist' mean?  UCS-2 was defined by ISO 10646
> the last time I looked; if you don't have access to 10646,
> consult appendix C of Unicode 2.0.
> 
> If definition in an ISO standard does not meet the definition of
> real existence, then 'real existence' is not an interesting or
> useful concept for discussing the XML spec.

UCS-2 is a myth: as soon as a single codepoint is assigned outside the
BMP, there is no 16-bit character set.  In practice the label is just
a synonym for "UTF-16", which does exist and is the correct name.
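To make that concrete, here is the UTF-16 encoding of U+10000 (the
first codepoint beyond the BMP), worked out in Python:

   # U+10000 needs two 16-bit code units (a surrogate pair), so there
   # is no single 16-bit -- i.e. UCS-2 -- code for it at all.
   c = 0x10000
   v = c - 0x10000               # v = 0x0000
   high = 0xD800 + (v >> 10)     # high surrogate: 0xD800
   low  = 0xDC00 + (v & 0x3FF)   # low surrogate:  0xDC00
   assert (high, low) == (0xD800, 0xDC00)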

ISO definitions often don't match reality.

> >* Too many encoding variations.  &#x;  &#; &; UTF-8, UTF-16.
> 
> Personally, I would agree:  I think decimal character references, and
> UTF-8, would be better off omitted.  But I'm not sure the spec 
> would really be better technically in that case: just smaller.  And
> it would definitely be less widely adopted.

The problem is that each encoding variation introduces the possibility
of undetected bugs and interoperability problems.  I doubt all 6 ways
of representing a character will see regular use, so the rarely-used
ones are exactly where the undetected bugs will hide.
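For instance, taking e-acute (U+00E9) as the character, a parser has
to treat all of these as the same character data:

   é               literally, as the UTF-8 bytes C3 A9
   é               literally, as the UTF-16 code unit 00E9
   &#233;          decimal character reference
   &#xE9;          hexadecimal character reference
   &eacute;        entity reference (only if declared in the DTD)
   <![CDATA[é]]>   the same literal character inside a CDATA section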

> >* Byte-order mark replicates TIFF problem.
> 
> Can someone explain this?

TIFF files are permitted to be either big-endian or little-endian with a
magic number at the beginning indicating which.  Sound familiar?

Well, look at what happened...  Some products supported both
variations, some supported only one.  This produced all sorts of
programs which didn't interoperate.  The problem became so serious
that vendors had to drag the interoperability issue into the user
interface: most graphics software which saves TIFF files now offers a
"PC variant" and a "Mac variant" (the most human-friendly way anyone
could come up with to express the big-endian/little-endian
difference).

If you had just said "XML in UTF-16 is always stored and transmitted in
network byte order (big-endian)", there would be no
interoperability problems.  As it is, I predict exactly the same thing
will happen to XML as happened to TIFF, for exactly the same reasons.
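Every UTF-16 processor is forced to sniff the byte order up front.  A
minimal sketch in Python (real autodetection must also cope with UTF-8
and the encoding declaration):

   # The byte-order mark is U+FEFF serialized in the file's own order.
   def detect_utf16_order(data):
       if data[:2] == b"\xfe\xff":
           return "big-endian"       # network byte order
       if data[:2] == b"\xff\xfe":
           return "little-endian"
       return "unknown"              # the processor has to guess

Mandating big-endian would have made this function -- and the whole
class of mismatch bugs -- unnecessary.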

		- Chris

Received on Friday, 19 June 1998 16:35:23 UTC