SERIOUS concerns about implementation from Peter Murray-Rust on 1997-02-20 (w3c-sgml-wg@w3.org from February 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Thu, 20 Feb 1997 10:39:34 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <3647@ursus.demon.co.uk>
<WARNING>
This contains a contentious suggestion and I apologise if I have missed
something obvious - I am not an expert in the management of existing
SGML systems and I simply post the concerns of a typical webhacker.
</WARNING>

I have spent the last 12 hours trying to get EXISTING SGML files to work
with XML and am now deeply worried that the syntax of XML Phase I is 
seriously flawed and it will be very difficult to market.  (The good
news is that I think it's *technically* very easy to mend.)

[Please forgive the loose terminology in contrasting 'SGML' with 'XML']

<EXAMPLE>
If you install Panorama there is a directory called catalog, with about
35 files which are a mixture of popular *.dtd and *.ent.
EVERY SINGLE FILE WILL BREAK XML UNLESS THE SYNTAX IS CHANGED.
(I can think of no simple kludge for this problem.)  If XML goes
ahead with 961114, then the only solution is to have a modified
version of every file which works with XML.  (This file will also
work with SGML, of course, but it would need the *supplier* to authorise
a change.)
</EXAMPLE>


The problem arises from (at least) the following:
<!-- .* --> is illegal in XML
<? .* > is illegal in XML (I don't know whether there are any PIs in the
	distribution, but there could be.)
<!ENTITY % Foo "-//FOO//ENTITIES//EN"> %Foo; is not supported

<AXIOM>
XML must be able to interoperate with these files UNCHANGED
</AXIOM>

The good news is that all these files are easy to parse (i.e. *I* can do
it, I think).  They all have identifiable tokens and, although I haven't
tried to parse DTDs (I only do documents at present), it looks 'trivial'.

Why was <!--* .* *--> introduced?  Presumably it is possible for someone
to write something with embedded '--' that is difficult to parse.  I doubt
very much if this happens in any of the DTDs or entity sets.  If it *does*
we have to write better parsers, or put pressure on the 0.1% of people
who write poor DTDs.
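
[A minimal sketch, in Python, of how old-style comments could be located.
The name find_comments is made up for illustration, and the pattern treats
everything from '<!--' to the first '-->' as one comment, which is a
simplification of the SGML rules.  It also reports whether the body
contains an embedded '--', the case that presumably motivated the change.]

```python
import re

# Everything between "<!--" and the first following "-->" (simplified rule).
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_comments(text):
    """Return (body, has_embedded_dashes) for each <!-- ... --> comment."""
    results = []
    for match in COMMENT.finditer(text):
        body = match.group(1)
        # An embedded "--" is the construct a naive parser may trip over.
        results.append((body, "--" in body))
    return results
```

Running this over a directory of DTDs and entity sets would show how often
the awkward case occurs in practice.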

I assume <? ... ?> was introduced for the same reason.  That someone might
write
<?XML VERSION=">1.0">
or something like that.  This is still easy to parse if the rules say that
the quotes are required.  I am sure that it is *possible* to create some
monster that breaks our syntax, but in practice it won't be a problem.
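
[A minimal sketch, in Python, of finding the end of a '<? ... >' PI under
the assumption made above, namely that any '>' inside the PI appears only
within a quoted literal.  The function name find_pi_end is hypothetical.
A PI containing an unbalanced quote or apostrophe would defeat it.]

```python
def find_pi_end(text, start=0):
    """Return the index of the '>' that closes the PI beginning at `start`.

    A '>' inside a single- or double-quoted literal is skipped.
    Returns -1 if the PI is never closed.
    """
    if not text.startswith("<?", start):
        raise ValueError("no processing instruction at this position")
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 2, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote
        elif ch == ">":
            return i  # first '>' outside any literal
    return -1
```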

I don't understand the forbidding of PUBLIC ENTITYs at all.  I found it
a non-issue until this debate started.  You get a DTD from (say) Panorama
which requires an EntitySet.  It is *identified* by an FPI.  Panorama even
sends you this file *in the same directory*.  The catalog tells you what its name
is.  Even I have managed to write code that resolves this.  My simple
understanding is that FPIs identify the document just like an ISBN identifies
a book.  Up until now it has been *my* responsibility to get the document.
Panorama helps me by downloading lots of them.  I assume that most of you
in practice have a directory full of commonly used DTDs and Entity Sets
which you use every day.  [The problem only arises when someone MODIFIES
a document which purports to be the true document with a given FPI.  I assume
that not only is this bad practice, it may also break copyright, particularly
if it is retransmitted].
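
[A minimal sketch, in Python, of the kind of lookup described above.  It
assumes catalog entries of the form PUBLIC "fpi" "filename" (or with an
unquoted filename) and ignores everything else in the catalog.  The names
load_catalog and resolve are made up for illustration.]

```python
import re

# PUBLIC "formal public identifier" followed by a quoted or bare file name.
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+(?:"([^"]+)"|(\S+))')


def normalise(fpi):
    """Collapse runs of whitespace so equivalent FPIs compare equal."""
    return " ".join(fpi.split())


def load_catalog(catalog_text):
    """Map each FPI to the local file name given in its PUBLIC entry."""
    mapping = {}
    for fpi, quoted, bare in PUBLIC_ENTRY.findall(catalog_text):
        mapping[normalise(fpi)] = quoted or bare
    return mapping


def resolve(fpi, mapping):
    """Return the file name for an FPI, or None if the catalog lacks it."""
    return mapping.get(normalise(fpi))
```

Given a DTD containing <!ENTITY % Foo PUBLIC "-//FOO//ENTITIES//EN">, the
parser would call resolve() with the FPI and open the returned file from
the catalog's directory.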

<COROLLARY>
The result of this is that EVERY file I have been working with up to now
(document instances, DTDs, catalogs, EntitySets) is broken w.r.t. XML.  I 
have to modify them, often by hand.
</COROLLARY>

<COROLLARY>
The rest of the world is not going to like this
</COROLLARY>

<STRATEGY>
XML should interoperate with existing DTDs/catalogs/entitySets as far as
possible.  If not possible XML should make accommodation for parsing
problems in very commonly-used WELL-FORMED documents.  (In a very small
number of cases there may be grey areas - but you have lots of these with
SGML anyway -
ambiguous content, whitespace, etc. - these are far more likely to
give problems).
</STRATEGY>

<PROPOSAL>
Rescind <!--* *-->
</PROPOSAL>

<PROPOSAL>
Rescind <? ?>
</PROPOSAL>

<PROPOSAL>
Allow the use of PUBLIC ENTITYs and other similar constructions in DTDs.
</PROPOSAL>

There remain other problems which I am not expert enough to make comments
on.  I flag up two:

The '- -', '- o' fields in ELEMENT.  (I don't know their name).  You have
to be able to skip these if you parse an existing DTD.
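
[A minimal sketch, in Python, of skipping these fields (usually called
omitted tag minimization in SGML).  It assumes the two flags, each '-' or
'o', follow the element name or a parenthesised name group and are
separated by whitespace.  The name strip_minimization is hypothetical, and
keywords are matched in upper case only.]

```python
import re

# "<!ELEMENT", then a name or (name group), then the two minimization flags.
MINIMIZATION = re.compile(
    r"(<!ELEMENT\s+(?:\([^)]*\)|\S+))\s+[-oO]\s+[-oO](?=\s)"
)


def strip_minimization(dtd_text):
    """Remove the '- -', '- o', etc. flags from ELEMENT declarations."""
    return MINIMIZATION.sub(r"\1", dtd_text)
```

Declarations that carry no such flags are left untouched, so the same pass
can be run over files that have already been converted.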

SDATA in entitySets.  I don't know what it does, but it should be
kludgeable.

                        --------------------------

One point to bear in mind is that unless you constantly think about
implementation, you are going to run into problems.  I have seen several
languages die because no one bothered to think of implementation - it
was assumed to be easy.  It is always harder than you think, which is
why I urge caution on Phase II.

I know that it is very antisocial to suggest major revisions at a late
stage and I am a believer in communal discipline.  If I have missed something
obvious please forgive this posting.

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Thursday, 20 February 1997 06:34:37 UTC