SERIOUS concerns about implementation

This contains a contentious suggestion and I apologise if I have missed
something obvious - I am not an expert in the management of existing
SGML systems and I simply post the concerns of a typical webhacker.

I have spent the last 12 hours trying to get EXISTING SGML files to work
with XML and am now deeply worried that the syntax of XML Phase I is 
seriously flawed and it will be very difficult to market.  (The good
news is that I think it's *technically* very easy to mend.)

[Please forgive the loose terminology in contrasting 'SGML' with 'XML']

If you install Panorama there is a directory called catalog, with about
35 files which are a mixture of popular *.dtd and *.ent.  These do not
work with XML as they stand, and I can think of no simple kludge for
this problem.  If XML goes ahead with 961114, then the only solution is
to have a modified version of every file which works with XML.  (The
modified files will also work with SGML, of course, but the change
would need the *supplier* to authorise it.)

The problem arises from (at least) the following:
<!-- .* --> is illegal in XML
<? .* > is illegal in XML (I don't know whether there are any PIs in the
	distribution, but there could be.)
<!ENTITY % Foo "-//FOO//ENTITIES//EN"> %Foo; is not supported

XML must be able to interoperate with these files UNCHANGED.

The good news is that all these files are easy to parse (i.e. *I* can do
it, I think).  They all have identifiable tokens and, although I haven't
tried to parse DTDs (I only do documents at present), it looks 'trivial'.

Why was <!--* .* *--> introduced?  Presumably it is possible for someone
to write something with embedded '--' that is difficult to parse.  I doubt
very much if this happens in any of the DTDs or entity sets.  If it *does*
we have to write better parsers, or put pressure on the 0.1% of people
who write poor DTDs.
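
To make the comment case concrete, here is a minimal Python sketch.  It
is not taken from any XML draft: the function name scan_comment is
hypothetical, and it assumes the simple rule that a comment runs from
'<!--' to the first '-->'.  It deliberately ignores SGML's fuller
comment-declaration syntax and only flags the awkward case of an
embedded '--' so that such files can be found and looked at.

```python
def scan_comment(text, start=0):
    """Scan one '<!-- ... -->' comment beginning at `start`.

    Returns (end, has_embedded_dashes), where `end` is the index just
    past the closing '-->'.  Assumption: the first '-->' ends the comment.
    """
    if not text.startswith("<!--", start):
        raise ValueError("no comment at this offset")
    close = text.find("-->", start + 4)
    if close == -1:
        raise ValueError("unterminated comment")
    body = text[start + 4:close]
    # An embedded '--' is the case that makes comments hard to parse.
    return close + 3, "--" in body


if __name__ == "__main__":
    for sample in ("<!-- plain --><x>", "<!-- a -- b -->"):
        end, awkward = scan_comment(sample)
        print(repr(sample[:end]), "embedded '--':", awkward)
```

Running something like this over a directory of DTDs and entity sets
would show how often the embedded '--' case really occurs.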

I assume <? ... ?> was introduced for the same reason: that someone might
write
<?XML VERSION=">1.0">
or something like that.  This is still easy to parse if the rules say that
the quotes are required.  I am sure that it is *possible* to create some
monster that breaks our syntax, but in practice it won't be a problem.
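
As an illustration of why required quotes keep this easy, here is a
minimal Python sketch of a scanner that finds the end of a PI closed by
a plain '>' while ignoring any '>' inside a quoted value.  The name
find_pi_end is hypothetical, and the quoting rule is the assumption
stated above, not something the current draft says.

```python
def find_pi_end(text, start=0):
    """Return the index just past the '>' that closes the PI at `start`.

    Assumption: any '>' that is part of a value is inside matching
    single or double quotes, so an unquoted '>' always ends the PI.
    """
    if not text.startswith("<?", start):
        raise ValueError("no processing instruction at this offset")
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 2, len(text)):
        ch = text[i]
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i + 1
    raise ValueError("unterminated processing instruction")


if __name__ == "__main__":
    sample = '<?XML VERSION=">1.0"> rest of document'
    print(sample[:find_pi_end(sample)])
```

A PI containing a stray unmatched quote would defeat this sketch, which
is exactly the kind of 'monster' the rule would have to forbid.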

I don't understand the forbidding of PUBLIC ENTITYs at all.  I found it
a non-issue until this debate started.  You get a DTD from (say) Panorama
which requires an EntitySet.  It is *identified* by an FPI.  Panorama even
send you this file *in the same directory*.  The catalog tells what its name
is.  Even I have managed to write code that resolves this.  My simple
understanding is that FPIs identify the document just like an ISBN identifies
a book.  Up until now it has been *my* responsibility to get the document.
Panorama helps me by downloading lots of them.  I assume that most of you
in practice have a directory full of commonly used DTDs and Entity Sets
which you use every day.  [The problem only arises when someone MODIFIES
a document which purports to be the true document with a given FPI.  I assume
that not only is this bad practice, it may also break copyright, particularly
if it is retransmitted].
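
The lookup I have in mind can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it assumes the
catalog holds entries of the form PUBLIC "fpi" "filename" (the real
Panorama catalog may differ), the keyword is matched case-sensitively,
and the names parse_catalog and resolve_fpi and the file names in the
example are hypothetical.

```python
import re

# One catalog entry: PUBLIC "formal public identifier" "file" (or bare file).
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+(?:"([^"]*)"|(\S+))')


def _normalise(fpi):
    """Collapse runs of whitespace so equivalent FPIs compare equal."""
    return " ".join(fpi.split())


def parse_catalog(catalog_text):
    """Build a dict mapping each FPI to the file name the catalog gives."""
    mapping = {}
    for match in _PUBLIC_ENTRY.finditer(catalog_text):
        quoted, bare = match.group(2), match.group(3)
        mapping[_normalise(match.group(1))] = quoted if quoted is not None else bare
    return mapping


def resolve_fpi(fpi, mapping):
    """Return the local file name for `fpi`, or None if it is not listed."""
    return mapping.get(_normalise(fpi))


if __name__ == "__main__":
    catalog = (
        'PUBLIC "-//FOO//ENTITIES//EN" "foo.ent"\n'
        'PUBLIC "-//BAR//DTD Bar//EN" bar.dtd\n'
    )
    table = parse_catalog(catalog)
    print(resolve_fpi("-//FOO//ENTITIES//EN", table))
```

The point is only that the FPI is a key and the catalog is a table;
fetching a file that is not already present remains my responsibility.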

The result of this is that EVERY file I have been working with up to now
(document instances, DTDs, catalogs, EntitySets) is broken w.r.t. XML.  I
have to modify them, often by hand.

The rest of the world is not going to like this.

XML should interoperate with existing DTDs/catalogs/entitySets as far as
possible.  Where that is not possible, XML should make accommodation for
parsing problems in very commonly-used WELL-FORMED documents.  (In a very
small number of cases there may be grey areas - but you have lots of
these with SGML anyway - ambiguous content, whitespace, etc. - these are
far more likely to give problems.)

Rescind <!--* *-->

Rescind <? ?>

Allow the use of PUBLIC ENTITYs and other similar constructions in DTDs.

There remain other problems which I am not expert enough to make comments
on.  I flag up two:

The '- -', '- o' fields in ELEMENT.  (I don't know their name.)  You have
to be able to skip these if you parse an existing DTD.
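
As a rough sketch of what 'skipping' could look like (these fields
appear to be SGML's omitted-tag minimization indicators), the Python
below drops the two fields from ELEMENT declarations before further
parsing.  The name strip_minimization_fields is hypothetical, and the
sketch assumes the two fields directly follow the element name or a
parenthesised name group and are each '-' or 'o'.

```python
import re

# <!ELEMENT name-or-(group)  then two fields, each '-' or 'o'.
_MINIMIZATION = re.compile(
    r'(<!ELEMENT\s+(?:\([^)]*\)|[^\s()]+))\s+[-o]\s+[-o](?=\s)',
    re.IGNORECASE,
)


def strip_minimization_fields(dtd_text):
    """Remove the '- -' / '- o' style fields from ELEMENT declarations.

    Declarations that do not carry the fields are left unchanged.
    """
    return _MINIMIZATION.sub(r'\1', dtd_text)


if __name__ == "__main__":
    dtd = (
        "<!ELEMENT para - o (#PCDATA)>\n"
        "<!ELEMENT (h1|h2) - - (#PCDATA)>\n"
        "<!ELEMENT br EMPTY>\n"
    )
    print(strip_minimization_fields(dtd))
```

This leaves the original file untouched on disk; the stripping happens
only in what the parser sees.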

SDATA in entitySets.  I don't know what it does, but it should be


One point to bear in mind is that unless you constantly think about
implementation, you are going to run into problems.  I have seen several
languages die because no one bothered to think of implementation - it
was assumed to be easy.  It is always harder than you think, which is
why I urge caution on Phase II.

I know that it is very antisocial to suggest major revisions at a late
stage and I am a believer in communal discipline.  If I have missed something
obvious please forgive this posting.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences