- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Thu, 20 Feb 1997 10:39:34 GMT
- To: w3c-sgml-wg@w3.org
<WARNING>
This contains a contentious suggestion and I apologise if I have missed something obvious - I am not an expert in the management of existing SGML systems and I simply post the concerns of a typical webhacker.
</WARNING>

I have spent the last 12 hours trying to get EXISTING SGML files to work with XML and am now deeply worried that the syntax of XML Phase I is seriously flawed and that it will be very difficult to market. (The good news is that I think it's *technically* very easy to mend.) [Please forgive the loose terminology in contrasting 'SGML' with 'XML'.]

<EXAMPLE>
If you install Panorama there is a directory called catalog, with about 35 files which are a mixture of popular *.dtd and *.ent. EVERY SINGLE FILE WILL BREAK XML UNLESS THE SYNTAX IS CHANGED. (I can think of no simple kludge for this problem.) If XML goes ahead with 961114, then the only solution is to have a modified version of every file which works with XML. (This file will also work with SGML, of course, but it would need the *supplier* to authorise a change.)
</EXAMPLE>

The problem arises from (at least) the following:

  <!-- .* --> is illegal in XML
  <? .* > is illegal in XML (I don't know whether there are any PIs in the distribution, but there could be.)
  <!ENTITY % Foo "-//FOO//ENTITIES//EN"> %Foo; is not supported

<AXIOM>
XML must be able to interoperate with these files UNCHANGED
</AXIOM>

The good news is that all these files are easy to parse (i.e. *I* can do it, I think). They all have identifiable tokens and, although I haven't tried to parse DTDs (I only do documents at present), it looks 'trivial'.

Why was <!--* .* *--> introduced? Presumably it is possible for someone to write something with embedded '--' that is difficult to parse. I doubt very much if this happens in any of the DTDs or entity sets. If it *does* we have to write better parsers, or put pressure on the 0.1% of people who write poor DTDs.

I assume <? ... ?> was introduced for the same reason: that someone might write <?XML VERSION=">1.0"> or something like that. This is still easy to parse if the rules say that the quotes are required. I am sure that it is *possible* to create some monster that breaks our syntax, but in practice it won't be a problem.

I don't understand the forbidding of PUBLIC ENTITYs at all. I found it a non-issue until this debate started. You get a DTD from (say) Panorama which requires an EntitySet. It is *identified* by an FPI. Panorama even sends you this file *in the same directory*. The catalog tells you what its name is. Even I have managed to write code that resolves this. My simple understanding is that FPIs identify the document just like an ISBN identifies a book. Up until now it has been *my* responsibility to get the document. Panorama helps me by downloading lots of them. I assume that most of you in practice have a directory full of commonly used DTDs and Entity Sets which you use every day. [The problem only arises when someone MODIFIES a document which purports to be the true document with a given FPI. I assume that not only is this bad practice, it may also break copyright, particularly if it is retransmitted.]

<COROLLARY>
The result of this is that EVERY file I have been working with up to now (document instances, DTDs, catalogs, EntitySets) is broken w.r.t. XML. I have to modify them, often by hand.
</COROLLARY>

<COROLLARY>
The rest of the world is not going to like this
</COROLLARY>

<STRATEGY>
XML should interoperate with existing DTDs/catalogs/entitySets as far as possible. If that is not possible, XML should make accommodation for parsing problems in very commonly-used WELL-FORMED documents. (In a very small number of cases there may be grey areas - but you have lots of these with SGML anyway - ambiguous content, whitespace, etc. - these are far more likely to give problems.)
</STRATEGY>

<PROPOSAL>
Rescind <!--* *-->
</PROPOSAL>

<PROPOSAL>
Rescind <? ?>
</PROPOSAL>

<PROPOSAL>
Allow the use of PUBLIC ENTITYs and other similar constructions in DTDs.
</PROPOSAL>

There remain other problems which I am not expert enough to comment on. I flag up two:

  The '- -', '- o' fields in ELEMENT. (I don't know what they are called.) You have to be able to skip these if you parse an existing DTD.
  SDATA in entitySets. I don't know what it does, but it should be kludgeable.

--------------------------

One point to bear in mind is that unless you constantly think about implementation, you are going to run into problems. I have seen several languages die because no one bothered to think of implementation - it was assumed to be easy. It is always harder than you think, which is why I urge caution on Phase II.

I know that it is very antisocial to suggest major revisions at a late stage and I am a believer in communal discipline. If I have missed something obvious please forgive this posting.

P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
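
[Illustration, not part of the original message.] The catalog lookup described above, where a DTD names an entity set by its FPI and the catalog supplies the local file name, can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the author's code: it assumes catalog entries of the form PUBLIC "fpi" "filename" (as in SGML Open style catalogs), it ignores catalog comments and all other entry types, and the function names and sample file names are hypothetical.

```python
import re

# Assumed entry form: PUBLIC "formal public identifier" "local file name"
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def normalize_fpi(fpi):
    """Collapse runs of whitespace so equivalent FPIs compare equal."""
    return " ".join(fpi.split())


def parse_catalog(text):
    """Return a dict mapping each FPI in the catalog text to its file name."""
    return {
        normalize_fpi(fpi): system_id
        for fpi, system_id in _PUBLIC_ENTRY.findall(text)
    }


def resolve(fpi, catalog):
    """Look up the file name for an FPI, or None if the catalog lacks it."""
    return catalog.get(normalize_fpi(fpi))


# Hypothetical catalog contents; only the first FPI appears in the message.
SAMPLE_CATALOG = '''
PUBLIC "-//FOO//ENTITIES//EN" "foo.ent"
PUBLIC "-//FOO//DTD Example//EN" "example.dtd"
'''

catalog = parse_catalog(SAMPLE_CATALOG)
print(resolve("-//FOO//ENTITIES//EN", catalog))
```

An unknown FPI yields None, which leaves it to the caller to fetch the document some other way, much as the message describes doing by hand.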
Received on Thursday, 20 February 1997 06:34:37 UTC