- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Wed, 07 May 1997 11:34:16 GMT
- To: w3c-sgml-wg@w3.org
We have been asked to concentrate on XML-lang and xml-link ... so.

I am finding myself thoroughly confused by whitespace handling. Although I suspect that the draft is consistent and the ERB/WG all agree on what it's meant to do, please treat the following as a typical webhacker confusion. It may be that special explanation is required in the draft, because otherwise the HTML2XML community will be thoroughly confused.

I will use a very simplified subset of CML as illustration. (I will omit minimisation flags to save typing - otherwise all examples should be interchangeable between SGML and XML). Please also forgive errors.

The document:

    <?XML VERSION="1.0"?>
    <!DOCTYPE CML [
    <!ELEMENT CML (XVAR)*>
    <!ELEMENT XVAR (#PCDATA)>
    ]>
    <CML>
    <XVAR>
    A variable
    </XVAR>
    </CML>

parses with sgmls to give a CML element which contains an XVAR element whose content is 'A variable'. There are no other #PCDATA elements. I can include as much whitespace (space, newline) between tags as I like and the result is the same.

If I write

    <!ELEMENT CML (#PCDATA|XVAR)*>

instead, it also validates, but gives a different result, with additional #PCDATA elements (content '\n') on either side of the XVAR element. If I use ANY as the content model of CML it does the same. NXP appears to do the same as sgmls on a cursory inspection, even without validation switched on.

Let's move to WF mode...

If the ELEMENT declarations are removed from the DTD subset (or there is no DTD at all), then the elements are assumed to have a content model of ANY. ***This will result in additional PCDATA nodes in the tree***. This is doubtless not news to any of you, but it's a shock to me, that WF documents and validated documents ***GIVE DIFFERENT OUTPUT***. I am sure that this will be a rich source of confusion.

Ideally I would like to add an XML-LANG option that 'fixed' the problem, but I'm not sure it's fixable :-) CML is straightforward in that PCDATA only occurs in two elements in the DTD, and those can only have #PCDATA.
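The WF-mode behaviour described above can be reproduced with a small sketch. It uses Python's `xml.dom.minidom` purely as a stand-in for a non-validating parser; it is not sgmls, NXP or Lark, and it makes no claim about what those tools emit. The XML declaration and DOCTYPE are left out so that the sketch does not depend on the draft-era declaration syntax, and the names `child_summary`, `LINE_BROKEN` and `ONE_LINE` are invented for illustration.

```python
from xml.dom import minidom

# The example document without any DTD, i.e. well-formed only.
LINE_BROKEN = """<CML>
<XVAR>
A variable
</XVAR>
</CML>"""

# The same content with no whitespace between tags.
ONE_LINE = "<CML><XVAR>A variable</XVAR></CML>"


def child_summary(xml_text):
    """List the direct children of the root element.

    Text nodes are reported as ("#PCDATA", content),
    elements as (tag name, None).
    """
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("#PCDATA", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append((node.tagName, None))
    return summary


# Without a content model the parser cannot tell that the newlines
# around XVAR are insignificant, so they appear as text nodes.
print(child_summary(LINE_BROKEN))
print(child_summary(ONE_LINE))
```

With this parser the line-broken form yields a '\n' text node on either side of XVAR, while the one-line form yields XVAR alone.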
I can throw out 'spurious' PCDATA nodes later, but that seems to me to be DTD-dependent and we don't have a flag that can signal this. So (expecting the answer 'no') **is there any way to modify XML-lang to suppress PCDATA elements having only whitespace content in certain contexts?** (I thought that was the original intent of the first draft).

So far we agree on the ***present output for the parser*** if we can't change XML-lang. It differs according to whether the parser uses some or all of the DTD, even for a WF document.

What do we do with what we get? The spec is not very much help. PRESERVE says 'take exactly what you get. That's what the author+DTD wants you to have'. DEFAULT says 'up to the application', which doesn't help the implementer.

I still find that the terms 'application', 'parser' and 'processor' are not clear in my mind, and this is further confused by the common usage 'HTML is an *application* of SGML'. [BTW - has the hanging sentence in 2.8 been modified?] I am assuming that the 'application' is a program, distinct from the parser (which is a 'processor'?) and that JUMBO is an application (a generic one). Therefore it's up to **JUMBO** what it does with DEFAULT, right?

This is independent of the DTD, and the DTD author and the author of the WF document can have no control over the way DEFAULT is implemented. It's quite possible that some applications could decide to throw out (delete from tree) the 'spurious' PCDATA, while others might collapse it to a single space, others to a null string and some simply use PRESERVE (as JUMBO does). The author and the DTD have no control over this.

This has serious consequences for WF documents because although this is strictly logical it's anything but intuitive. It means that a document like the one above is highly dangerous without its DTD, which seems a pity because it's eminently useful.
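To make the range of DEFAULT behaviours concrete, here is a minimal sketch of the four treatments listed above, applied to whitespace-only text nodes after parsing. It again uses Python's `xml.dom.minidom` as a stand-in parser. The policy names ("preserve", "drop", "collapse", "empty") and the functions `apply_policy` and `render` are hypothetical labels chosen for this illustration; they do not come from the draft or from JUMBO.

```python
from xml.dom import minidom

DOC = """<CML>
<XVAR>
A variable
</XVAR>
</CML>"""


def apply_policy(element, policy):
    """Apply one whitespace policy, in place, to whitespace-only text nodes."""
    for node in list(element.childNodes):  # copy: we may remove children
        if node.nodeType == node.ELEMENT_NODE:
            apply_policy(node, policy)
        elif node.nodeType == node.TEXT_NODE and node.data.strip() == "":
            if policy == "drop":        # delete the node from the tree
                element.removeChild(node)
            elif policy == "collapse":  # reduce to a single space
                node.data = " "
            elif policy == "empty":     # keep the node, null string content
                node.data = ""
            # "preserve": leave the node exactly as parsed
    return element


def render(xml_text, policy):
    """Parse, apply the policy, and serialise the root element."""
    doc = minidom.parseString(xml_text)
    return apply_policy(doc.documentElement, policy).toxml()


for name in ("preserve", "drop", "collapse", "empty"):
    print(name, repr(render(DOC, name)))
```

Four applications each choosing a different branch would hand four different trees to their users from the same WF document, and nothing in the document or its DTD selects among them.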
If all CML documents have to be presented as

    <CML><XVAR>A variable</XVAR></CML>

this is unworkable, since this can run to tens of thousands of characters without a line break and this breaks text editors. And remember (Article 6) 'human-legible and reasonably clear'.

Ideally we need a fix for this. If none is possible, then we need a VERY clear exposition of this. It also means that non-validating parsers (or at least parsers which cannot read the DTD) will give different outputs from validating parsers.

I have run Lark briefly over the top file, Tim, and my impression is that Lark puts in the 'spurious' PCDATA nodes, whilst NXP doesn't. [Forgive me if I'm wrong here, Tim]. This would imply that it's possible to get parsers that give different output, different output on validation/non_validation, different output with (no)DTD subset, and different treatment of DEFAULT by browsers. [This is with a properly well-formed document with balanced quotes, tags, and the rest :-). ]

IMO this gives people an awful lot of places to go wrong. However the solution is not to get rid of WF documents, as has been suggested, but to make these aspects of behaviour much clearer.

	P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Wednesday, 7 May 1997 06:40:41 UTC