- From: David G. Durand <dgd@cs.bu.edu>
- Date: Thu, 12 Dec 1996 12:51:20 -0800
- To: w3c-sgml-wg@w3.org
At 12:01 PM 12/11/96, Tim Bray wrote:

I'll try to be brief, as I've said enough on this previously, but since my
position is maybe slightly different this time around, I'll comment. I will
also refer to some other responses to this note.

>Some feel the mechanism is unaesthetic and prone to misinterpretation
>as it stands, and should simply be discarded, with all non-markup
>bytes being passed to the application to do with as it will. This
>can be made 8879-compliant in the short term via a mechanism proposed by
>Charles Goldfarb, and in the medium/long term with a TC via WG8.

As noted, documents parsed with and without DTDs will be different in this
approach. Paul P. and Eve Mahler have claimed that ignoring whitespace in
element content will be a great hardship, but I've not seen the proof of
this yet. Validating parsers could ignore those spaces for applications.

>There are some problems with both the current and revised approaches:
>
>o -XML-SPACE, although this is not documented, really only deals with
>  mixed content; many feel it's important to ignore white space in
>  element content;
>  <list>
>    <item>..</item>
>    <item>..</item>
>  </list>
>  ^^
>  e.g. the above
>  but XML, when there's no DTD, doesn't know where element content is and
>  *cannot* be made to do this.

If this is the case, the only good justification for this hack has died, I
think.

>
>o SGML's world-view tripartitions the set of characters:
>  those that are text, those that are markup, and those that are
>  insignificant white space. Can XML really afford to discard this
>  distinction?

I think it must. See below for one reason why. The other is even simpler:
this tripartite division confuses almost everybody. If we have trouble
keeping it straight without dedicating significant effort, how could we
propagate it further to a more naive, less-motivated public?
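To make the quoted <list>/<item> point concrete, here is a minimal sketch in Python. It uses the standard xml.dom.minidom module as a stand-in for a parser working without a DTD; the library choice and the helper names child_summary and strip_element_content_space are illustrative assumptions, not anything proposed in this thread. Without a DTD the parser has no way to know that <list> has element content, so the line breaks and indentation come through as ordinary text nodes. The second helper shows roughly what a DTD-aware parser could do instead.

```python
from xml.dom import minidom

# The example quoted above, with the line breaks and indentation kept.
SOURCE = "<list>\n  <item>..</item>\n  <item>..</item>\n</list>"


def child_summary(xml_text):
    """Return (kind, value) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


def strip_element_content_space(summary):
    """Drop whitespace-only text entries.

    This is only safe when the parent is known to have element content,
    which is exactly the knowledge a DTD would supply (hypothetical helper).
    """
    return [
        (kind, value)
        for kind, value in summary
        if not (kind == "text" and value.strip() == "")
    ]


if __name__ == "__main__":
    raw = child_summary(SOURCE)
    print(raw)
    print(strip_element_content_space(raw))
```

The first print shows whitespace-only text entries interleaved with the two item elements; the second shows only the elements. An application receiving the first form has to decide for itself which text is significant.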

>o Many real-world editors, largely to deal with the fact that
>  text (whether or not we like it) is stored in files in what amounts
>  to a series of records, freely insert line breaks and other white
>  space because they know SGML processors will ignore it. Can we
>  afford to make that white space significant?

Since those editors will already require a minor facelift to work with XML
anyway, removing a few "\n"s in the code is likely to be easy. I think this
issue is less important than some make it. Yes, we might have to change
software, but no, it is not a hard change, even added to the few others we
have already admitted.

>o Some applications, e.g. full-text indexers, really need to know where
>  everything is by byte offset, whether or not the bytes are significant;
>  thus the -XML-SPACE="COLLAPSE" behavior means they can't read the text
>  with an XML processor (unless they can turn off -XML-SPACE processing
>  through the API)

This explanation of byte-offset requirements is a bit confusing: we could
always add an interface to parsers to give the current byte offset within
the underlying entity. The real problem is that in many linking
applications we would like to address characters within an element, and
here the distinction becomes more problematic, as the user's view of the
element differs from the system's view by some number of "insignificant"
characters. That's why "insignificance" should die.

>So there are a few things we could do, which are not entirely
>mutually exclusive.
>
>1. Go to the RE Delenda Est model. This has the advantage that it's
>   trivially easy to explain, document, and implement. It has some of the
>   disadvantages listed above; there is some very strong sentiment on the
>   ERB against this - look for a follow-up from other ERB-folk.

I doubt the pretty-printing contingent will accept this (although CRLF
before the end ">" of tags actually allows almost-pretty printing).
I still like this, but given its overall unpopularity, I don't expect it to
live. (Just too simple to survive, I guess.)

>3. Add language to the spec allowing the application to force the
>   processor to pass through all the bytes regardless of the -XML-SPACE
>   setting.

4. Pass all whitespace through in DTD-free parsing mode. For the
   pretty-printing contingent, we make validating parsing mode different:
   it will strip all whitespace in element content. I guess this is
   "RE delenda est" with a failover.

In addition, optionally allow 3: validating parsers may be made to pass all
space (and comments) to applications (like indexers) that require it. We
should add a note that processors that depend on such space for any form of
human presentation or formatting violate the intent of the standard.

  -- David

I am not a number. I am an undefined character.
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Thursday, 12 December 1996 12:51:54 UTC