Re: RS/RE, again (sorry)
At 12:01 PM 12/11/96, Tim Bray wrote:
I'll try to be brief, as I've said enough on this previously, but since my
position is maybe slightly different this time around, I'll comment. I will
also refer to some other responses to this note.
>Some feel the mechanism is unaesthetic and prone to misinterpretation
>as it stands, and should simply be discarded, with all non-markup
>bytes being passed to the application to do with as it will. This
>can be made 8879-compliant in the short term via a mechanism proposed by
>Charles Goldfarb, and in the medium/long term with a TC via WG8.
As noted, documents parsed with and without DTDs will be different under
this approach. Paul P. and Eve Maler have claimed that ignoring whitespace
in element content will be a great hardship -- but I've not seen the proof
of this yet. Validating parsers could ignore those spaces for applications.
>There are some problems with both the current and revised approaches:
>o -XML-SPACE, although this is not documented, really only deals with
> mixed content; many feel it's important to ignore white space in
> element content, e.g.
>   <list> <item>..</item> <item>..</item> </list>
> but XML, when there's no DTD, doesn't know where element content is and
> *cannot* be made to do this.
If this is the case, the only good justification for this hack has died, I
>o SGML's world-view tripartitions the set of characters:
> those that are text, those that are markup, and those that are
> insignificant white space. Can XML really afford to discard this
I think it must. See below for one reason why. The other is even simpler.
This tripartite division confuses almost everybody. If we have trouble
keeping it straight without dedicating significant effort, how could we
propagate it further to a more naive, less motivated public?
>o Many real-world editors, largely to deal with the fact that
> text (whether or not we like it) is stored in files in what amounts
> to a series of records, freely insert line breaks and other white
> space because they know SGML processors will ignore it. Can we
> afford to make that white space significant?
Since those editors will require a minor facelift to work with XML anyway,
removing a few "\n"s in the code is likely to be easy. I think this issue
is less important than some make it out to be. Yes, we might have to change
software, but no, it is not a hard change, even added to the few others we
have already admitted.
>o Some applications, e.g. full-text indexers, really need to know where
> everything is by byte offset, whether or not the bytes are significant;
> thus the -XML-SPACE="COLLAPSE" behavior means they can't read the text
> with an XML processor (unless they can turn off -XML-SPACE processing
> through the API)
This explanation of byte-offset requirements is a bit confusing: we could
always add an interface to parsers to give the current byte offset within
the underlying entity. The real problem is that in many linking
applications we would like to address characters within an element, and
here the distinction becomes more problematic, as the user's view of the
element differs from the system's view by some number of "insignificant"
characters. That's why "insignificance" should die.
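
To make the addressing problem concrete, here is a minimal sketch in
Python. The collapse() helper is only my assumed reading of what a
COLLAPSE-style rule would do (squeeze each run of white space to one space
and trim the ends); it is not taken from any draft. The sample text and
the names are hypothetical.

```python
import re

def collapse(text):
    # Assumed COLLAPSE behavior: one space per whitespace run, ends trimmed.
    return re.sub(r"\s+", " ", text).strip()

# Element content as stored in the entity, with the line breaks and
# indentation an editor might have inserted.
raw = "\n  Call me\n  Ishmael.\n"

# Element content as the user (and a collapsing processor) would see it.
seen = collapse(raw)

# The "same" character position differs between the two views.
word = "Ishmael"
raw_offset = raw.index(word)    # position in the stored characters
seen_offset = seen.index(word)  # position in the collapsed characters

print(repr(seen), raw_offset, seen_offset)
```

A link that says "the 9th character of this element" lands on different
text depending on which view it counts in, which is exactly the mismatch
described above.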
>So there are a few things we could do, which are not entirely
>1. Go to the RE Delenda Est model. This has the advantage that it's
> trivially easy to explain, document, and implement. It has some of the
> disadvantages listed above; there is some very strong sentiment on the
> ERB against this - look for a follow-up from other ERB-folk.
I doubt the pretty-printing contingent will accept this (although CRLF
before the end ">" of tags actually allows almost-pretty printing). I still
like this, but given its overall unpopularity, don't expect it to live.
(Just too simple to survive, I guess).
>3. Add language to the spec allowing the application to force the
> processor to pass through all the bytes regardless of the -XML-SPACE
4. Pass all whitespace through in DTD-free parsing mode. For the
pretty-printing contingent, we make validating parsing mode different: it
will strip all whitespace in element content. I guess this is "RE delenda
est" with a failover.
In addition, optionally allow 3: validating parsers may be made to pass all
space (and comments) to applications (like indexers) that require it. We
should add a note that processors that depend on such space for any form of
human-presentation or formatting violate the intent of the standard.
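
A rough sketch of how 4 plus the optional 3 might behave, in Python. The
tree model here is entirely hypothetical (an element is a (name, children)
pair, children are strings or nested elements), as are the names deliver,
ELEMENT_CONTENT and pass_all. The ELEMENT_CONTENT set stands in for what a
validating parser would learn from the DTD.

```python
# Assumed to come from a DTD: elements whose content model allows no #PCDATA.
ELEMENT_CONTENT = {"list"}

def deliver(element, validating, pass_all=False):
    """Return the element as the application would receive it."""
    name, children = element
    out = []
    for child in children:
        if isinstance(child, str):
            # Only a validating parser knows where element content is, so
            # only it can drop whitespace-only data there. pass_all is the
            # optional switch for applications such as indexers.
            droppable = (validating and not pass_all
                         and name in ELEMENT_CONTENT
                         and child.strip() == "")
            if not droppable:
                out.append(child)
        else:
            out.append(deliver(child, validating, pass_all))
    return (name, out)

doc = ("list", ["\n  ", ("item", ["one"]), "\n  ", ("item", [" two "]), "\n"])

dtd_free = deliver(doc, validating=False)                  # everything passed
validated = deliver(doc, validating=True)                  # element-content space stripped
indexer = deliver(doc, validating=True, pass_all=True)     # everything passed on request
```

Note that white space inside <item>, which is character data, survives in
every mode; only the white space between the items is affected.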
I am not a number. I am an undefined character.
David Durand email@example.com \ david@dynamicDiagrams.com
Boston University Computer Science \ Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams
MAPA: mapping for the WWW \__________________________