Re: RS/RE, again (sorry)
(Sorry for the last post... Eudora maps Ctrl-E to send, not end of line...)
At 08:20 AM 12/17/96 -0800, Jon Bosak wrote:
>| >[Chris Maden:]
>| >| 3) A dichotomy between "DTD-ful" and DTD-less parsing will make any
>| >| sibling-based relationship difficult at best; this will affect some
>| >| TEI or HyQ based hyperlinks, as well as sibling-based stylistic
>| >| decisions.
>| >Sorry to be so slow here, but what's the connection with sibling
>| >relationships? My idea of a well-formed XML document is one for which
>| >there is just one possible tree structure; what's different about
>| >sibling relationships if a DTD is provided?
>To which a kind correspondent replied:
>| A DTD-less parser will interpret element-content whitespace as a #PCDATA
>| node. A DTD-full parser will just strip it out. The number of nodes in your
>| document will change.
>| Each newline will be a node in one, and not the other.
>Allow me to wallow in ignorance a bit further. I'm finding it hard to
>visualize a situation in which I would want to address something based
>on pseudo-element relationships rather than "genuine" tree
>relationships. It's easy to imagine cases where I would want to refer
>to the TITLE descendant of my ancestor CHAPTER, for example, but I
>have never wanted to refer to the third linefeed in an element. I'm
>not saying that such situations are inconceivable, I'm just saying
>that I've never encountered one. Is this one of those cases where 90
>percent of the complexity we're worrying about is being caused by a
>feature that in practice is used .001 percent of the time?
The problem might be better visualised....
Here is the element-content parse, where the REs are ignored; thus
there is a <LIST> with 2 child nodes, each a <UITEM> element.
<LIST> -+-- <UITEM>
        +-- <UITEM>
Here is the parse with RE's significant. Note that there are now five
subnodes of the <LIST>.
<LIST> -+-- #PCDATA "\n"
        +-- <UITEM>
        +-- #PCDATA "\n"
        +-- <UITEM>
        +-- #PCDATA "\n"
The (obvious) problem arises when you ask for the 2nd child of the <LIST>
element. The first parse will give you the second <UITEM> sub-element,
while the second parse will give you a pseudo-element containing only a "\n".
Without some way to indicate that this should be treated as element content,
this could easily become a real mess.
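
For anyone who wants to see the two child lists side by side, here is a
minimal sketch in Python. It assumes a made-up sample document (SOURCE) and
uses the standard xml.dom.minidom parser, which keeps whitespace-only text
nodes. The "DTD-ful" behaviour is only simulated by filtering those nodes
out; the helper names describe() and children() are hypothetical, not part
of any parser under discussion here.

```python
from xml.dom import minidom

# Hypothetical sample: every tag is followed by a record end ("\n").
SOURCE = "<LIST>\n<UITEM>one</UITEM>\n<UITEM>two</UITEM>\n</LIST>"


def describe(node):
    """Label a child node the way the trees above do."""
    if node.nodeType == node.TEXT_NODE:
        return "#PCDATA " + repr(node.data)
    return "<" + node.tagName + ">"


def children(source, keep_whitespace):
    """List the children of the root element.

    keep_whitespace=True mimics the DTD-less parse (REs are significant).
    keep_whitespace=False simulates an element-content parse by dropping
    whitespace-only text nodes; this is an assumption standing in for a
    real DTD-aware parser.
    """
    root = minidom.parseString(source).documentElement
    nodes = []
    for child in root.childNodes:
        blank_text = child.nodeType == child.TEXT_NODE and not child.data.strip()
        if blank_text and not keep_whitespace:
            continue
        nodes.append(describe(child))
    return nodes


element_content = children(SOURCE, keep_whitespace=False)
mixed_content = children(SOURCE, keep_whitespace=True)

print(len(element_content), element_content)
print(len(mixed_content), mixed_content)
```

The first list holds only the two <UITEM> elements, the second holds five
nodes, so any address of the form "child number N of <LIST>" has to say
which of the two lists it is counting in.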
"that which is not slightly distorted lacks sensible appeal: from which it
that irregularity - that is to say, the unexpected, surprise, and astonishment,
are an essential part and characteristic of beauty" - Charles Baudelaire