
Re: Element content the real issue?...




Paul Prescod <papresco@calum.csclub.uwaterloo.ca> wrote:
>
> What about a third rule:
>
> 1. All white space, including RS and RE, immediately following start tags and
>    immediately preceding end tags is not significant.
>
> 2. All other RS/REs are collapsed to a single space.
>
> 3. All quasi-elements containing only whitespace characters are not
>    significant.


That should be, erm, "quasi-pseudoelement"...  or maybe not :-)


You may be onto something here.  How about the following
as a heuristic to distinguish element content from mixed content:

    3. If the only data appearing between two tags is a sequence of
       lexical SEPCHARs (including RS and RE), then it is deemed
       insignificant.

where "lexical" means SEPCHARs that appear as SEPCHARs
in the input (as opposed to e.g., <P>&#RE;&space;&space;&#RE;</P>),
and "data" is as per ISO 8879.

This heuristic will incorrectly strip out any "true" pseudoelements
that contain nothing but lexical whitespace -- these would have to be
escaped or entered as references as you point out -- but I think it
will do the right thing in all other cases.
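
To make the heuristic concrete, here is a rough Python sketch of rule (3).
It is only an illustration under simplifying assumptions, not a real SGML
or XML parser: a "tag" is taken to be anything matching <...>, the literal
separator characters are taken to be space, tab, RS (line feed) and RE
(carriage return), and the names tokenize() and strip_insignificant() are
made up for this example.

```python
import re

# Assumption: a tag is anything of the form <...>. Comments, marked
# sections and processing instructions are not handled.
TAG = re.compile(r"<[^>]*>")

# Assumption: the literal separator characters are space, tab,
# RE (carriage return) and RS (line feed).
SEPCHARS = " \t\r\n"


def tokenize(text):
    """Split text into ("tag", s) and ("data", s) tokens, in input order."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        tokens.append(("tag", match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens


def strip_insignificant(text):
    """Drop data that sits between two tags and is only literal SEPCHARs."""
    tokens = tokenize(text)
    kept = []
    for i, (kind, value) in enumerate(tokens):
        between_tags = (
            kind == "data"
            and 0 < i < len(tokens) - 1
            and tokens[i - 1][0] == "tag"
            and tokens[i + 1][0] == "tag"
        )
        # References such as &#RE; are not literal separators, so data
        # made of them is never empty after stripping and is kept.
        if between_tags and value.strip(SEPCHARS) == "":
            continue
        kept.append(value)
    return "".join(kept)
```

With this sketch, the line breaks and indentation between <LIST> and <ITEM>
tags in pretty-printed element content are dropped, while
<P>&#RE;&space;&space;&#RE;</P> passes through untouched. It also shows
the failure case mentioned above: in <P>a <B>b</B> <I>c</I></P> the single
space between </B> and <I> is removed unless it is entered as a reference.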


I forget... what was the rationale behind rules (1) and (2)?
(I know it's a common application convention, but what was the
reason for making it mandatory for all XML document types?)



--Joe English

  jenglish@crl.com

