Re: XML question for the experts from John Cowan on 2007-12-07 (public-xml-core-wg@w3.org from December 2007)

From: John Cowan <cowan@ccil.org>
Date: Fri, 7 Dec 2007 17:03:52 -0500
To: "Grosso, Paul" <pgrosso@ptc.com>
Cc: public-xml-core-wg@w3.org
Message-ID: <20071207220352.GE3346@mercury.ccil.org>

Grosso, Paul scripsit:

> If a serialized XML document contains:
> 
> <!--This is a comment &#x2014; pbg-->
> 
> or
> 
> <?myproc pseudoatt="this is part of a pi &#x2014; pbg"?>
> 
> then when that is read by an XML processor, is the
> &#x2014; considered to be a seven character string 
> or the Unicode em-dash character?

Clearly the former.  Comments and PIs simply contain Chars, which means
that numeric character references (NCRs) are not recognized in them.
Compare productions 15 (Comment) and 16 (PI) of the XML 1.0 grammar
with 10 (AttValue) and 43 (content).
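
A minimal sketch of that contrast, assuming Python's standard-library
xml.dom.minidom as the XML processor (the sample document and the "att"
attribute name are made up for illustration). The same reference is left
alone in the comment and the PI but expanded in the attribute value and
in element content:

```python
from xml.dom import minidom

# Hypothetical sample: the same reference appears in a comment, a PI,
# an attribute value, and element content.
SRC = (
    '<doc att="a &#x2014; b">'
    '<!--This is a comment &#x2014; pbg-->'
    '<?myproc pseudoatt="this is part of a pi &#x2014; pbg"?>'
    'text &#x2014; here'
    '</doc>'
)

doc = minidom.parseString(SRC)
root = doc.documentElement
root.normalize()  # merge adjacent text nodes, in case the parser split them
comment, pi, text = root.childNodes

print(repr(comment.data))              # reference kept as literal characters
print(repr(pi.data))                   # reference kept as literal characters
print(repr(root.getAttribute("att")))  # reference expanded to U+2014
print(repr(text.data))                 # reference expanded to U+2014
```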

> More precisely, in the infoset of such a document,
> when considering the comment or PI's [content] info item,
> would the length of the "string representing the content"
> be calculated with the "&#x2014;" part contributing 1 or 7 
> to the length?

The literal length, though note that "&#x2014;" is actually eight
characters, not seven.

> Put another way, if the following XSLT template matched
> the above comment, should the xsl:if test succeed or fail:
> 
> <xsl:template match="comment()">
>   <xsl:if test="string(.)='This is a comment — pbg'">
>     <!-- The above line's em-dash is the single U+2014 character -->
>   </xsl:if>
> </xsl:template>

Fail.
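
The length and comparison answers can be checked with a rough Python
analogue. This is only a sketch: it uses xml.dom.minidom rather than an
XSLT processor, and the helper name comment_text is hypothetical. It
compares the comment's string value against a string containing the
single U+2014 character, as the xsl:if test does:

```python
from xml.dom import minidom


def comment_text(xml_source):
    """Return the data of the first comment that is a child of the document node."""
    doc = minidom.parseString(xml_source)
    for node in doc.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            return node.data
    raise ValueError("no top-level comment found")


# The comment from the question, followed by a dummy root element.
value = comment_text('<!--This is a comment &#x2014; pbg--><doc/>')

# What the xsl:if test compares against: a real U+2014 character.
expected = 'This is a comment \u2014 pbg'

print(value == expected)          # False: the comparison fails
print(len("&#x2014;"))            # literal length of the unexpanded reference
print(len(value), len(expected))  # the two strings differ in length
```

The reference counts as its literal characters (ampersand, hash, x, four
hex digits, semicolon), so the comment's value is longer than the
expected string and the two cannot be equal.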


-- 
John Cowan
        cowan@ccil.org
                I am a member of a civilization. --David Brin

Received on Friday, 7 December 2007 22:04:02 UTC