Re: Fwd: Is there a tool which tells me if my XML is "fully normalized"? from Paul Grosso on 2013-02-20 (public-xml-core-wg@w3.org from February 2013)

From: Paul Grosso <paul@paulgrosso.name>
Date: Wed, 20 Feb 2013 11:09:39 -0600
To: public-xml-core-wg@w3.org
Message-ID: <51250353.70407@paulgrosso.name>

Norm et al.,

Something you said below confuses me.

Under 2, you say "The content of that element is a single
text node containing a single character", but then under 3,
you say "There's nothing that isn't Unicode normalized
about that sequence [of] 27 characters."

But "27 characters" implies "&#x0338;" is 8 characters,
not "a single text node containing a single character".
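
The two counts can be put side by side with a small sketch. It assumes
Python's standard library (xml.etree.ElementTree and unicodedata), which is
not something used anywhere in this thread. It measures the serialized
markup as plain text and then the text content after parsing:

```python
import unicodedata
import xml.etree.ElementTree as ET

# The element exactly as it appears in the serialized document.
serialized = "<comment>&#x0338;</comment>"

# Length of the markup as a plain string (what a text-file tool would see).
serialized_length = len(serialized)

# Length of the element's text content after parsing (what the data model sees).
text = ET.fromstring(serialized).text
parsed_length = len(text)

# A non-zero canonical combining class marks a combining character.
is_combining = unicodedata.combining(text[0]) != 0

print(serialized_length, parsed_length, hex(ord(text)), is_combining)
```

The first number counts the eight ASCII characters of the character
reference plus the tags; the second counts only the one character the
reference stands for.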

Can you unravel my confusion?

paul

On 2013-02-18 12:08, Norman Walsh wrote:
> Paul Grosso <paul@paulgrosso.name> writes:
>> -------- Original Message --------
>>      Subject: Is there a tool which tells me if my XML is "fully normalized"?
>> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
>> Resent-From: xml-editor@w3.org
>>         Date: Sat, 16 Feb 2013 22:56:36 +0000
>>         From: Costello, Roger L. <costello@mitre.org>
>>           To: xml-editor@w3.org <xml-editor@w3.org>
>>
>> Hi Folks,
>>
>> 1. Is there a tool which evaluates an XML document and returns an
>> indication of whether it is fully normalized or not?
> Not that I'm aware of. But if such a tool exists for "text files", it
> applies equally to XML documents, I think.
>
>> 2. This element:
>>
>>          <comment>&#x338;</comment>
>>
>> is not fully normalized, right? (Since the content of the <comment>
>> element begins with a combining character and "content" is defined
>> to be a "relevant construct.") Note: hex 338 is the combining
>> solidus overlay character.
> The content of that element is a single text node containing a single
> character that happens to be the combining solidus overlay character.
> I suppose you could ask "is the string content of that text node Unicode
> normalized?" but I don't think XML cares.
>
>> 3. Section 2.13 of the XML 1.1 specification says:
>>
>>          XML applications that create XML 1.1 output from either XML 1.1 or
>>          XML 1.0 input SHOULD ensure that the output is fully normalized
>>
>> What should an XML application output, given this non-fully-normalized input:
>>
>>          <comment>&#x0338;</comment>
>>
>> How does an XML application "ensure that the output is fully normalized"?
> I think a processor that outputs
>
>    <comment>&#x0338;</comment>
>
> *has* produced fully normalized output. There's nothing that isn't Unicode
> normalized about that sequence of 27 characters.
>
> A processor that produced
>
>    <comment>X</comment>
>
> where "X" is a single literal U+0338 character would not be producing
> normalized output, IMHO.
>
>> 4. If the combining solidus overlay character follows a greater-than
>> character in element content:
>>
>>          <comment> &gt;&#x0338; </comment>
>>
>> then normalizing XML applications will combine them to create the
>> not-greater-than character:
>>
>>          <comment> ≯ </comment>
> That is an interesting question. Do we define "normalizing XML
> applications"? Where?
>
> Saxon and rxp produce
>
>     <comment> &gt;/ </comment>
>
> (where the combining slash applies to the semicolon on my display),
> but MarkLogic produces
>
>     <comment> ≯ </comment>
>
> I don't know if both of those are conformant or not. The only
> reference I see to Unicode normalization in the XML Rec is in
> non-normative Appendix J.
>
> This stylesheet demonstrates that the problem is at the data model level:
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                  xmlns:xs="http://www.w3.org/2001/XMLSchema"
>   exclude-result-prefixes="xs"
>                  version="2.0">
>
> <xsl:output method="xml" encoding="utf-8" indent="no"
>      omit-xml-declaration="yes"/>
>
> <xsl:variable name="comment" as="element()">
>    <comment> &gt;&#x0338; </comment>
> </xsl:variable>
>
> <xsl:template match="/">
>    <doc>
>      <xsl:value-of select="string-length($comment/node())"/>
>    </doc>
> </xsl:template>
>
> </xsl:stylesheet>
>
> Saxon produces <doc>4</doc>; MarkLogic <doc>3</doc>.
>
>> However, if the combining solidus overlay character follows a
>> greater-than character that is part of a start-tag:
>>
>>          <comment>&#x0338;</comment>
>>
>> then normalizing XML applications do not combine them:
>>
>>          <comment>/</comment>
>>
>> There must be some W3C document which says, "The long solidus
>> combining character shall not combine with the '>' in a start tag
>> but it shall combine with the '>' if it is located elsewhere."
> We don't need to say that. Recognizing markup boundaries takes place before
> or, at the very latest, at the same time as entity expansion. There *is no*
> ">" in front of the &#x0338; character when the entity is expanded.
>
>                                          Be seeing you,
>                                            norm
>
> --
> Norman Walsh
> Lead Engineer
> MarkLogic Corporation
> Phone: +1 512 761 6676
> www.marklogic.com
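
The length difference reported for the stylesheet (4 versus 3) matches what
plain Unicode NFC normalization does to the text content. The sketch below
assumes Python's unicodedata module and only illustrates NFC; it is not a
check for XML 1.1 "fully normalized", and it does not reproduce what Saxon
or MarkLogic do internally. The helper name is_nfc is hypothetical.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


# Text content of <comment> &gt;&#x0338; </comment> after reference expansion:
# space, '>', U+0338 COMBINING LONG SOLIDUS OVERLAY, space.
decomposed = " >\u0338 "
composed = unicodedata.normalize("NFC", decomposed)

# The serialized form with the character reference is plain ASCII.
serialized = "<comment> &gt;&#x0338; </comment>"

# A literal U+0338 written directly after a start tag, normalized as raw text:
# NFC composes the tag's '>' with the combining character.
literal = "<comment>\u0338</comment>"
literal_nfc = unicodedata.normalize("NFC", literal)

print(len(decomposed), len(composed))
print(is_nfc(decomposed), is_nfc(serialized), is_nfc(literal))
```

Run over the raw text, NFC leaves the version with the character reference
alone but rewrites the version with a literal U+0338, swallowing the ">"
that closes the start tag.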
Received on Wednesday, 20 February 2013 17:10:21 UTC