Re: Fwd: Is there a tool which tells me if my XML is "fully normalized"? from Norman Walsh on 2013-02-18 (public-xml-core-wg@w3.org from February 2013)

From: Norman Walsh <ndw@nwalsh.com>
Date: Mon, 18 Feb 2013 12:08:34 -0600
To: public-xml-core-wg@w3.org
Message-ID: <m2sj4txzod.fsf@nwalsh.com>

Paul Grosso <paul@paulgrosso.name> writes:
> -------- Original Message --------
>     Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: xml-editor@w3.org
>        Date: Sat, 16 Feb 2013 22:56:36 +0000
>        From: Costello, Roger L. <costello@mitre.org>
>          To: xml-editor@w3.org <xml-editor@w3.org>
>
> Hi Folks,
>
> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?

Not that I'm aware of. But if such a tool exists for "text files", it
applies equally to XML documents, I think.
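
A minimal sketch of such a "text file" check, assuming Python 3.8 or later
(for unicodedata.is_normalized). The helper name is_nfc is hypothetical. It
only tests whether the character stream is in Normalization Form C; it does
not implement the stricter "fully normalized" definition that XML 1.1 applies
to relevant constructs.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.is_normalized("NFC", text)

# A precomposed e-acute is NFC; "e" followed by a combining acute is not.
print(is_nfc("<comment>caf\u00e9</comment>"))   # True
print(is_nfc("<comment>cafe\u0301</comment>"))  # False
```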

> 2. This element:
>
>         <comment>&#x338;</comment>
>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining
> solidus overlay character.

The content of that element is a single text node containing a single
character that happens to be the combining solidus overlay character.
I suppose you could ask "is the string content of that text node Unicode
normalized?" but I don't think XML cares.
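
That question about the text node can be tried directly. The sketch below
uses Python's unicodedata module and treats "non-zero canonical combining
class" as a stand-in for "combining character", which is an assumption and
not the exact definition used by the W3C character model.

```python
import unicodedata

content = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY, the whole text node

# A non-zero canonical combining class marks a combining character.
starts_with_combining = unicodedata.combining(content[0]) != 0

# A lone combining character has nothing to compose with, so NFC leaves it alone.
already_nfc = unicodedata.normalize("NFC", content) == content

print(starts_with_combining, already_nfc)  # True True
```

So the string is in NFC even though it begins with a combining character.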

> 3. Section 2.13 of the XML 1.1 specification says:
>
>         XML applications that create XML 1.1 output from either XML 1.1 or
>         XML 1.0 input SHOULD ensure that the output is fully normalized
>
> What should an XML application output, given this non-fully-normalized input:
>
>         <comment>&#x0338;</comment>
>
> How does an XML application "ensure that the output is fully normalized"?

I think a processor that outputs

  <comment>&#x0338;</comment>

*has* produced fully normalized output. There's nothing that isn't Unicode
normalized about that sequence of 27 characters.

A processor that produced

  <comment>X</comment>

where "X" is a single U+0338 character would not be producing
normalized output, IMHO.
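
The difference between the two serializations can be checked as plain
strings. This is a sketch assuming Python 3.8+ and NFC as the normalization
form; the variable names are illustrative only.

```python
import unicodedata

escaped = "<comment>&#x0338;</comment>"  # character reference kept as markup
literal = "<comment>\u0338</comment>"    # raw U+0338 right after the start-tag

# The escaped form is 27 ASCII characters, so it is trivially in NFC.
print(len(escaped), escaped.isascii())            # 27 True
print(unicodedata.is_normalized("NFC", escaped))  # True

# In the literal form the ">" that closes the start-tag is followed by
# U+0338, and NFC would compose that pair into U+226F.
print(unicodedata.is_normalized("NFC", literal))  # False
```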

> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
>
>         <comment> &gt;&#x0338; </comment>
>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
>
>         <comment> ≯ </comment>

That is an interesting question. Do we define "normalizing XML
applications"? Where?

Saxon and rxp produce

   <comment> &gt;/ </comment>

(where the combining slash applies to the semicolon on my display),
but MarkLogic produces

   <comment> ≯ </comment>

I don't know if both of those are conformant or not. The only
reference I see to Unicode normalization in the XML Rec is in
non-normative Appendix J.

This stylesheet demonstrates that the problem is at the data model level:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
                version="2.0">

<xsl:output method="xml" encoding="utf-8" indent="no"
     omit-xml-declaration="yes"/>

<xsl:variable name="comment" as="element()">
  <comment> &gt;&#x0338; </comment>
</xsl:variable>

<xsl:template match="/">
  <doc>
    <xsl:value-of select="string-length($comment/node())"/>
  </doc>
</xsl:template>

</xsl:stylesheet>

Saxon produces <doc>4</doc>; MarkLogic <doc>3</doc>.
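
The 4 versus 3 difference is consistent with NFC being applied to the text
node in one case and not the other. A small sketch using Python's
unicodedata module (an illustration of the arithmetic, not of what either
product does internally):

```python
import unicodedata

text = " >\u0338 "  # space, ">", U+0338, space: the expanded text node
composed = unicodedata.normalize("NFC", text)

print(len(text))               # 4
print(len(composed))           # 3
print(composed == " \u226f ")  # True: ">" + U+0338 composes to U+226F
```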

> However, if the combining solidus overlay character follows a greater-than character that is part of a start-tag:
>
>         <comment>&#x0338;</comment>
>
> then normalizing XML applications do not combine them:
>
>         <comment>/</comment>
>
> There must be some W3C document which says, "The long solidus
> combining character shall not combine with the '>' in a start tag
> but it shall combine with the '>' if it is located elsewhere."

We don't need to say that. Recognizing markup boundaries takes place before
or, at the very latest, at the same time as character reference expansion.
There *is no* ">" in front of the &#x0338; character when the reference is
expanded.

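A sketch of that ordering, using Python's xml.etree.ElementTree (an XML 1.0
parser that performs no Unicode normalization) as a stand-in for a generic
processor. By the time the character reference is expanded, the start-tag
has already been recognized, so its ">" is not part of the text node.

```python
import xml.etree.ElementTree as ET

# The start-tag is consumed as markup; the text node holds only U+0338.
first = ET.fromstring("<comment>&#x0338;</comment>")
print(len(first.text), first.text == "\u0338")      # 1 True

# Here the ">" comes from &gt; and so it is character data in the text node.
second = ET.fromstring("<comment> &gt;&#x0338; </comment>")
print(len(second.text), second.text == " >\u0338 ")  # 4 True
```

Only in the second case is there a ">" in the text node for U+0338 to
combine with.
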
                                        Be seeing you,
                                          norm

--
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 512 761 6676
www.marklogic.com
Received on Monday, 18 February 2013 18:09:05 UTC