- From: Norman Walsh <ndw@nwalsh.com>
- Date: Mon, 18 Feb 2013 12:08:34 -0600
- To: public-xml-core-wg@w3.org
- Message-ID: <m2sj4txzod.fsf@nwalsh.com>
Paul Grosso <paul@paulgrosso.name> writes:
> -------- Original Message --------
> Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: xml-editor@w3.org
> Date: Sat, 16 Feb 2013 22:56:36 +0000
> From: Costello, Roger L. <costello@mitre.org>
> To: xml-editor@w3.org <xml-editor@w3.org>
>
> Hi Folks,
>
> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?
Not that I'm aware of. But if such a tool exists for "text files", it
applies equally to XML documents, I think.
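
For what it's worth, a rough check is easy to sketch. The Python below is only an illustration, not an existing tool. The function name report_normalization is made up, "normalized" is taken to mean NFC, and "composing character" is approximated by a non-zero canonical combining class, which is narrower than the Character Model definition. It only looks at text and attribute values, not at names, comments or processing instructions.

```python
import unicodedata
import xml.etree.ElementTree as ET


def starts_with_combining(text):
    """True if the first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def report_normalization(xml_source):
    """Return (element tag, problem) pairs for text and attribute values.

    Hypothetical helper: an approximation of a "fully normalized" check,
    not an implementation of the XML 1.1 definition.
    """
    problems = []
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        chunks = [elem.text, elem.tail] + list(elem.attrib.values())
        for chunk in chunks:
            if not chunk:
                continue
            # A string is in NFC if normalizing it changes nothing.
            if unicodedata.normalize("NFC", chunk) != chunk:
                problems.append((elem.tag, "not NFC"))
            if starts_with_combining(chunk):
                problems.append((elem.tag, "starts with combining character"))
    return problems
```

Run against the element from question 2, written with the reference &#x338;, this sketch reports that the content starts with a combining character. Whether that makes the document "not fully normalized" in the sense of the Rec is the open question.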
> 2. This element:
>
> <comment>&#x338;</comment>
>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining
> solidus overlay character.
The content of that element is a single text node containing a single
character that happens to be the combining solidus overlay character.
I suppose you could ask "is the string content of that text node Unicode
normalized?" but I don't think XML cares.
> 3. Section 2.13 of the XML 1.1 specification says:
>
> XML applications that create XML 1.1 output from either XML 1.1 or
> XML 1.0 input SHOULD ensure that the output is fully normalized
>
> What should an XML application output, given this non-fully-normalized input:
>
> <comment>&#x338;</comment>
>
> How does an XML application "ensure that the output is fully normalized"?
I think a processor that outputs
<comment>&#x338;</comment>
*has* produced fully normalized output. There's nothing that isn't Unicode
normalized about that sequence of 27 characters.
A processor that produced
<comment>X</comment>
where "X" is a single U+0338 character would not be producing
normalized output, IMHO.
> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
>
> <comment> >&#x338; </comment>
>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
>
> <comment> ≯ </comment>
That is an interesting question. Do we define "normalizing XML
applications"? Where?
Saxon and rxp produce
<comment> &gt;̸ </comment>
(where the combining slash applies to the semicolon on my display),
but MarkLogic produces
<comment> ≯ </comment>
I don't know if both of those are conformant or not. The only
reference I see to Unicode normalization in the XML Rec is in
non-normative Appendix J.
This stylesheet demonstrates that the problem is at the data model level:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" encoding="utf-8" indent="no"
              omit-xml-declaration="yes"/>

  <xsl:variable name="comment" as="element()">
    <comment> >&#x338; </comment>
  </xsl:variable>

  <xsl:template match="/">
    <doc>
      <xsl:value-of select="string-length($comment/node())"/>
    </doc>
  </xsl:template>

</xsl:stylesheet>
Saxon produces <doc>4</doc>; MarkLogic <doc>3</doc>.
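
The same difference in length can be reproduced outside XSLT. A small Python sketch, assuming the text node holds space, ">", U+0338, space, and assuming the shorter count is the result of NFC composition (a guess about the cause, not something either product's documentation is being quoted on):

```python
import unicodedata

# Text node as parsed: space, ">", combining solidus overlay, space.
decomposed = " >\u0338 "

# NFC composes ">" + U+0338 into the single character U+226F.
composed = unicodedata.normalize("NFC", decomposed)

decomposed_length = len(decomposed)
composed_length = len(composed)
print(decomposed_length, composed_length)
```

The lengths match the two results above: four characters before composition, three after.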
> However, if the combining solidus overlay character follows a greater-than character that is part of a start-tag:
>
> <comment>&#x338;</comment>
>
> then normalizing XML applications do not combine them:
>
> <comment>/</comment>
>
> There must be some W3C document which says, "The long solidus
> combining character shall not combine with the '>' in a start tag
> but it shall combine with the '>' if it is located elsewhere."
We don't need to say that. Recognizing markup boundaries takes place before
or, at the very latest, at the same time as entity expansion. There *is no*
">" in front of the &#x338; character when the entity is expanded.
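
The ordering argument can be illustrated with a short Python sketch. It assumes the element is written with the reference &#x338; and uses the standard xml.etree.ElementTree parser. The "wrong order" half is a deliberately naive, hypothetical pipeline, not a description of any real processor.

```python
import unicodedata
import xml.etree.ElementTree as ET

source = "<comment>&#x338;</comment>"

# Right order: the parser recognizes the tags first, then expands the
# reference, so the content is just the single character U+0338.
parsed_text = ET.fromstring(source).text

# Wrong order (hypothetical): expand the reference as plain text,
# normalize the whole string, and only then try to parse it.
expanded = source.replace("&#x338;", "\u0338")
mangled = unicodedata.normalize("NFC", expanded)
try:
    ET.fromstring(mangled)
    reparsed_ok = True
except ET.ParseError:
    # The ">" that closed the start tag has been absorbed into U+226F.
    reparsed_ok = False
```

In the wrong order the start tag loses its closing ">" and the result no longer parses, which is why normalization cannot sensibly be applied across markup boundaries.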
Be seeing you,
norm
--
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 512 761 6676
www.marklogic.com
Received on Monday, 18 February 2013 18:09:05 UTC