Draft #1 response to: Is there a tool which tells me if my XML is "fully normalized"? from Paul Grosso on 2013-03-11 (public-xml-core-wg@w3.org from March 2013)

From: Paul Grosso <paul@paulgrosso.name>
Date: Mon, 11 Mar 2013 12:44:14 -0500
To: core <public-xml-core-wg@w3.org>
Message-ID: <513E17EE.9040009@paulgrosso.name>
Here is my first draft response to Roger about "fully normalized"
XML. This is not my area of expertise, so please comment.

Daniel, does your parser include a user option to verify
that the document is fully-normalized?

Henry, at http://www.w3.org/XML/2002/09/xml11-implementation
it says that RXP "incorporates code from Martin Duerst to
optionally check for Unicode character normalization." Is
there something we can say to Roger about this?

>
> -------- Original Message --------
> Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: xml-editor@w3.org
> Date: Sat, 16 Feb 2013 22:56:36 +0000
> From: Costello, Roger L. <costello@mitre.org>
> To: xml-editor@w3.org <xml-editor@w3.org>
>
>
> Hi Folks,

Hi Roger,

By way of generalities:

* As you know, the Character Model spec [1] defines and discusses
fully-normalized text.

* The XML specifications mostly define what XML processors should
and must do, and only occasionally suggest what XML applications
should (but never must) do. I've tried to use these terms precisely
in this response.

* XML 1.0 doesn't say anything about such normalization (the use of
the word "normalization" in XML 1.0 is related to attribute value
normalization which has nothing to do with Unicode normalization).

* XML 1.1 says [2] that the relevant constructs of all XML input
should be fully normalized, and it lists the relevant constructs
as those constructs in an XML document containing character data
plus the constructs containing Names and Nmtokens. Note that this
implies that markup is recognized before considering the normalization
of the character content, so things like combining characters do not
combine with markup characters as far as XML processors are concerned.

It also says that:

    XML processors SHOULD provide a user option to verify that the
    document being processed is in fully normalized form, and report
    to the application whether it is or not.

but we are not aware of any processor that currently provides such
a user option.

Finally, it says that:

    XML processors MUST NOT transform the input to be in fully
    normalized form. XML applications that create XML 1.1 output
    from either XML 1.1 or XML 1.0 input SHOULD ensure that the
    output is fully normalized....

[1] http://www.w3.org/TR/charmod-norm/#sec-FullyNormalized
[2] http://www.w3.org/TR/xml11/#sec-normalization-checking

>
> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?

We are not aware of any such tool, but if such a tool exists
for "text files", it should apply equally to XML documents.

Google found at [3] a mention of a project to add normalization
checking to Xerces, but I could not find any definitive evidence
that such a project was completed.

At [4], the CharMod spec lists some "freely available programming
resources related to normalization".

[3] http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
[4] http://www.w3.org/TR/charmod-norm/#sec-n11n-resources
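
To illustrate the "text file" approach, here is a minimal Python
sketch. It assumes that comparing the raw document text against
Unicode Normalization Form C is an acceptable approximation; it does
not implement the full CharMod definition of fully-normalized text
(it ignores character escapes and construct boundaries). The function
name is made up for this example.

```python
import unicodedata


def raw_text_is_nfc(text: str) -> bool:
    """Return True if NFC normalization leaves the raw text unchanged.

    This treats the XML document as a plain character sequence; markup
    is not recognized and character references are not expanded.
    """
    return unicodedata.normalize("NFC", text) == text


# "e" followed by COMBINING ACUTE ACCENT (U+0301) is not in NFC,
# while the precomposed U+00E9 is.
decomposed = "<p>e\u0301</p>"
precomposed = "<p>\u00e9</p>"
print(raw_text_is_nfc(decomposed), raw_text_is_nfc(precomposed))
```

To check a file, one would read it with the correct encoding and pass
the resulting string to the function.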

>
> 2. This element:
>
> <comment>&#x338;</comment>
>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining solidus
> overlay character.

That element is fully normalized--see below.

>
> 3. Section 2.13 of the XML 1.1 specification says:
>
> XML applications that create XML 1.1 output from either XML 1.1 or
> XML 1.0 input SHOULD ensure that the output is fully normalized
>
> What should an XML application output, given this non-fully-normalized
> input:
>
> <comment>&#x0338;</comment>
>
> How does an XML application "ensure that the output is fully normalized"?

An application that produces

<comment>&#x0338;</comment>

has produced fully normalized output. There's nothing that isn't
Unicode normalized about that sequence of 27 characters.

An application that produced

<comment>X</comment>

where "X" is a single U+0338 character would not be producing
normalized output.
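
The difference between those two outputs can be seen with a short
Python sketch. It uses NFC from the standard unicodedata module as a
stand-in for "Unicode normalized"; the variable and function names
are only for illustration.

```python
import unicodedata

# Character reference spelled out: 27 ASCII characters, no combining mark.
escaped = "<comment>&#x0338;</comment>"
# A real U+0338 character directly after the '>' of the start-tag.
literal = "<comment>\u0338</comment>"


def is_nfc(s: str) -> bool:
    # True when NFC normalization would not change the string.
    return unicodedata.normalize("NFC", s) == s


print(len(escaped), is_nfc(escaped))
print(is_nfc(literal))
```

The first string is unchanged by normalization; the second is not,
because the U+0338 character can combine with the preceding ">".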

Note that the above quote from section 2.13 of XML 1.1 is talking
about applications that create XML. In your question, you are
asking what an application (that presumably will output XML) should
do when given (presumably XML) input that is not fully normalized.
So the application that produced the original non-normalized XML
did something it "shouldn't" have done, and your question is what
"should" the downstream application do about that.

No XML specification says anything about that, so the downstream
application is free to do as it wishes. This is just like an XML
editor that may adjust white space within character data or emit
double quotes around attribute values where the input may have had
single quotes, etc.


>
> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
>
> <comment> &gt;&#x0338; </comment>
>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
>
> <comment> ≯ </comment>

As mentioned above, the input you show is normalized, so there
are really two questions here:

4a. What should an application do with:

<comment> &gt;&#x0338; </comment>

4b. What should an application do with:

<comment> &gt;X </comment>

where X is the single U+0338 character.

4a isn't a normalization issue; 4b is. But as discussed under 3
above, an application given either input is free to do anything
reasonable with it.

Given 4a, we have found XML applications (e.g., Saxon) that produce:
<comment> &gt;/ </comment>
as well as those (e.g., MarkLogic, Arbortext Editor) that produce:
<comment> ≯ </comment>

Similarly, given text input of "e&acute;", some XML editors
write out é while others leave it unchanged. (Arbortext Editor has
an option setting to get either behavior.)

All such behaviors are allowable.
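
As a rough illustration of the two behaviors for 4a, the following
Python sketch parses the input with the standard library's
ElementTree and serializes it twice: once unchanged and once after
applying NFC to the character data. It is not meant to describe how
any of the products named above work internally.

```python
import unicodedata
import xml.etree.ElementTree as ET

elem = ET.fromstring("<comment> &gt;&#x0338; </comment>")
# After parsing, the content is: space, '>', U+0338, space.

# Behavior 1: write the content back without normalizing it.
as_is = ET.tostring(elem, encoding="unicode")

# Behavior 2: normalize the character data to NFC before writing,
# which composes '>' + U+0338 into U+226F (NOT GREATER-THAN).
elem.text = unicodedata.normalize("NFC", elem.text)
composed = ET.tostring(elem, encoding="unicode")

print(as_is)
print(composed)
```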


>
> However, if the combining solidus overlay character follows a
> greater-than character that is part of a start-tag:
>
> <comment>&#x0338;</comment>
>
> then normalizing XML applications do not combine them:
>
> <comment>/</comment>
>
> There must be some W3C document which says, "The long solidus combining
> character shall not combine with the '>' in a start tag but it shall
> combine with the '>' if it is located elsewhere."

Again, there are two questions:

4c. What should an application do with:

<comment>&#x0338;</comment>

4d. What should an application do with:

<comment>X</comment>

where X is the single U+0338 character.

In the 4c case as you show above, there is no normalization issue.
Recognizing markup boundaries takes place before--or, at the very
latest, at the same time as--character reference expansion. So there
is no ">" in front of the U+0338 character when the reference
&#x0338; is expanded.

In the 4d case, there is a normalization issue. But an XML
processor MUST NOT normalize its input, so when an XML processor
is handed 4d as input, it will recognize markup boundaries as
usual so that the comment element will end up with character
data content consisting of the single U+0338 character which
will have nothing with which to combine.
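
For what it is worth, this is easy to observe with a non-normalizing
parser. The sketch below uses Python's standard ElementTree (an
XML 1.0 parser built on expat) as an example processor; the variable
names are only for illustration.

```python
import xml.etree.ElementTree as ET

# 4c: the combining character arrives as a character reference.
content_4c = ET.fromstring("<comment>&#x0338;</comment>").text
# 4d: the combining character is literally present after the '>'.
content_4d = ET.fromstring("<comment>\u0338</comment>").text

# In both cases the start-tag is recognized first, and the element's
# character data is the single U+0338 character.
print(len(content_4c), hex(ord(content_4c)))
print(len(content_4d), hex(ord(content_4d)))
```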

Note that the discussion of 4d above concerns XML processors. In
theory, an XML application could have a lexical layer--one that
preceded the parsing by the XML processor--in which normalization
was done. In this case, the U+0338 character would presumably be
combined with the > resulting in

<comment≯</comment>

which would not be well-formed XML and would therefore
presumably be rejected by the XML processor. While there is no
W3C specification that forbids such behavior by an XML application,
one would expect users of such an application to file bug reports
or stop using such an application.
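
A small Python sketch of that hypothetical pre-parse normalization
step, again using NFC and the standard ElementTree parser as
stand-ins (the names are made up for this example):

```python
import unicodedata
import xml.etree.ElementTree as ET

doc_4d = "<comment>\u0338</comment>"

# Normalizing the raw text before parsing composes the '>' of the
# start-tag with U+0338 into U+226F, destroying the tag delimiter.
pre_normalized = unicodedata.normalize("NFC", doc_4d)

try:
    ET.fromstring(pre_normalized)
    well_formed = True
except ET.ParseError:
    # The parser rejects the damaged start-tag.
    well_formed = False

print(pre_normalized)
print(well_formed)
```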
Received on Monday, 11 March 2013 17:44:44 UTC