Re: Draft #1 response to: Is there a tool which tells me if my XML is "fully normalized"? from Daniel Veillard on 2013-03-12 (public-xml-core-wg@w3.org from March 2013)

From: Daniel Veillard <veillard@redhat.com>
Date: Tue, 12 Mar 2013 21:53:44 +0800
To: Paul Grosso <paul@paulgrosso.name>
Cc: core <public-xml-core-wg@w3.org>
Message-ID: <20130312135344.GI26720@redhat.com>
On Mon, Mar 11, 2013 at 12:44:14PM -0500, Paul Grosso wrote:
> Here is my first draft response to Roger about "fully normalized"
> XML. This is not my area of expertise, so please comment.
> 
> Daniel, does your parser include a user option to verify
> that the document is fully-normalized?

  no, that's something I didn't dive into, and nobody raised any
concern about it. One of the reasons I abstained from commenting :-)

[...]
> >
> >-------- Original Message --------
> >Subject: Is there a tool which tells me if my XML is "fully normalized"?
> >Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> >Resent-From: xml-editor@w3.org
> >Date: Sat, 16 Feb 2013 22:56:36 +0000
> >From: Costello, Roger L. <costello@mitre.org>
> >To: xml-editor@w3.org <xml-editor@w3.org>
> >
> >
> >Hi Folks,
> 
> Hi Roger,
> 
> By way of generalities:
> 
> * As you know, the Character Model spec [1] defines and discusses
> fully-normalized text.
> 
> * The XML specifications mostly define what XML processors should
> and must do, and only occasionally suggest what XML applications
> should (but never must) do. I've tried to use these terms precisely
> in this response.
> 
> * XML 1.0 doesn't say anything about such normalization (the use of
> the word "normalization" in XML 1.0 is related to attribute value
> normalization which has nothing to do with Unicode normalization).
> 
> * XML 1.1 says [2] that the relevant constructs of all XML input
> should be fully normalized, and it lists the relevant constructs
> as those constructs in an XML document containing character data
> plus the constructs containing Names and Nmtokens. Note that this
> implies that markup is recognized before considering the normalization
> of the character content, so things like combining characters do not
> combine with markup characters as far as XML processors are concerned.
> 
> It does also say that:
> XML processors SHOULD provide a user option to verify that the
> document being processed is in fully normalized form, and report
> to the application whether it is or not.
> but we are not aware of any processor that currently provides such
> a user option.
> 
> Finally, it says that:
> XML processors MUST NOT transform the input to be in fully
> normalized form. XML applications that create XML 1.1 output
> from either XML 1.1 or XML 1.0 input SHOULD ensure that the
> output is fully normalized....
> 
> [1] http://www.w3.org/TR/charmod-norm/#sec-FullyNormalized
> [2] http://www.w3.org/TR/xml11/#sec-normalization-checking
> 
> >
> >1. Is there a tool which evaluates an XML document and returns an
> >indication of whether it is fully normalized or not?
> 
> We are not aware of any such tool, but if such a tool exists
> for "text files", it should apply equally to XML documents.

  I would assume ICU, being the beast that it is, has an option for it :-)
Their web site seems down (or blocked from China), but
Wikipedia suggests that it is part of the toolkit:
  http://en.wikipedia.org/wiki/International_Components_for_Unicode
"ICU provides the following services: Unicode text handling, full
 character properties, ... Language sensitive collation and searching;
 normalization, upper and lowercase conversion, ..."

 but I never used it directly.
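
 For what it's worth, here is a rough sketch of what such a check could
 look like on plain text. It uses Python's standard unicodedata module as
 a stand-in for ICU, which is my assumption and not something I have tried
 against real documents. The function names are made up for illustration,
 and the test is only an approximation of the CharMod definition: it checks
 that the text is in NFC and does not begin with a character that has a
 non-zero combining class.

```python
import unicodedata


def starts_with_combining(text: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def looks_fully_normalized(text: str) -> bool:
    """Approximate check: text is in NFC and does not start with a combining mark.

    This simplifies the CharMod definition, which is phrased in terms of
    "composing characters" and include-normalization.
    """
    in_nfc = unicodedata.normalize("NFC", text) == text
    return in_nfc and not starts_with_combining(text)


if __name__ == "__main__":
    # Plain ASCII, precomposed e-acute, e + combining acute,
    # a lone U+0338, and ">" followed by U+0338.
    for sample in ["abc", "\u00e9", "e\u0301", "\u0338", ">\u0338"]:
        print(ascii(sample), looks_fully_normalized(sample))
```

 For XML one would have to apply this per relevant construct (character
 data, Names, Nmtokens) after markup has been recognized, not to the file
 as a whole.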



> Google found at [3] a mention of a project to add normalization
> checking to Xerces, but I could not find any definitive evidence
> that such a project was completed.
> 
> At [4], the CharMod spec lists some "freely available programming
> resources related to normalization".
> 
> [3] http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
> [4] http://www.w3.org/TR/charmod-norm/#sec-n11n-resources
> 

indeed ICU is listed there,

Daniel
> >
> >2. This element:
> >
> ><comment>&#x338;</comment>
> >
> >is not fully normalized, right? (Since the content of the <comment>
> >element begins with a combining character and "content" is defined
> >to be a "relevant construct.") Note: hex 338 is the combining solidus
> >overlay character.
> 
> That element is fully normalized--see below.
> 
> >
> >3. Section 2.13 of the XML 1.1 specification says:
> >
> >XML applications that create XML 1.1 output from either XML 1.1 or
> >XML 1.0 input SHOULD ensure that the output is fully normalized
> >
> >What should an XML application output, given this
> >non-fully-normalized input:
> >
> ><comment>&#x0338;</comment>
> >
> >How does an XML application "ensure that the output is fully normalized"?
> 
> An application that produces
> 
> <comment>&#x0338;</comment>
> 
> has produced fully normalized output. There's nothing that isn't
> Unicode normalized about that sequence of 27 characters.
> 
> An application that produced
> 
> <comment>X</comment>
> 
> where "X" is a single U+0338 character would not be producing
> normalized output.
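
  To make the difference between the two serializations concrete, here is
  a small sketch using Python's unicodedata module (my choice of tool, not
  something the specs mention). The helper name is_nfc is hypothetical. It
  treats both documents as plain strings, so it only illustrates the
  character-level point, not the per-construct rule in XML 1.1.

```python
import unicodedata

# The character reference form: nothing but ASCII characters.
escaped = "<comment>&#x0338;</comment>"

# The literal form: a raw U+0338 directly after the start-tag.
literal = "<comment>\u0338</comment>"


def is_nfc(text: str) -> bool:
    """True if NFC normalization leaves the string unchanged."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    # Length, whether the string is pure ASCII, and whether it is already NFC.
    print(len(escaped), escaped.isascii(), is_nfc(escaped))
    print(len(literal), literal.isascii(), is_nfc(literal))
```

  The escaped form is untouched by normalization because it contains no
  combining character at all, only the ASCII text of the reference.
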
> 
> Note that the above quote from section 2.13 of XML 1.1 is talking
> about applications that create XML. In your question, you are
> asking what an application (that presumably will output XML) should
> do when given (presumably XML) input that is not fully normalized.
> So the application that produced the original non-normalized XML
> did something it "shouldn't" have done, and your question is what
> "should" the downstream application do about that.
> 
> No XML specification says anything about that, so the downstream
> application is free to do as it wishes. This is just like an XML
> editor that may adjust white space within character data or emit
> double quotes around attribute values where the input may have had
> single quotes, etc.
> 
> 
> >
> >4. If the combining solidus overlay character follows a greater-than
> >character in element content:
> >
> ><comment> &gt;&#x0338; </comment>
> >
> >then normalizing XML applications will combine them to create the
> >not-greater-than character:
> >
> ><comment> ≯ </comment>
> 
> As mentioned above, the input you show is normalized, so there
> are really two questions here:
> 
> 4a. What should an application do with:
> 
> <comment> &gt;&#x0338; </comment>
> 
> 4b. What should an application do with:
> 
> <comment> &gt;X </comment>
> 
> where X is the single U+0338 character.
> 
> 4a isn't a normalization issue; 4b is. But as discussed under 3
> above, an application given either such input is free to do anything
> reasonable with either of those inputs.
> 
> Given 4a, we have found XML applications (e.g., Saxon) that produce:
> <comment> &gt;/ </comment>
> as well as those (e.g., MarkLogic, Arbortext Editor) that produce:
> <comment> ≯ </comment>
> 
> Similarly, given text input of "e&acute;", some XML editors
> write out é while others leave it as is. (Arbortext Editor has
> an option setting to get either behavior.)
> 
> All such behaviors are allowable.
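
  A sketch of the 4a case, assuming Python's xml.etree.ElementTree as the
  XML processor and unicodedata for the normalization step (both are my
  assumptions, picked only because they ship with Python). The parser hands
  back the expanded character data as is, and it is then up to the
  application whether to normalize it.

```python
import unicodedata
import xml.etree.ElementTree as ET

source = "<comment> &gt;&#x0338; </comment>"

# The parser expands both references but does not normalize the result.
parsed_text = ET.fromstring(source).text

# An application is free to keep the text as parsed, or to normalize it.
normalized_text = unicodedata.normalize("NFC", parsed_text)

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in parsed_text])
    print([f"U+{ord(c):04X}" for c in normalized_text])
```

  Writing out parsed_text unchanged and writing out normalized_text
  correspond to the two kinds of application behavior described above.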
> 
> 
> >
> >However, if the combining solidus overlay character follows a
> >greater-than
> >character that is part of a start-tag:
> >
> ><comment>&#x0338;</comment>
> >
> >then normalizing XML applications do not combine them:
> >
> ><comment>/</comment>
> >
> >There must be some W3C document which says, "The long solidus combining
> >character shall not combine with the '>' in a start tag but it shall
> >combine with the '>' if it is located elsewhere."
> 
> Again, there are two questions:
> 
> 4c. What should an application do with:
> 
> <comment>&#x0338;</comment>
> 
> 4d. What should an application do with:
> 
> <comment>X</comment>
> 
> where X is the single U+0338 character.
> 
> In the 4c case as you show above, there is no normalization issue.
> Recognizing markup boundaries takes place before--or, at the very
> latest, at the same time as--entity expansion. So there is no
> ">" in front of the &#x0338; when the entity is expanded.
> 
> In the 4d case, there is a normalization issue. But an XML
> processor MUST NOT normalize its input, so when an XML processor
> is handed 4d as input, it will recognize markup boundaries as
> usual so that the comment element will end up with character
> data content consisting of the single U+0338 character which
> will have nothing with which to combine.
> 
> Note the previous paragraph talked about XML processors. In
> theory, an XML application could have a lexicographic layer--that
> preceded the parsing by the XML processor--in which normalization
> was done. In this case, the U+0338 character would presumably be
> combined with the > resulting in
> 
> <comment≯</comment>
> 
> which would not be well-formed XML and would therefore
> presumably be rejected by the XML processor. While there is no
> W3C specification that forbids such behavior by an XML application,
> one would expect users of such an application to file bug reports
> or stop using such an application.
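
  The 4d case can be sketched the same way, again assuming Python's
  xml.etree.ElementTree and unicodedata (my choice, and the helper name
  parses is made up). Parsing the raw document gives an element whose
  content is the lone U+0338, while normalizing the whole document as text
  before parsing swallows the ">" of the start-tag.

```python
import unicodedata
import xml.etree.ElementTree as ET

# A raw U+0338 immediately after the start-tag (the 4d input).
raw = "<comment>\u0338</comment>"


def parses(document: str) -> bool:
    """True if the document is accepted by the parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Markup is recognized first, so the content is just the combining character.
content = ET.fromstring(raw).text

# A hypothetical layer that normalizes the text before the parser sees it.
prenormalized = unicodedata.normalize("NFC", raw)

if __name__ == "__main__":
    print(ascii(content))
    print(ascii(prenormalized))
    print(parses(raw), parses(prenormalized))
```
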
> 
> 
> 

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
Received on Tuesday, 12 March 2013 13:54:22 UTC