- From: Daniel Veillard <veillard@redhat.com>
- Date: Tue, 12 Mar 2013 21:53:44 +0800
- To: Paul Grosso <paul@paulgrosso.name>
- Cc: core <public-xml-core-wg@w3.org>
On Mon, Mar 11, 2013 at 12:44:14PM -0500, Paul Grosso wrote:
> Here is my first draft response to Roger about "fully normalized"
> XML. This is not my area of expertise, so please comment.
>
> Daniel, does your parser include a user option to verify
> that the document is fully-normalized?

  No, that's something I didn't dive into, and nobody raised any
concern about it. One of the reasons I abstained from commenting :-)

[...]
> >-------- Original Message --------
> >Subject: Is there a tool which tells me if my XML is "fully normalized"?
> >Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> >Resent-From: xml-editor@w3.org
> >Date: Sat, 16 Feb 2013 22:56:36 +0000
> >From: Costello, Roger L. <costello@mitre.org>
> >To: xml-editor@w3.org <xml-editor@w3.org>
> >
> >
> >Hi Folks,
>
> Hi Roger,
>
> By way of generalities:
>
> * As you know, the Character Model spec [1] defines and discusses
>   fully-normalized text.
>
> * The XML specifications mostly define what XML processors should
>   and must do, and only occasionally suggest what XML applications
>   should (but never must) do. I've tried to use these terms precisely
>   in this response.
>
> * XML 1.0 doesn't say anything about such normalization (the use of
>   the word "normalization" in XML 1.0 is related to attribute value
>   normalization which has nothing to do with Unicode normalization).
>
> * XML 1.1 says [2] that the relevant constructs of all XML input
>   should be fully normalized, and it lists the relevant constructs
>   as those constructs in an XML document containing character data
>   plus the constructs containing Names and Nmtokens. Note that this
>   implies that markup is recognized before considering the normalization
>   of the character content, so things like combining characters do not
>   combine with markup characters as far as XML processors are concerned.
>
>   It does also say that:
>     XML processors SHOULD provide a user option to verify that the
>     document being processed is in fully normalized form, and report
>     to the application whether it is or not.
>   but we are not aware of any processor that currently provides such
>   a user option.
>
>   Finally, it says that:
>     XML processors MUST NOT transform the input to be in fully
>     normalized form. XML applications that create XML 1.1 output
>     from either XML 1.1 or XML 1.0 input SHOULD ensure that the
>     output is fully normalized....
>
> [1] http://www.w3.org/TR/charmod-norm/#sec-FullyNormalized
> [2] http://www.w3.org/TR/xml11/#sec-normalization-checking
>
> >
> >1. Is there a tool which evaluates an XML document and returns an
> >indication of whether it is fully normalized or not?
>
> We are not aware of any such tool, but if such a tool exists
> for "text files", it should apply equally to XML documents.

  I would assume ICU, being the beast that it is, has an option for
it :-) Their web site seems down (or blocked from China), but Wikipedia
suggests that it is part of the toolkit:

  http://en.wikipedia.org/wiki/International_Components_for_Unicode

  "ICU provides the following services: Unicode text handling, full
   character properties, ... Language sensitive collation and
   searching; normalization, upper and lowercase conversion, ..."

but I never used it directly.

> Google found at [3] a mention of a project to add normalization
> checking to Xerces, but I could not find any definitive evidence
> that such a project was completed.
>
> At [4], the CharMod spec lists some "freely available programming
> resources related to normalization".
>
> [3] http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
> [4] http://www.w3.org/TR/charmod-norm/#sec-n11n-resources
>

  Indeed, ICU is listed there.

Daniel

> >
> >2. This element:
> >
> ><comment>&#x338;</comment>
> >
> >is not fully normalized, right? (Since the content of the <comment>
> >element begins with a combining character and "content" is defined
> >to be a "relevant construct.") Note: hex 338 is the combining solidus
> >overlay character.
>
> That element is fully normalized--see below.
>
> >
> >3. Section 2.13 of the XML 1.1 specification says:
> >
> >XML applications that create XML 1.1 output from either XML 1.1 or
> >XML 1.0 input SHOULD ensure that the output is fully normalized
> >
> >What should an XML application output, given this
> >non-fully-normalized input:
> >
> ><comment>&#x338;</comment>
> >
> >How does an XML application "ensure that the output is fully normalized"?
>
> An application that produces
>
> <comment>&#x338;</comment>
>
> has produced fully normalized output. There's nothing that isn't
> Unicode normalized about that sequence of 27 characters.
>
> An application that produced
>
> <comment>X</comment>
>
> where "X" is a single U0338 character would not be producing
> normalized output.
>
> Note that the above quote from section 2.13 of XML 1.1 is talking
> about applications that create XML. In your question, you are
> asking what an application (that presumably will output XML) should
> do when given (presumably XML) input that is not fully normalized.
> So the application that produced the original non-normalized XML
> did something it "shouldn't" have done, and your question is what
> "should" the downstream application do about that.
>
> No XML specification says anything about that, so the downstream
> application is free to do as it wishes. This is just like an XML
> editor that may adjust white space within character data or emit
> double quotes around attribute values where the input may have had
> single quotes, etc.
>
> >
> >4. If the combining solidus overlay character follows a greater-than
> >character in element content:
> >
> ><comment> >&#x338; </comment>
> >
> >then normalizing XML applications will combine them to create the
> >not-greater-than character:
> >
> ><comment> ≯ </comment>
>
> As mentioned above, the input you show is normalized, so there
> are really two questions here:
>
> 4a. What should an application do with:
>
> <comment> >&#x338; </comment>
>
> 4b. What should an application do with:
>
> <comment> >X </comment>
>
> where X is the single U0338 character.
>
> 4a isn't a normalization issue; 4b is. But as discussed under 3
> above, an application given either such input is free to do anything
> reasonable with either of those inputs.
>
> Given 4a, we have found XML applications (e.g., Saxon) that produce:
> <comment> >/ </comment>
> as well as those (e.g., MarkLogic, Arbortext Editor) that produce:
> <comment> ≯ </comment>
>
> Similarly, given text input of "e´", some XML editors
> write out é while others leave it as is. (Arbortext Editor has
> an option setting to get either behavior.)
>
> All such behaviors are allowable.
>
> >
> >However, if the combining solidus overlay character follows a
> >greater-than
> >character that is part of a start-tag:
> >
> ><comment>&#x338;</comment>
> >
> >then normalizing XML applications do not combine them:
> >
> ><comment>/</comment>
> >
> >There must be some W3C document which says, "The long solidus combining
> >character shall not combine with the '>' in a start tag but it shall
> >combine with the '>' if it is located elsewhere."
>
> Again, there are two questions:
>
> 4c. What should an application do with:
>
> <comment>&#x338;</comment>
>
> 4d. What should an application do with:
>
> <comment>X</comment>
>
> where X is the single U0338 character.
>
> In the 4c case as you show above, there is no normalization issue.
> Recognizing markup boundaries takes place before--or, at the very
> latest, at the same time as--entity expansion. So there is no
> ">" in front of the &#x338; when the entity is expanded.
>
> In the 4d case, there is a normalization issue. But an XML
> processor MUST NOT normalize its input, so when an XML processor
> is handed 4d as input, it will recognize markup boundaries as
> usual so that the comment element will end up with character
> data content consisting of the single U0338 character which
> will have nothing with which to combine.
>
> Note the previous paragraph talked about XML processors. In
> theory, an XML application could have a lexicographic layer--that
> preceded the parsing by the XML processor--in which normalization
> was done. In this case, the U0338 character would presumably be
> combined with the > resulting in
>
> <comment≯</comment>
>
> which would not be well-formed XML and would therefore
> presumably be rejected by the XML processor. While there is no
> W3C specification that forbids such behavior by an XML application,
> one would expect users of such an application to file bug reports
> or stop using such an application.
>

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
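
The 4c and 4d cases, and the hypothetical pre-normalizing layer, can be
reproduced with a short Python sketch. This is only an illustration, not
something any of the tools mentioned in the thread does: it uses the
standard library's expat-based xml.etree.ElementTree as a stand-in for
"an XML processor", and the helper names parse_text and parses are made
up for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

COMBINING_SOLIDUS = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def parse_text(xml_source: str) -> str:
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_source).text


def parses(xml_source: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


# Case 4c: the combining character is written as a character reference.
with_reference = "<comment>&#x338;</comment>"

# Case 4d: the combining character appears literally after the ">".
with_literal = "<comment>" + COMBINING_SOLIDUS + "</comment>"

# In both cases the parser finds the end of the start-tag first, so the
# element content is the lone combining character.
content_4c = parse_text(with_reference)
content_4d = parse_text(with_literal)

# A hypothetical layer that applies NFC to the raw text before parsing
# fuses ">" and U+0338 into U+226F (NOT GREATER-THAN), which destroys
# the start-tag delimiter.
pre_normalized = unicodedata.normalize("NFC", with_literal)

if __name__ == "__main__":
    print(repr(content_4c), repr(content_4d))
    print(repr(pre_normalized), "parses:", parses(pre_normalized))
```

The parser used here does not normalize its input, which matches the
MUST NOT quoted above; the failure only appears once normalization is
applied to the raw text ahead of parsing.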
Received on Tuesday, 12 March 2013 13:54:22 UTC
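
As a rough illustration of the kind of check asked about in question 1,
here is a minimal Python sketch using only the standard unicodedata
module. It is an approximation under stated assumptions, not a
conformance checker: it tests that a string is in NFC and does not begin
with a character of non-zero canonical combining class, whereas the
Character Model's notion of a composing character is broader. It looks
at a single piece of already-parsed text and does not walk the relevant
constructs of a whole document. The function names are made up for the
example.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def starts_with_combining(text: str) -> bool:
    """True if the first character has a non-zero combining class.

    Approximation: charmod's "composing character" covers more than
    characters with a non-zero canonical combining class.
    """
    return bool(text) and unicodedata.combining(text[0]) != 0


def looks_fully_normalized(text: str) -> bool:
    """Approximate check for one construct, e.g. element content."""
    return is_nfc(text) and not starts_with_combining(text)


if __name__ == "__main__":
    samples = {
        "lone U+0338": "\u0338",
        "'>' followed by U+0338": ">\u0338",
        "precomposed U+226F": "\u226f",
        "'e' followed by U+0301": "e\u0301",
        "precomposed U+00E9": "\u00e9",
    }
    for label, value in samples.items():
        print(label, looks_fully_normalized(value))
```

Note that a lone U+0338 is already in NFC; it fails this check only
because it starts with a combining character, which is the situation in
case 4d once the content has been parsed.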