- From: Paul Grosso <paul@paulgrosso.name>
- Date: Thu, 21 Mar 2013 09:17:27 -0500
- To: "Costello, Roger L." <costello@mitre.org>, "xml-editor@w3.org" <xml-editor@w3.org>
> -------- Original Message --------
> Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: xml-editor@w3.org
> Date: Sat, 16 Feb 2013 22:56:36 +0000
> From: Costello, Roger L. <costello@mitre.org>
> To: xml-editor@w3.org <xml-editor@w3.org>
>
> Hi Folks,

Hi Roger,

By way of generalities:

* As you know, the Character Model spec [1] defines and discusses
  fully-normalized text.

* The XML specifications mostly define what XML *processors* should and
  must do, and only occasionally suggest what XML *applications* should
  (but never must) do. I've tried to use these terms precisely in this
  response.

* XML 1.0 doesn't say anything about such normalization (the use of the
  word "normalization" in XML 1.0 is related to attribute-value
  normalization, which has nothing to do with Unicode normalization).

* XML 1.1 says [2] that the relevant constructs of all XML input should
  be fully normalized, and it lists the relevant constructs as those
  constructs in an XML document containing character data plus the
  constructs containing Names and Nmtokens. Note that this implies that
  markup is recognized before the normalization of the character
  content is considered, so things like combining characters do not
  combine with markup characters as far as XML processors are
  concerned.

  It also says that:

      XML processors SHOULD provide a user option to verify that the
      document being processed is in fully normalized form, and report
      to the application whether it is or not.

  The only processor of which we are aware that currently provides such
  a user option is the RXP processor (more detail below).

  Finally, it says that:

      XML processors MUST NOT transform the input to be in fully
      normalized form. XML applications that create XML 1.1 output from
      either XML 1.1 or XML 1.0 input SHOULD ensure that the output is
      fully normalized....
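In code terms, the core of the fully-normalized test is easy to see. The following is only a rough sketch of the idea (the function is mine, not anything defined by the specs), using Python's unicodedata module:

```python
import unicodedata

def roughly_fully_normalized(text: str) -> bool:
    """Rough approximation of charmod's "fully-normalized" test:
    the text is in Unicode Normalization Form C and does not begin
    with a combining character.  (The real definition has more
    detail; this is only an illustration.)"""
    if text and unicodedata.combining(text[0]):
        return False
    return unicodedata.normalize("NFC", text) == text

# ">" followed by U+0338 (COMBINING LONG SOLIDUS OVERLAY) is not in
# NFC: NFC composes the pair into the single character U+226F.
print(roughly_fully_normalized(">\u0338"))   # False
print(roughly_fully_normalized("abc"))       # True
```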
[1] http://www.w3.org/TR/charmod-norm/#sec-FullyNormalized
[2] http://www.w3.org/TR/xml11/#sec-normalization-checking

> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?

The RXP processor [3] (Unix man page at [4]) can optionally check
whether an XML 1.1 document is fully normalized. It has a -U flag that
controls Unicode normalization checking, but this flag is only relevant
when parsing XML 1.1 documents. If it is 0, no checking is done. If it
is 1, RXP checks that the document is fully normalized as defined by
the W3C character model. If it is 2, the document is checked, and any
unknown characters (which may be ones corresponding to a newer version
of Unicode than RXP knows about) will also cause an error.

Google found at [5] a mention of a project to add normalization
checking to Xerces, but I could not find any definitive evidence that
such a project was completed.

At [6], the CharMod spec lists some "freely available programming
resources related to normalization".

[3] http://www.cogsci.ed.ac.uk/~richard/rxp.html
[4] http://www.cogsci.ed.ac.uk/~richard/rxp.txt
[5] http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
[6] http://www.w3.org/TR/charmod-norm/#sec-n11n-resources

> 2. This element:
>
>     <comment>&#x0338;</comment>
>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining solidus
> overlay character.

That element is fully normalized--see below.

> 3. Section 2.13 of the XML 1.1 specification says:
>
>     XML applications that create XML 1.1 output from either XML 1.1
>     or XML 1.0 input SHOULD ensure that the output is fully
>     normalized
>
> What should an XML application output, given this non-fully-normalized
> input:
>
>     <comment>&#x0338;</comment>
>
> How does an XML application "ensure that the output is fully normalized"?
An application that produces

    <comment>&#x0338;</comment>

has produced fully normalized output. There's nothing that isn't
Unicode normalized about that sequence of 27 characters. An application
that produced

    <comment>X</comment>

where "X" is a single U+0338 character would not be producing
normalized output.

Note that the above quote from section 2.13 of XML 1.1 is talking about
applications that create XML. In your question, you are asking what an
application (that presumably will output XML) should do when given
(presumably XML) input that is not fully normalized. So the application
that produced the original non-normalized XML did something it
"shouldn't" have done, and your question is what "should" the
downstream application do about that. No XML specification says
anything about that, so the downstream application is free to do as it
wishes. This is just like an XML editor that may adjust white space
within character data or emit double quotes around attribute values
where the input may have had single quotes, etc.

> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
>
>     <comment> >&#x0338; </comment>
>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
>
>     <comment> ≯ </comment>

As mentioned above, the input you show is normalized, so there are
really two questions here:

4a. What should an application do with:

    <comment> >&#x0338; </comment>

4b. What should an application do with:

    <comment> >X </comment>

where X is the single U+0338 character.

4a isn't a normalization issue; 4b is. But as discussed under 3 above,
an application given either such input is free to do anything
reasonable with either of those inputs.
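To make the distinction concrete in codepoint terms, here is a quick illustration (a sketch in Python's standard library; none of this comes from any spec or tool mentioned in this thread):

```python
import unicodedata
import xml.etree.ElementTree as ET

# The element from questions 2 and 3 is 27 plain ASCII characters;
# the combining character exists only as a character reference in
# the source, so the relevant construct is already normalized.
doc = "<comment>&#x0338;</comment>"
assert len(doc) == 27

# After parsing, the reference has been expanded: the element's
# content is the single raw U+0338 character.
elem = ET.fromstring(doc)
assert elem.text == "\u0338"

# 4a's content, after the character reference is expanded, is
# space, ">", U+0338, space -- two codepoints between the spaces.
content = " >\u0338 "

# NFC composes ">" + U+0338 into the single character U+226F
# (NOT GREATER-THAN); a serializer that normalizes writes that one
# character, while one that doesn't writes the two codepoints as-is.
assert unicodedata.normalize("NFC", content) == " \u226f "
```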
Given 4a, we have found XML applications (e.g., Saxon) that produce:

    <comment> >X </comment>

(that is, the ">" followed by a raw U+0338 character), as well as those
(e.g., MarkLogic, Arbortext Editor) that produce:

    <comment> ≯ </comment>

Similarly, given text input of "e´", some XML editors write out é while
others leave it as e´ (an e followed by the individual acute
character). (Arbortext Editor has an option setting to get either
behavior.) All such behaviors are allowable.

> However, if the combining solidus overlay character follows a
> greater-than character that is part of a start-tag:
>
>     <comment>&#x0338;</comment>
>
> then normalizing XML applications do not combine them:
>
>     <comment>X</comment>
>
> (where X is the raw U+0338 character)
>
> There must be some W3C document which says, "The long solidus
> combining character shall not combine with the '>' in a start tag but
> it shall combine with the '>' if it is located elsewhere."

Again, there are two questions:

4c. What should an application do with:

    <comment>&#x0338;</comment>

4d. What should an application do with:

    <comment>X</comment>

where X is the single U+0338 character.

In the 4c case as you show above, there is no normalization issue.
Recognizing markup boundaries takes place before--or, at the very
latest, at the same time as--entity expansion. So there is no ">" in
front of the U+0338 when the entity is expanded.

In the 4d case, there is a normalization issue. But an XML processor
MUST NOT normalize its input, so when an XML processor is handed 4d as
input, it will recognize markup boundaries as usual, and the comment
element will end up with character data content consisting of the
single U+0338 character, which will have nothing with which to combine.

Note that the previous paragraph talked about XML processors. In
theory, an XML application could have a lexical layer--preceding the
parsing by the XML processor--in which normalization was done.
In this case, the U+0338 character would presumably be combined with
the ">", resulting in

    <comment≯</comment>

which would not be well-formed XML and would therefore presumably be
rejected by the XML processor. While there is no W3C specification that
forbids such behavior by an XML application, one would expect users of
such an application to file bug reports or to stop using it.
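That expectation is easy to confirm: feed the hypothetically "combined" document to an ordinary parser and it is rejected, because U+226F is not a valid name character. Illustrated here with Python's ElementTree (an assumption of convenience; any conforming parser should behave the same way):

```python
import xml.etree.ElementTree as ET

# If a pre-parse normalization layer composed the start-tag's ">"
# with the following U+0338, the tag would contain U+226F and the
# document would no longer be well-formed.
bad = "<comment\u226f</comment>"
try:
    ET.fromstring(bad)
    print("parsed (unexpected)")
except ET.ParseError:
    print("rejected as not well-formed")
```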
Received on Thursday, 21 March 2013 14:17:55 UTC