- From: Norman Walsh <ndw@nwalsh.com>
- Date: Mon, 18 Feb 2013 12:08:34 -0600
- To: public-xml-core-wg@w3.org
- Message-ID: <m2sj4txzod.fsf@nwalsh.com>
Paul Grosso <paul@paulgrosso.name> writes:
> -------- Original Message --------
> Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: xml-editor@w3.org
> Date: Sat, 16 Feb 2013 22:56:36 +0000
> From: Costello, Roger L. <costello@mitre.org>
> To: xml-editor@w3.org <xml-editor@w3.org>
>
> Hi Folks,
>
> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?

Not that I'm aware of. But if such a tool exists for "text files", it
applies equally to XML documents, I think.

> 2. This element:
>
> <comment>&#x338;</comment>
>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining
> solidus overlay character.

The content of that element is a single text node containing a single
character that happens to be the combining solidus overlay character.

I suppose you could ask "is the string content of that text node
Unicode normalized?" but I don't think XML cares.

> 3. Section 2.13 of the XML 1.1 specification says:
>
> XML applications that create XML 1.1 output from either XML 1.1 or
> XML 1.0 input SHOULD ensure that the output is fully normalized
>
> What should an XML application output, given this non-fully-normalized input:
>
> <comment>&#x338;</comment>
>
> How does an XML application "ensure that the output is fully normalized"?

I think a processor that outputs

  <comment>&#x338;</comment>

*has* produced fully normalized output. There's nothing that isn't
Unicode normalized about that sequence of 27 characters.

A processor that produced

  <comment>X</comment>

where "X" is a single U+0338 character would not be producing
normalized output, IMHO.

> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
>
> <comment> &gt;&#x338; </comment>
>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
>
> <comment> ≯ </comment>

That is an interesting question. Do we define "normalizing XML
applications"? Where?

Saxon and rxp produce

  <comment> &gt;̸ </comment>

(where the combining slash applies to the semicolon on my display),
but MarkLogic produces

  <comment> ≯ </comment>

I don't know if both of those are conformant or not. The only
reference I see to Unicode normalization in the XML Rec is in
non-normative Appendix J.

This stylesheet demonstrates that the problem is at the data model
level:

  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:xs="http://www.w3.org/2001/XMLSchema"
                  exclude-result-prefixes="xs"
                  version="2.0">

  <xsl:output method="xml" encoding="utf-8" indent="no"
              omit-xml-declaration="yes"/>

  <xsl:variable name="comment" as="element()">
    <comment> &gt;&#x338; </comment>
  </xsl:variable>

  <xsl:template match="/">
    <doc>
      <xsl:value-of select="string-length($comment/node())"/>
    </doc>
  </xsl:template>

  </xsl:stylesheet>

Saxon produces <doc>4</doc>; MarkLogic produces <doc>3</doc>. That is,
Saxon's text node holds space, ">", U+0338, space, while MarkLogic's
holds space, U+226F, space.

> However, if the combining solidus overlay character follows a
> greater-than character that is part of a start-tag:
>
> <comment>&#x338;</comment>
>
> then normalizing XML applications do not combine them:
>
> <comment>/</comment>
>
> There must be some W3C document which says, "The long solidus
> combining character shall not combine with the '>' in a start tag
> but it shall combine with the '>' if it is located elsewhere."

We don't need to say that. Recognizing markup boundaries takes place
before or, at the very latest, at the same time as entity expansion.
There *is no* ">" in front of the &#x338; character when the entity is
expanded.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 512 761 6676
www.marklogic.com
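For anyone who wants to poke at the two cases discussed in this thread, here is a minimal sketch in Python, using only the standard library. It is not the tool Roger asked about, and the helper names (is_nfc, starts_with_combining, looks_fully_normalized) are made up for illustration. It also approximates "composing character" by "nonzero canonical combining class", which is an assumption and not the exact definition. Python's bundled parser does no Unicode normalization of its own, so it reports the un-normalized text node length here.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_nfc(text):
    """True if the string is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def starts_with_combining(text):
    """Approximation (assumption): treat any character with a nonzero
    canonical combining class as a 'composing character'."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def looks_fully_normalized(text):
    """Hypothetical check for one text node: NFC and not starting
    with a combining character."""
    return is_nfc(text) and not starts_with_combining(text)


# Case 1: the character reference is the whole element content.
# The ">" of the start tag is markup, so it is never part of the text.
lone = ET.fromstring("<comment>&#x338;</comment>").text

# Case 2: a real ">" character (written as &gt;) precedes U+0338.
after_gt = ET.fromstring("<comment> &gt;&#x338; </comment>").text

# Applying NFC composes ">" + U+0338 into U+226F.
composed = unicodedata.normalize("NFC", after_gt)

print(len(lone), is_nfc(lone), looks_fully_normalized(lone))
print(len(after_gt), is_nfc(after_gt))
print(len(composed), composed == " \u226f ")
```

In the first case the text node is the single character U+0338: already in NFC, but it starts with a combining character. In the second case the text node has four characters before NFC and three after.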
Received on Monday, 18 February 2013 18:09:05 UTC