- From: Paul Grosso <paul@paulgrosso.name>
- Date: Wed, 20 Feb 2013 11:09:39 -0600
- To: public-xml-core-wg@w3.org
Norm et al.,

Something you said below confuses me.

Under 2, you say "The content of that element is a single text node
containing a single character", but then under 3, you say "There's
nothing that isn't Unicode normalized about that sequence [of] 27
characters."

But "27 characters" implies "&#x338;" is 8 characters, not "a single
text node containing a single character".

Can you unravel my confusion?

paul

On 2013-02-18 12:08, Norman Walsh wrote:
> Paul Grosso <paul@paulgrosso.name> writes:
>> -------- Original Message --------
>> Subject: Is there a tool which tells me if my XML is "fully normalized"?
>> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
>> Resent-From: xml-editor@w3.org
>> Date: Sat, 16 Feb 2013 22:56:36 +0000
>> From: Costello, Roger L. <costello@mitre.org>
>> To: xml-editor@w3.org <xml-editor@w3.org>
>>
>> Hi Folks,
>>
>> 1. Is there a tool which evaluates an XML document and returns an
>> indication of whether it is fully normalized or not?
>
> Not that I'm aware of. But if such a tool exists for "text files", it
> applies equally to XML documents, I think.
>
>> 2. This element:
>>
>>   <comment>&#x338;</comment>
>>
>> is not fully normalized, right? (Since the content of the <comment>
>> element begins with a combining character and "content" is defined
>> to be a "relevant construct.") Note: hex 338 is the combining
>> solidus overlay character.
>
> The content of that element is a single text node containing a single
> character that happens to be the combining solidus overlay character.
> I suppose you could ask "is the string content of that text node Unicode
> normalized?" but I don't think XML cares.
>
>> 3. Section 2.13 of the XML 1.1 specification says:
>>
>>   XML applications that create XML 1.1 output from either XML 1.1 or
>>   XML 1.0 input SHOULD ensure that the output is fully normalized
>>
>> What should an XML application output, given this non-fully-normalized input:
>>
>>   <comment>&#x338;</comment>
>>
>> How does an XML application "ensure that the output is fully normalized"?
>
> I think a processor that outputs
>
>   <comment>&#x338;</comment>
>
> *has* produced fully normalized output. There's nothing that isn't Unicode
> normalized about that sequence 27 characters.
>
> A processor that produced
>
>   <comment>X</comment>
>
> where "X" is a single U0338 character would not be producing
> normalized output, IMHO.
>
>> 4. If the combining solidus overlay character follows a greater-than
>> character in element content:
>>
>>   <comment> &gt;&#x338; </comment>
>>
>> then normalizing XML applications will combine them to create the
>> not-greater-than character:
>>
>>   <comment> ≯ </comment>
>
> That is an interesting question. Do we define "normalizing XML
> applications"? Where?
>
> Saxon and rxp produce
>
>   <comment> &gt;̸ </comment>
>
> (where the combining slash applies to the semicolon on my display),
> but MarkLogic produces
>
>   <comment> ≯ </comment>
>
> I don't know if both of those are conformant or not. The only
> reference I see to Unicode normalization in the XML Rec is in
> non-normative Appendix J.
>
> This stylesheet demonstrates that the problem is at the data model level:
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                 exclude-result-prefixes="xs"
>                 version="2.0">
>
> <xsl:output method="xml" encoding="utf-8" indent="no"
>             omit-xml-declaration="yes"/>
>
> <xsl:variable name="comment" as="element()">
>   <comment> &gt;&#x338; </comment>
> </xsl:variable>
>
> <xsl:template match="/">
>   <doc>
>     <xsl:value-of select="string-length($comment/node())"/>
>   </doc>
> </xsl:template>
>
> </xsl:stylesheet>
>
> Saxon produces <doc>4</doc>; MarkLogic <doc>3</doc>.
>
>> However, if the combining solidus overlay character follows a
>> greater-than character that is part of a start-tag:
>>
>>   <comment>&#x338;</comment>
>>
>> then normalizing XML applications do not combine them:
>>
>>   <comment>/</comment>
>>
>> There must be some W3C document which says, "The long solidus
>> combining character shall not combine with the '>' in a start tag
>> but it shall combine with the '>' if it is located elsewhere."
>
> We don't need to say that. Recognizing markup boundaries takes place before
> or, at the very latest, at the same time as entity expansion. There *is no*
> ">" in front of the &#x338; character when the entity is expanded.
>
> Be seeing you,
>   norm
>
> --
> Norman Walsh
> Lead Engineer
> MarkLogic Corporation
> Phone: +1 512 761 6676
> www.marklogic.com
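
The distinction discussed in this thread, between the serialized document (where a character reference is plain ASCII markup) and the parsed text node (where it is one combining character), can be checked with a short sketch. It uses only the Python standard library and assumes the element text is written with the character references `&#x338;` and `&gt;&#x338;`. The helper name `describe` is hypothetical, and the sketch makes no claim about how Saxon, rxp, or MarkLogic work internally; it only shows that NFC composition of the parsed string changes its length from 4 to 3.

```python
import unicodedata
import xml.etree.ElementTree as ET


def describe(xml_source):
    """Parse a single element and report facts about its text content.

    Hypothetical helper for illustration only.
    """
    text = ET.fromstring(xml_source).text
    nfc = unicodedata.normalize("NFC", text)
    return {
        # The serialized form uses character references, so it is pure ASCII.
        "source_is_ascii": xml_source.isascii(),
        # Length of the text node as the parser delivers it.
        "length": len(text),
        # Length after Unicode NFC composition of that text node.
        "nfc_length": len(nfc),
        # A nonzero canonical combining class marks a combining character.
        "starts_with_combining": unicodedata.combining(text[0]) != 0,
    }


# Reference directly after the start tag: the text node is U+0338 alone.
start_tag_case = describe("<comment>&#x338;</comment>")

# Reference after an escaped greater-than sign inside element content.
content_case = describe("<comment> &gt;&#x338; </comment>")

print(start_tag_case)
print(content_case)
```

In the second case the parsed text is space, ">", U+0338, space; NFC composes the middle pair into U+226F, which matches the 4 versus 3 difference reported for the stylesheet above. In the first case the ">" of the start tag is markup and never becomes part of the text node, so there is nothing for U+0338 to compose with.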
Received on Wednesday, 20 February 2013 17:10:21 UTC