- From: Paul Grosso <paul@paulgrosso.name>
- Date: Sat, 02 May 2015 08:32:34 -0500
- To: core <public-xml-core-wg@w3.org>
- Message-ID: <5544D1F2.2070406@paulgrosso.name>
FYI.

-------- Forwarded Message --------
Subject: Discrepancy between XML1.1 and Character Model specifications
Resent-Date: Sat, 02 May 2015 01:30:30 +0000
Resent-From: xml-editor@w3.org
Date: Fri, 01 May 2015 18:26:34 -0700
From: Alexey Neyman <stilor@att.net>
To: xml-editor@w3.org, www-i18n-comments@w3.org, www-international@w3.org

Hi,

I have encountered what looks like a discrepancy between the XML 1.1 specification [1] and the XML-based examples in the "Character Model for the World Wide Web 1.0: Normalization" [2]. I noticed that the latest version of [2] has been renamed from "Normalization" to "String Matching and Searching" [3]; that new version does not have the problematic examples section at all (nor does it have the definitions of Unicode-/include-/full normalization). I do not know if that section has been dropped or is still being transferred from [2]. In the latter case, this discrepancy may still be applicable.

The issue is as follows: [2] requires a language specification to define relevant constructs and, if the language has any include mechanism, entity boundaries (C304). The XML 1.1 specification does define relevant constructs, but does not have an explicit definition of what constitutes inclusion. The link for the 'include' term (e.g. from the definition of 'include-normalized' in the normative appendix B) points to section 4.4.2, "Included", which describes entity inclusion. I could not find any language in [1] which suggests that CDATA sections are considered an inclusion mechanism.

The table with XML examples in section 3.3.2, however, assumes that CDATA sections are also considered a language "include" mechanism: the 3rd row from the bottom has the text "suc<![CDATA[,on]]>" (I replaced the cedilla with a regular comma so that it is displayed properly in most email clients) listed as not include-normalized. Reversing the definition in 3.2.3, this means that "the text contains character escapes or includes whose expansion would cause the text to become no longer Unicode-normalized", thus implying that the CDATA section is an 'include'.

I think this needs to be remedied in one of two ways:

- The XML 1.1 specification [1] can be changed to define the term 'include' to apply both to entities replaced with their replacement text and to CDATA section content.
- The above-mentioned example in [2] can be corrected to describe that string as "Unicode-normalized, include-normalized, NOT fully normalized".

I think the first approach would be more appropriate, given that the XML Information Set specification [4] treats character information items equally, regardless of whence they came, be it from a CharData production, a character/entity reference, or a CDATA section.

Regards,
Alexey.

[1] http://www.w3.org/TR/2006/REC-xml11-20060816/
[2] http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
[3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
[4] http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character
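
The table row in question can be reproduced with a minimal sketch. It rests on several assumptions that are not in the message above: NFC is taken as the Unicode normalization form, Python's standard-library parser (an XML 1.0 parser) stands in for an XML processor only to show how CDATA content merges with adjacent character data, the wrapper element name `r` is hypothetical, and a nonzero canonical combining class is used as a rough proxy for the Character Model's notion of a "composing character".

```python
import unicodedata
import xml.etree.ElementTree as ET

CEDILLA = "\u0327"  # COMBINING CEDILLA, the character shown as "," in the message


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# The example row, wrapped in a hypothetical root element <r> so it parses.
source = f"<r>suc<![CDATA[{CEDILLA}on]]></r>"

# Character data as the application sees it once the CDATA markup is gone.
parsed = ET.fromstring(source).text

# The raw markup: the cedilla follows "[", with which it cannot compose.
source_is_nfc = is_nfc(source)

# The parsed text: "c" is now directly followed by the cedilla.
parsed_is_nfc = is_nfc(parsed)

# Rough proxy (an assumption): the CDATA content starts with a combining mark.
cdata_starts_with_combining = unicodedata.combining(CEDILLA) != 0

print("source is NFC:", source_is_nfc)
print("parsed text is NFC:", parsed_is_nfc)
print("CDATA content starts with a combining mark:", cdata_starts_with_combining)
```

Under these assumptions the raw markup is Unicode-normalized while the character data seen after parsing is not, which is the situation the two proposed remedies classify differently.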
Received on Saturday, 2 May 2015 13:33:10 UTC