Discrepancy between XML1.1 and Character Model specifications from Alexey Neyman on 2015-05-02 (xml-editor@w3.org from April to June 2015)

From: Alexey Neyman <stilor@att.net>
Date: Fri, 01 May 2015 18:26:34 -0700
To: xml-editor@w3.org, www-i18n-comments@w3.org, www-international@w3.org
Message-ID: <554427CA.6040808@att.net>

Hi,

I have encountered what looks like a discrepancy between the XML 1.1 
specification [1] and the XML-based examples in the "Character Model for 
the World Wide Web 1.0: Normalization" [2]. I noticed that the latest 
version of [2] has been renamed from "Normalization" to "String Matching 
and Searching" [3]; that new version does not have the problematic 
examples section at all (nor does it have the definitions of 
Unicode-/include-/full normalization). I do not know if that section has 
been dropped or is still being transferred from [2]. In the latter case, 
this discrepancy may still be applicable.

The issue is as follows: [2] requires a language specification to define 
relevant constructs and, if the language has any include mechanism, 
entity boundaries (C304). The XML 1.1 specification does define relevant 
constructs, but does not have an explicit definition of what constitutes 
inclusion. The link for the 'include' term (e.g. from the definition of 
'include-normalized' in the normative appendix B) points to section 
4.4.2, "Included", which describes entity inclusion. I could not 
find any language in [1] which suggests that CDATA sections are 
considered an inclusion mechanism.

The table with XML examples in section 3.3.2 of [2], however, assumes CDATA 
sections are also considered a language "include" mechanism: the 3rd row 
from the bottom has the text "suc<![CDATA[,on]]>" (I replaced cedilla 
with a regular comma so that it's displayed properly in most email 
clients) listed as not include-normalized - which means, reversing the 
definition in 3.2.3, that "the text contains character escapes or 
includes whose expansion would cause the text to become no longer 
Unicode-normalized", thus implying that the CDATA section is an 'include'.
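
To make the example concrete, here is a minimal Python sketch that checks 
NFC normalization of the pieces involved. It assumes the original table 
entry uses U+0327 COMBINING CEDILLA where the comma appears above, and it 
uses plain NFC as a stand-in for "Unicode-normalized"; the helper name 
is_nfc is illustrative only and does not come from either specification.

```python
import unicodedata

def is_nfc(s):
    """Return True if s is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

outside = "suc"        # character data before the CDATA section
inside = "\u0327on"    # assumed CDATA content: COMBINING CEDILLA + "on"
markup = "suc<![CDATA[\u0327on]]>"  # the string as written in the document

# The raw markup is NFC: the cedilla follows "[" and cannot compose with it.
markup_ok = is_nfc(markup)

# Each piece taken separately is NFC as well.
parts_ok = is_nfc(outside) and is_nfc(inside)

# Once the CDATA delimiters are removed, "c" and the cedilla become adjacent
# and NFC would compose them into U+00E7, so the result is no longer NFC.
joined = outside + inside
joined_ok = is_nfc(joined)
composed = unicodedata.normalize("NFC", joined)
```

Under these assumptions, the string stops being normalized only after the 
CDATA delimiters are stripped, which is the step that [2] appears to treat 
as an include.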

I think this needs to be remedied in one of two ways:
- The XML 1.1 specification [1] can be changed to define the term 
'include' to apply to both the entities replaced with their replacement 
text and to the CDATA section content.
- The above mentioned example in [2] can be corrected to describe that 
string as "Unicode-normalized, include-normalized, NOT fully normalized".

I think the first approach would be more appropriate, given that the XML 
Information Set specification [4] treats character information items 
equally, regardless of whence they came, be it from a CharData 
production, character/entity reference, or CDATA section.
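
As a rough illustration of that point, the sketch below parses two small 
documents with Python's xml.etree.ElementTree and compares the resulting 
character data. This is an XML 1.0 parser rather than an Infoset 
implementation, and the element name w and the use of U+0327 are 
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Same characters, delivered once through a CDATA section and once
# through a character reference.
via_cdata = "<w>suc<![CDATA[\u0327on]]></w>"
via_charref = "<w>suc&#x327;on</w>"

text_cdata = ET.fromstring(via_cdata).text
text_charref = ET.fromstring(via_charref).text

# The parser reports plain character data in both cases; the origin of
# each character is not visible to the application.
same_text = text_cdata == text_charref
```

In both cases the application sees "suc" followed directly by the 
combining cedilla, with no trace of how the characters were written.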

Regards,
Alexey.

[1] http://www.w3.org/TR/2006/REC-xml11-20060816/
[2] http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
[3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
[4] http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character

Received on Saturday, 2 May 2015 01:30:26 UTC