Discrepancy between XML1.1 and Character Model specifications from Richard Ishida on 2015-05-06 (www-international@w3.org from April to June 2015)

From: Richard Ishida <ishida@w3.org>
Date: Wed, 06 May 2015 14:14:39 +0100
To: www International <www-international@w3.org>
Message-ID: <554A13BF.80602@w3.org>

-------- Forwarded Message --------
Date: Sat, 02 May 2015 01:30:10 +0000
From: Alexey Neyman <stilor@att.net>
To: xml-editor@w3.org, www-i18n-comments@w3.org, www-international@w3.org

Hi,

I have encountered what looks like a discrepancy between the XML 1.1
specification [1] and the XML-based examples in the "Character Model for
the World Wide Web 1.0: Normalization" [2]. I noticed that the latest
version of [2] has been renamed from "Normalization" to "String Matching
and Searching" [3]; that new version does not have the problematic
examples section at all (nor does it have the definitions of
Unicode-/include-/full normalization). I do not know if that section has
been dropped or is still being transferred from [2]. In the latter case,
this discrepancy may still be applicable.

The issue is as follows: [2] requires a language specification to define
relevant constructs and, if the language has any include mechanism,
entity boundaries (C304). The XML 1.1 specification does define relevant
constructs, but does not have an explicit definition of what constitutes
inclusion. The link for the 'include' term (e.g. from the definition of
'include-normalized' in the normative appendix B) points to section
4.4.2, "Included", which describes entity inclusion. I could not
find any language in [1] which suggests that CDATA sections are
considered an inclusion mechanism.

The table with XML examples in section 3.3.2, however, assumes that CDATA
sections are also considered a language "include" mechanism: the 3rd row
from the bottom has the text "suc<![CDATA[,on]]>" (I replaced the cedilla
with a regular comma so that it's displayed properly in most email
clients) listed as not include-normalized - which means, reversing the
definition in 3.2.3, that "the text contains character escapes or
includes whose expansion would cause the text to become no longer
Unicode-normalized", thus implying that the CDATA section is an 'include'.
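
To make the example concrete, here is a minimal Python sketch. It assumes
the character shown as a comma above is U+0327 COMBINING CEDILLA, wraps the
string in a hypothetical <p> element, and uses Python's standard-library
parser (an XML 1.0 parser, used here only as a stand-in). The variable names
are illustrative. It checks NFC on the source text and on the text after the
CDATA markup is removed.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Assumption: the comma in the email stands for U+0327 COMBINING CEDILLA.
CEDILLA = "\u0327"

# The example string, wrapped in a hypothetical <p> element so it parses.
source = "<p>suc<![CDATA[" + CEDILLA + "on]]></p>"

# In the source, the cedilla follows "[", which has no precomposed form
# with it, so NFC leaves the source text unchanged.
source_is_nfc = unicodedata.normalize("NFC", source) == source

# After parsing, the CDATA delimiters are gone and the cedilla directly
# follows "c"; NFC would compose the pair into U+00E7.
text = ET.fromstring(source).text
text_is_nfc = unicodedata.normalize("NFC", text) == text

print(source_is_nfc, text_is_nfc)
```

Under these assumptions the source passes the NFC check while the parsed
text does not, which is the behaviour the table attributes to an "include".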

I think this needs to be remedied in one of two ways:
- The XML 1.1 specification [1] can be changed to define the term
'include' to apply both to entities replaced with their replacement
text and to CDATA section content.
- The above-mentioned example in [2] can be corrected to describe that
string as "Unicode-normalized, include-normalized, NOT fully normalized".

I think the first approach would be more appropriate, given that the XML
Information Set specification [4] considers character information items
equally, regardless of whence they came, be it from a CharData
production, character/entity reference, or CDATA section.
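
As a rough illustration of that point, the sketch below parses the same
characters supplied three ways and compares the resulting text. It uses
Python's ElementTree as an approximation of an infoset view, which is an
assumption (it is an XML 1.0 parser and not an Infoset implementation);
the <p> wrapper and the U+0327 character are likewise assumed.

```python
import xml.etree.ElementTree as ET

# Assumption: U+0327 COMBINING CEDILLA is the character in question.
variants = {
    "CharData": "<p>suc\u0327on</p>",
    "character reference": "<p>suc&#x327;on</p>",
    "CDATA section": "<p>suc<![CDATA[\u0327on]]></p>",
}

# Parse each document and keep only the element's character content.
parsed = {name: ET.fromstring(doc).text for name, doc in variants.items()}

# All three sources should yield the same sequence of characters.
all_equal = len(set(parsed.values())) == 1
print(all_equal)
```

With this parser, the three forms are indistinguishable once parsed, which
matches the way [4] treats character information items.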

Regards,
Alexey.

[1] http://www.w3.org/TR/2006/REC-xml11-20060816/
[2] http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
[3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
[4] http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character

Received on Wednesday, 6 May 2015 13:14:43 UTC