Name characters and normalization from MURATA Makoto on 2002-01-24 (www-xml-blueberry-comments@w3.org from January 2002)

From: MURATA Makoto <mmurata@trl.ibm.co.jp>
Date: Thu, 24 Jan 2002 16:02:42 +0900 (JST)
To: www-xml-blueberry-comments@w3.org
cc: mmurata@trl.ibm.co.jp
Message-Id: <20020124.160242.42938481.mmurata@trl.ibm.com>

In the current draft of XML 1.1, many characters affected by character
normalization are allowed as name characters.  I find that
normalization applied to XML documents may lead to strange results,
when such characters are used as tag names, entity names, or
identifiers.  I do not know if this is a problem.  I am just reporting
what could happen.

Hereafter, I use \uXXXX to represent the Unicode character 
at code point XXXX.  The following examples can be converted 
to UTF-8 by using

	native2ascii -reverse -encoding utf-8
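
Where native2ascii is not at hand, the same conversion can be approximated
in a few lines of Python.  This is only a sketch: the helper name
decode_escapes is hypothetical, and it handles only four-digit \uXXXX
escapes, not surrogate pairs or other native2ascii features.

```python
import re

def decode_escapes(text: str) -> str:
    """Replace each backslash-u escape (four hex digits) with its character."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# The start tag from Example 1, written in the escaped notation.
sample = r'<test \u00c1="" \u0041\u0301=""/>'
decoded = decode_escapes(sample)

# Bytes as they would appear in a UTF-8 encoded file.
utf8_bytes = decoded.encode("utf-8")
```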

Case 1: well-formed -> non-well-formed

First, some well-formed documents become non-well-formed.

Example 1: 

<?xml version="1.0"?>
<test \u00c1="" \u0041\u0301=""/>

Well, this example may be too artificial.  But the next one 
is more realistic.

Example 2:

<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "welltoill2.dtd">
<test \u00c1=""/>

where welltoill2.dtd is:

<!ATTLIST test \u0041\u0301 CDATA "default">
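
The effect in Example 1 can be checked mechanically.  The sketch below
assumes that "normalization" means Unicode NFC and uses Python's standard
unicodedata and xml.etree.ElementTree modules (a non-validating XML 1.0
parser); the helper is_well_formed is a hypothetical name.

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Example 1: two attribute names that differ only in normalization form.
original = '<?xml version="1.0"?>\n<test \u00c1="" \u0041\u0301=""/>'

# NFC composes U+0041 U+0301 into U+00C1, so the names become identical.
normalized = unicodedata.normalize("NFC", original)

before = is_well_formed(original)    # two distinct attribute names
after = is_well_formed(normalized)   # the same attribute name twice
```

Under these assumptions the document is accepted before normalization and
rejected afterwards, because an attribute name may not appear twice in
one start tag.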

Case 2: non-well-formed -> well-formed

Next, some non-well-formed documents become well-formed documents.

Example 3:

<?xml version="1.0"?>
<\u00c1></\u0041\u0301>

Example 4:

<?xml version="1.0"?>
<!DOCTYPE t [
<!ENTITY \u00c1 "">
]>
<t>
&\u0041\u0301;
</t>

Both examples are artificial.  But if the entity declaration is 
moved to an external DTD subset, Example 4 becomes more realistic.

Case 3: valid -> invalid

Next, some valid documents become invalid.

Example 5:

<?xml version="1.0"?>
<!DOCTYPE test [
<!ELEMENT test (p*)>
<!ELEMENT p EMPTY>
<!ATTLIST p id ID #REQUIRED>
]>
<test>
  <p id="\u00c1"/>
  <p id="\u0041\u0301"/>
</test>

Again, this example is artificial.  But if we create an external 
parsed entity containing the second <p> only, it becomes more 
realistic.
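
The ID collision in Example 5 does not need a validating parser to
demonstrate.  The sketch below, again assuming NFC, only models the
uniqueness constraint on ID values; duplicate_ids is a hypothetical
helper, not part of any XML library.

```python
import unicodedata

def duplicate_ids(ids):
    """Return the set of ID values that occur more than once."""
    seen, dupes = set(), set()
    for value in ids:
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

# The two id attribute values from Example 5.
ids = ["\u00c1", "\u0041\u0301"]

# As written, the values are different strings, so the IDs are unique.
before = duplicate_ids(ids)

# After NFC both values are U+00C1, violating the ID uniqueness constraint.
after = duplicate_ids([unicodedata.normalize("NFC", v) for v in ids])
```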

Case 4: invalid -> valid

Next, some invalid documents become valid.

Example 6:

<?xml version="1.0"?>
<!DOCTYPE \u00c1 [
<!ELEMENT \u00c1 EMPTY>
]>
<\u0041\u0301/>

Again, the DTD can be made external.

Example 7:

<?xml version="1.0"?>
<!DOCTYPE t SYSTEM "invalidtovalid2.dtd">
<t>
&\u0041\u0301;
</t>

where invalidtovalid2.dtd is:

<!ENTITY \u00c1 "">


Case 5: well-formed -> well-formed with a different infoset.

Finally, some well-formed documents remain well-formed, 
but the infoset becomes different.

Example 8:

<?xml version="1.0"?>
<!DOCTYPE t [
<!ENTITY \u00c1          "first">
<!ENTITY \u0041\u0301    "second">
]>
<t>
&\u0041\u0301;
</t>
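
The change in Example 8 can be observed with a non-validating parser.
The sketch below assumes NFC and uses Python's xml.etree.ElementTree,
which expands internal entities; content_of is a hypothetical helper.
It relies on the XML rule that when an entity is declared more than
once, the first declaration is binding.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Example 8: two entities whose names differ only in normalization form.
original = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE t [\n'
    '<!ENTITY \u00c1          "first">\n'
    '<!ENTITY \u0041\u0301    "second">\n'
    ']>\n'
    '<t>\n'
    '&\u0041\u0301;\n'
    '</t>'
)
normalized = unicodedata.normalize("NFC", original)

def content_of(doc: str) -> str:
    """Parse doc and return the root element's text without whitespace."""
    return ET.fromstring(doc).text.strip()

before = content_of(original)    # reference resolves to the second entity
after = content_of(normalized)   # both declarations now name U+00C1
```

Before normalization the reference picks up "second"; afterwards the two
declarations share one name, the first one wins, and the content of <t>
becomes "first".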


I know that many W3C specs assume early normalization.  One could thus
argue that we do not have to worry about the examples shown above.
However, it is not hard to imagine that some (many?) XML documents are
not normalized.  I am just worried (but do not know) whether the name
characters of XML 1.1 make the situation worse.

Cheers,

Makoto

Received on Thursday, 24 January 2002 02:07:46 UTC