Subject: Please clarify how to handle "control characters"

From: Tolkin, Steve <Steve.Tolkin@FMR.COM>
Date: Wed, 7 Feb 2001 12:54:00 -0500
To: "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Cc: w3c-xml-query-wg@w3.org
Message-ID: <4EDD23A3F6B4D411B7DF00A0C9DD5B560AE90F@MSGBOS626NTS.fmr.com>

Certain Unicode "characters" have the same code point values
as the ASCII control characters.

For the purposes of this email I use the term "control character" 
to mean certain special code points in Unicode.  
Examples of "control characters" are U+0000 to U+001F inclusive, 
except U+0009, U+000A, and U+000D.
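
As a minimal sketch, the working definition above could be written as a
predicate. The function name is_control_character is hypothetical, and the
range covers only the examples listed in this message, not every code point
that Unicode classifies as a control.

```python
# Code points excluded from the working definition: tab, line feed,
# carriage return.
EXCLUDED = {0x09, 0x0A, 0x0D}


def is_control_character(ch: str) -> bool:
    """Return True if ch falls in U+0000..U+001F, minus U+0009/U+000A/U+000D.

    This mirrors the examples given in this message only; it is an
    assumption, not an official Unicode or W3C definition.
    """
    cp = ord(ch)
    return cp <= 0x1F and cp not in EXCLUDED
```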

Please clarify the proper way to handle these, e.g. with respect to 
string normalization.

Specifically, in Character Model for the World Wide Web 1.0
         W3C Working Draft 26 January 2001
         This version: http://www.w3.org/TR/2001/WD-charmod-20010126
         Latest version: http://www.w3.org/TR/charmod/
section 3.5 states:
The specification MUST NOT arbitrarily restrict the range of characters
that can be used, which must cover all Unicode code points from 0 to
0x10FFFF inclusive.

In contrast,
Extensible Markup Language (XML) 1.0 (Second Edition)
         W3C Recommendation 6 October 2000
         This version: http://www.w3.org/TR/2000/REC-xml-20001006
         Latest version: http://www.w3.org/TR/REC-xml 
section 2.2 states:
Consequently, XML processors must accept any character in the range
specified for Char. ...
Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */

The above implies that a processor may reject a document containing
a control character.  In fact some processors do that, raising an error
that the document is not well formed.
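
To make the mismatch concrete, here is a rough sketch of the Char production
quoted above as a check on individual code points. The names is_xml_char and
first_non_char are hypothetical helpers, not part of any XML library; a real
processor performs this check internally while parsing.

```python
from typing import Optional


def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production quoted above."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_non_char(text: str) -> Optional[int]:
    """Return the index of the first code point outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None
```

Under this sketch, a string containing U+0001 is flagged even though the
Character Model text quoted earlier asks specifications to cover every code
point from 0 to 0x10FFFF.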


The character model should be clear about how the "control
characters" behave with respect to string normalization.
Must they be left alone?
May they be deleted by a conforming processor?
Or should each one be replaced by a space, and further normalized?
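
The three candidate behaviours can be sketched side by side. All function
names here are hypothetical, and "further normalized" is interpreted, purely
as an assumption, to mean collapsing runs of spaces; the message does not say
which normalization is intended.

```python
import re

# U+0000..U+001F except U+0009, U+000A, U+000D (the examples in this message).
_CONTROL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def leave_alone(text: str) -> str:
    """Option 1: pass control characters through unchanged."""
    return text


def delete_controls(text: str) -> str:
    """Option 2: remove each control character."""
    return _CONTROL_RE.sub("", text)


def replace_with_space(text: str) -> str:
    """Option 3: replace each control character with a space, then
    collapse runs of spaces (an assumed reading of "further normalized")."""
    spaced = _CONTROL_RE.sub(" ", text)
    return re.sub(r" +", " ", spaced)
```

Note that the options are not interchangeable: deleting U+0001 from "a", U+0001,
"b" joins the neighbours into "ab", while replacing it keeps them apart as
"a b".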

Or perhaps the Character Model specification should explicitly 
state that this decision is in the scope of the application.

 
Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V10D    Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.

Received on Wednesday, 7 February 2001 13:08:26 UTC