Re: Please clarify how to handle "control characters" from Martin Duerst on 2001-02-27 (www-i18n-comments@w3.org from February 2001)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 28 Feb 2001 07:42:35 +0900
To: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>, "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Cc: w3c-xml-query-wg@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20010228070406.038acda0@sh.w3.mag.keio.ac.jp>

Hello Steve,

Many thanks for your comments.
We just discussed them in the WG meeting.

At 12:54 01/02/07 -0500, Tolkin, Steve wrote:
>Certain Unicode "characters" have the same hexadecimal value
>as the ASCII control characters.
>
>For the purposes of this email I use the term "control character"
>to mean certain special code points in Unicode.
>Examples of "control characters" are U+0000 to U+001F inclusive,
>except U+0009, U+000A, and U+000D.
>
>Please clarify the proper way to handle these, e.g. with respect to
>string normalization.
>
>Specifically, in Character Model for the World Wide Web 1.0
>          W3C Working Draft 26 January 2001
>          This version: http://www.w3.org/TR/2001/WD-charmod-20010126
>          Latest version: http://www.w3.org/TR/charmod/
>section 3.5 states:
>The specification MUST NOT arbitrarily restrict the range of characters
>that can be used, which must cover all Unicode code points from 0 to
>0x10FFFF inclusive.
>
>In contrast
>Extensible Markup Language (XML) 1.0 (Second Edition)
>          W3C Recommendation 6 October 2000
>          This version: http://www.w3.org/TR/2000/REC-xml-20001006
>          Latest version: http://www.w3.org/TR/REC-xml
>section 2.2 states:
>Consequently, XML processors must accept any character in the range
>specified for Char. ...
>Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
>blocks, FFFE, and FFFF. */
>
>The above implies that a processor may reject a document containing
>a control character.  In fact some processors do that, raising an error
>that the document is not well formed.

Good catch. We'll replace:

The specification MUST NOT arbitrarily restrict the range of characters
that can be used, which must cover all Unicode code points from 0 to
0x10FFFF inclusive.

with something like:

The full range of Unicode code points extends from 0 to 0x10FFFF
inclusive. Specifications SHOULD NOT restrict this range of characters.

The 'SHOULD NOT' means the same as 'MUST NOT arbitrarily'.
We assume that XML didn't forbid the control characters
arbitrarily, but had good reasons for doing so, so it is okay.
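
To make the restriction concrete, here is a minimal sketch in Python of
a check against the Char production quoted above. It is not part of
either specification; the names is_xml_char and first_non_xml_char are
made up for illustration.

```python
# Ranges copied from the Char production in XML 1.0 (Second Edition),
# section 2.2, as quoted earlier in this message.
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def first_non_xml_char(text: str):
    """Return (index, code point) of the first character outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index, ord(ch)
    return None
```

A processor that follows the production would be entitled to reject any
document for which first_non_xml_char returns something other than None,
for example one containing U+0001.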

>The character model should be clear about how the "control
>characters" behave with respect to
>string normalization.  Must they be left alone?
>May they be deleted by a conforming processor?
>Or should each one be replaced by a space, and further normalized?
>
>Or perhaps the Character Model specification should explicitly
>state that this decision is in the scope of the application.

We are not sure that we understand your comment here.
The Character Model refers to UTR #15 for the details
of normalization (for individual characters, you
have to actually go and check the data files, or ask somebody
who thinks s/he remembers). We cannot mention each category
of characters explicitly; there are just too many of them.
For your information, the control characters you mention
are not affected by normalization.
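
One way to check this without reading the data files is to ask an
implementation. The sketch below assumes Python's standard unicodedata
module, which applies the normalization forms using whatever Unicode
version that Python build ships with; the helper name
unchanged_by_normalization is hypothetical.

```python
import unicodedata

# The four normalization forms defined in UTR #15.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unchanged_by_normalization(ch: str) -> bool:
    """Return True if none of the four forms alters the given string."""
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)


# U+0000 to U+001F inclusive, i.e. the C0 control range discussed above.
c0_controls = [chr(cp) for cp in range(0x00, 0x20)]

# Collect any control character that some form would change.
affected = [
    f"U+{ord(ch):04X}"
    for ch in c0_controls
    if not unchanged_by_normalization(ch)
]
print(affected)
```

The list stays empty when every control character passes through all
four forms untouched. For contrast, a character such as U+00C5 is
changed by NFD, so the helper returns False for it.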

Regarding replacement with spaces, XML does some of that
for tabs and line-end-related characters. Although it may
look similar to (Unicode) normalization, it is independent of it
and, as you note, quite application-dependent.
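
As a rough illustration of what XML does here, the following Python
sketch mimics end-of-line handling (XML 1.0, section 2.11) and the
white space step of attribute-value normalization for CDATA attributes
(section 3.3.3). It is a simplification: character and entity
references are not handled, and both function names are made up for
illustration.

```python
import re


def normalize_line_ends(text: str) -> str:
    """Turn each CR LF pair, and each CR not followed by LF, into one LF."""
    return re.sub(r"\r\n?", "\n", text)


def normalize_attribute_whitespace(value: str) -> str:
    """Replace each tab, LF or CR with one space, after line-end handling.

    References (&#x9; and the like) are deliberately ignored in this sketch.
    """
    return re.sub(r"[\t\n\r]", " ", normalize_line_ends(value))
```

Neither step involves Unicode normalization, which is why the two can
be treated separately.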

Do you think we should say somehow that space issues
are something separate?

Regards,    Martin.

Received on Tuesday, 27 February 2001 17:43:38 UTC