- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 28 Feb 2001 07:42:35 +0900
- To: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>, "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
- Cc: w3c-xml-query-wg@w3.org, w3c-i18n-ig@w3.org
Hello Steve,

Many thanks for your comments. We just discussed them in the WG meeting.

At 12:54 01/02/07 -0500, Tolkin, Steve wrote:
>Certain Unicode "characters" have the same hexadecimal value
>as the ASCII control characters.
>
>For the purposes of this email I use the term "control character"
>to mean certain special code points in Unicode.
>Examples of "control characters" are U+0000 to U+001F inclusive,
>except U+0009, U+000A, and U+000D.
>
>Please clarify the proper way to handle these, e.g. with respect to
>string normalization.
>
>Specifically, in Character Model for the World Wide Web 1.0
>    W3C Working Draft 26 January 2001
>    This version: http://www.w3.org/TR/2001/WD-charmod-20010126
>    Latest version: http://www.w3.org/TR/charmod/
>section 3.5 states:
>The specification MUST NOT arbitrarily restrict the range of characters
>that can be used, which must cover all Unicode code points from 0 to
>0x10FFFF inclusive.
>
>In contrast
>Extensible Markup Language (XML) 1.0 (Second Edition)
>    W3C Recommendation 6 October 2000
>    This version: http://www.w3.org/TR/2000/REC-xml-20001006
>    Latest version: http://www.w3.org/TR/REC-xml
>section 2.2 states:
>Consequently, XML processors must accept any character in the range
>specified for Char. ...
>Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
>blocks, FFFE, and FFFF. */
>
>The above implies that a processor may reject a document containing
>a control character. In fact some processors do that, raising an error
>that the document is not well formed.

Good catch. We'll replace:

    The specification MUST NOT arbitrarily restrict the range of
    characters that can be used, which must cover all Unicode code
    points from 0 to 0x10FFFF inclusive.

with something like:

    The full range of Unicode code points extends from 0 to 0x10FFFF
    inclusive. Specifications SHOULD NOT restrict this range of
    characters.
The 'SHOULD NOT' is actually the same as 'MUST NOT arbitrarily'. We assume
that XML didn't forbid the control characters arbitrarily, but had good
reasons for doing so, so it is okay.

>The character model should be clear about how the "control
>characters" behave with respect to
>string normalization. Must they be left alone?
>May they be deleted by a conforming processor?
>Or should each one be replaced by a space, and further normalized?
>
>Or perhaps the Character Model specification should explicitly
>state that this decision is in the scope of the application.

We are not sure that we understand your comment here. The Character
Model refers to UTR 15 for the details of normalization (and for
individual characters, you have to actually go and check the data
files, or ask somebody who thinks s/he remembers). We cannot mention
each category of characters explicitly; there are just too many of them.

For your information, the control characters you mention are not
affected by normalization.

Regarding replacement with spaces, XML does some of that for tabs
and line-end-related characters. Although it may look similar to
(Unicode) normalization, it is independent of it and, as you note,
quite application-dependent. Do you think we should say somehow that
space issues are something separate?

Regards, Martin.
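
For anyone who wants to check the two points above for themselves, here is a rough Python sketch. It assumes that Python's standard unicodedata module is an acceptable stand-in for the UTR 15 data files; the helper names (is_xml_char, unchanged_by_normalization) are made up for this illustration and do not come from either specification. The ranges are copied from the Char production quoted above.

```python
import unicodedata

# Ranges copied from the Char production of XML 1.0 (Second Edition), section 2.2.
XML_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Hypothetical helper: True if the code point matches the Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def unchanged_by_normalization(text: str) -> bool:
    """Hypothetical helper: True if all four normalization forms leave text as-is."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)


# The "control characters" in the sense of the original question:
# U+0000 to U+001F inclusive, except U+0009, U+000A, and U+000D.
c0_controls = [cp for cp in range(0x20) if cp not in (0x9, 0xA, 0xD)]

# Controls that an XML 1.0 processor is not required to accept.
rejected = [cp for cp in c0_controls if not is_xml_char(cp)]

# Controls that normalization leaves alone.
stable = [cp for cp in c0_controls if unchanged_by_normalization(chr(cp))]

print(len(c0_controls), len(rejected), len(stable))
```

All three counts should come out equal: every one of these control characters falls outside Char, and none of them is altered by any normalization form. This says nothing about XML's own handling of tabs and line ends, which is a separate mechanism.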
Received on Tuesday, 27 February 2001 17:43:38 UTC