Re: On citing Unicode from Mark Davis on 2006-03-15 (public-i18n-core@w3.org from January to March 2006)

From: Mark Davis <mark.davis@icu-project.org>
Date: Tue, 14 Mar 2006 19:12:13 -0800
To: Eric Prud'hommeaux <eric@w3.org>
CC: Richard Ishida <ishida@w3.org>, 'Felix Sasaki' <fsasaki@w3.org>, public-i18n-core@w3.org
Message-ID: <4417860D.9030801@icu-project.org>

To be absolutely correct, you would write:

[[
Unicode String: A sequence of Unicode code points [Unicode]. Also known as a 'character string': note however that a Unicode String may contain code points that are reserved (that is, not assigned to characters), to allow for compatibility with future versions of Unicode that may assign them.
]]
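
As a rough illustration of that definition (a sketch only, not part of the proposed wording): Python's str type is also a sequence of code points, and it accepts code points that are not yet assigned to characters. The helper names below are made up for this example, and which code points count as unassigned depends on the Unicode version bundled with the Python build.

```python
import unicodedata
from typing import List


def first_unassigned_code_point() -> int:
    """Return the lowest code point with general category Cn.

    Cn covers reserved (unassigned) code points and noncharacters; the
    result depends on the Unicode version shipped with this Python build.
    """
    for cp in range(0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise LookupError("no Cn code point found")


def code_points(s: str) -> List[int]:
    """View a string as its sequence of code points."""
    return [ord(ch) for ch in s]


reserved = first_unassigned_code_point()

# A string may carry a reserved code point alongside assigned characters.
s = "a" + chr(reserved) + "b"
print([f"U+{cp:04X}" for cp in code_points(s)])
```

The string is perfectly well formed even though its middle code point has no character assigned to it, which is the property the definition above is meant to capture.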


[[
A SPARQL query string is a Unicode string (cf. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For productions with excluded
character classes (for example "[^<>'{}|^`]"), the characters are
excluded from the range #x0000 - #x10FFFF.
]]
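
One way to read the last sentence of that text, sketched in Python: an excluded character class matches every code point in #x0000 - #x10FFFF except the listed characters, whether or not the code point is assigned. The function and constant names are hypothetical and chosen for this example; this is not taken from any SPARQL implementation.

```python
# The characters listed in the example class "[^<>'{}|^`]".
EXCLUDED = frozenset("<>'{}|^`")

# Upper bound of the Unicode code space.
MAX_CODE_POINT = 0x10FFFF


def matches_excluded_class(cp: int, excluded=EXCLUDED) -> bool:
    """True if code point cp is in #x0000 - #x10FFFF and is not excluded."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return False  # outside the Unicode code space entirely
    return chr(cp) not in excluded


print(matches_excluded_class(ord("a")))   # ordinary character
print(matches_excluded_class(ord("<")))   # explicitly excluded
print(matches_excluded_class(0x10FFFF))   # top of the range
print(matches_excluded_class(0x110000))   # beyond the code space
```

Note that the check never consults the assignment status of a code point, so the class stays stable as future versions of Unicode assign more characters.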



Eric Prud'hommeaux wrote:
> On Tue, Mar 14, 2006 at 12:03:05PM -0000, Richard Ishida wrote:
>   
>>> This point about *assigned* code points is the crux of my argument.
>>> If it is wrong, and "code point" includes unassigned code
>>> points, we don't need this extra text in SPARQL.
>>>       
>> The Unicode Standard defines 'code point' as "Any value in the Unicode code
>> space" (p.64), i.e. you can have unassigned code points.
>>     
>
> Excellent! It would be nice if that info were on the web.  Is that
> excerpted somewhere with a convenient anchor near it? If not, I
> suppose I need to include it in SPARQL grammar definition.
>
>   
>> Note that CharMod refers to the full range of Unicode code points as "from
>> U+0000 to U+10FFFF inclusive." http://www.w3.org/TR/charmod/#C070
>>     
>
> That almost gives me what I need, except that _Character_string_ is
> not defined in terms of a clearly stable character set:
> [[
> Character string: A string viewed as a sequence of characters, each
> represented by a code point in Unicode [Unicode].
> ]]
> C070 and C077 say that specs should use U+0000-U+10FFFF but charmod
> doesn't define a character string in terms of that range except by
> suggesting that you cite an evolving document. We know, by social
> context, that the Unicode consortium will only fill in code points in
> that range for the foreseeable future, so _Character_string_ is good
> for the same time. Not all readers of the spec share that social
> context. I'm looking for the specific words to add to give them that.
>
> Do you think that the text
> [[
> A SPARQL query string is a Unicode character string (cf. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production. For compatibility with
> future versions of Unicode, the characters in this string may include
> unassigned Unicode code points (see Identifier and Pattern Syntax
> [UNIID] section 4 Pattern Syntax). For productions with excluded
> character classes (for example "[^<>'{}|^`]"), the characters are
> excluded from the range #x00 - #x10FFFF.
> ]]
> is sufficient? It does not attribute the definition of the range
> #x00-#x10FFFF to either CharMod, as I don't see where CharMod actually
> defines _Character_string_ as being that range, or to Unicode, as I
> haven't read it enough to know where it states the contract to use the
> range U+0000-U+10FFFF for a very long time.
>
> So, advice on wording is actively solicited. The WG may be
> approving this text in 62 minutes.
>
>   
>> ============
>> Richard Ishida
>> Internationalization Lead
>> W3C (World Wide Web Consortium)
>>
>> http://www.w3.org/People/Ishida/
>> http://www.w3.org/International/
>> http://people.w3.org/rishida/blog/
>> http://www.flickr.com/photos/ishida/
>>  
>>
>>     
>
>

Received on Wednesday, 15 March 2006 03:12:34 UTC