Re: On citing Unicode from Eric Prud'hommeaux on 2006-03-14 (public-i18n-core@w3.org from January to March 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 14 Mar 2006 08:17:13 -0500
To: Richard Ishida <ishida@w3.org>
Cc: 'Felix Sasaki' <fsasaki@w3.org>, public-i18n-core@w3.org
Message-ID: <20060314131713.GF20832@w3.org>

On Tue, Mar 14, 2006 at 12:03:05PM -0000, Richard Ishida wrote:
> > *This point about *assigned* code points is the crux of my argument.
> > If it is wrong, and "code point" includes unassigned code
> > points, we don't need this extra text in SPARQL.
> 
> The Unicode Standard defines 'code point' as "Any value in the Unicode code
> space" (p.64), i.e., you can have unassigned code points.

Excellent! It would be nice if that info were on the web.  Is that
excerpted somewhere with a convenient anchor near it? If not, I
suppose I need to include it in the SPARQL grammar definition.
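
To make the distinction concrete, here is a minimal Python sketch (not
spec text; the function names are made up for illustration). It treats
"code point" as any value in the code space, per the definition quoted
above, and separately asks the Unicode data bundled with the
interpreter whether a value has general category Cn. Note that Cn
covers both reserved (unassigned) code points and noncharacters, and
that the answer depends on the Unicode version the interpreter ships.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def is_code_point(value: int) -> bool:
    """Any value in the code space counts, assigned or not."""
    return 0 <= value <= MAX_CODE_POINT


def has_category_cn(value: int) -> bool:
    """True if the bundled Unicode data gives this code point category Cn."""
    return unicodedata.category(chr(value)) == "Cn"


# U+FFFF is a noncharacter: inside the code space, category Cn.
print(is_code_point(0xFFFF), has_category_cn(0xFFFF))
# U+0041 LATIN CAPITAL LETTER A: inside the code space, assigned.
print(is_code_point(0x41), has_category_cn(0x41))
# 0x110000 is outside the code space, so it is not a code point at all.
print(is_code_point(0x110000))
```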

> Note that CharMod refers to the full range of Unicode code points as "from
> U+0000 to U+10FFFF inclusive." http://www.w3.org/TR/charmod/#C070

That almost gives me what I need, except that _Character_string_ is
not defined in terms of a clearly stable character set:
[[
Character string: A string viewed as a sequence of characters, each
represented by a code point in Unicode [Unicode].
]]
C070 and C077 say that specs should use U+0000-U+10FFFF, but CharMod
doesn't define a character string in terms of that range except by
suggesting that you cite an evolving document. We know, by social
context, that the Unicode Consortium will only fill in code points in
that range for the foreseeable future, so _Character_string_ holds
for the same period. Not all readers of the spec share that social
context. I'm looking for the specific words to give them that.

Do you think that the text
[[
A SPARQL query string is a Unicode character string (cf. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
unassigned Unicode codepoints (see Identifier and Pattern Syntax
[UNIID] section 4 Pattern Syntax). For productions with excluded
character classes (for example "[^<>'{}|^`]"), the characters are
excluded from the range #x00 - #x10FFFF.
]]
is sufficient? It does not attribute the definition of the range
#x00-#x10FFFF to CharMod, as I don't see where CharMod actually
defines _Character_string_ as being that range, or to Unicode, as I
haven't read it closely enough to know where it states the contract
to use the range U+0000-U+10FFFF for a very long time.
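
For what the last sentence of that text is meant to say, a rough Python
sketch follows (illustrative only; the names are hypothetical and this
is not how any SPARQL parser is required to work). A negated class
such as "[^<>'{}|^`]" is read as "every code point in #x00-#x10FFFF
except the listed ones", with no test for whether the code point is
currently assigned.

```python
MAX_CODE_POINT = 0x10FFFF

# Characters listed inside the example class [^<>'{}|^`]
EXCLUDED = frozenset("<>'{}|^`")


def matches_negated_class(ch: str) -> bool:
    """True if ch is one code point in #x00-#x10FFFF and not excluded.

    Assignment status is deliberately not consulted, so code points
    that a future Unicode version assigns are already accepted.
    """
    if len(ch) != 1:
        return False
    return 0 <= ord(ch) <= MAX_CODE_POINT and ch not in EXCLUDED


print(matches_negated_class("a"))          # ordinary assigned character
print(matches_negated_class("<"))          # explicitly excluded
print(matches_negated_class(chr(0xFFFF)))  # category Cn, still in range
```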

So, advice on wording is actively solicited. The WG may be
approving this text in 62 minutes.

> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
> 
> http://www.w3.org/People/Ishida/
> http://www.w3.org/International/
> http://people.w3.org/rishida/blog/
> http://www.flickr.com/photos/ishida/
>  
> 

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Tuesday, 14 March 2006 13:17:22 UTC