RE: On citing Unicode from Martin Duerst on 2006-03-15 (public-i18n-core@w3.org from January to March 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 15 Mar 2006 13:31:00 +0900
To: "Richard Ishida" <ishida@w3.org>, "'Eric Prud'hommeaux'" <eric@w3.org>
Cc: "'Felix Sasaki'" <fsasaki@w3.org>, <public-i18n-core@w3.org>
Message-Id: <6.0.0.20.2.20060315131747.08b51b50@localhost>

At 23:29 06/03/14, Richard Ishida wrote:
 >
 >This is from a very quick scan...
 >
 >> A SPARQL query string is a Unicode character string (c.f.
 >> section 6.1 String concepts of [CHARMOD]) in the language
 >> defined by the following grammar, starting with the Query
 >> production. For compatibility with future versions of
 >> Unicode, the characters in this string may include unassigned
 >
 >s/include unassigned/in future include currently unassigned/

The full sentence would now read:

 >>>>
For compatibility with future versions of
Unicode, the characters in this string may in future include
currently unassigned Unicode codepoints.
 >>>>

This doesn't clearly distinguish between what the grammar requires
for conformance (any unassigned codepoint is okay, NOW) and what
may be desirable (you had better not include unassigned codepoints,
because doing so just doesn't make any sense).

Some implementers might think that they have to check that currently
unassigned codepoints are currently not used, but that's exactly
what we want to avoid, in order to stay open and be able to use
future Unicode versions without having to upgrade the infrastructure.
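
To illustrate the distinction, here is a minimal Python sketch, not
taken from any SPARQL implementation. The function names are made up
for this example. The conformance check looks only at the code point
range; the separate "is it assigned?" lookup depends on whichever
Unicode version the local library ships, which is exactly why it
should stay out of the conformance path.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper end of the range used by the grammar


def in_grammar_range(ch: str) -> bool:
    """Range-only check: true for any single code point in #x0 - #x10FFFF."""
    return len(ch) == 1 and 0 <= ord(ch) <= MAX_CODE_POINT


def is_assigned_in_local_ucd(ch: str) -> bool:
    """Advisory only (hypothetical helper): the answer changes with the
    Unicode version bundled in this Python build, so do not use it to
    reject input."""
    return unicodedata.category(ch) != "Cn"


def accepts(query: str) -> bool:
    """Conformance-style check that never consults assignment status."""
    return all(in_grammar_range(ch) for ch in query)
```

With this split, a string containing a code point that the local
Unicode tables do not know yet is still accepted, and nothing has to
be upgraded when a later Unicode version assigns it.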

 >> Unicode codepoints (see Identifier and Pattern Syntax [UNIID]
 >> section 4 Pattern Syntax). For productions with excluded
 >> character classes (for example "[^<>'{}|^`]"), the characters
 >> are excluded from the range #x00 - #x10FFFF.
 >
 >If you are going to reduce U+0000 to U+00, maybe we should go the whole hog
 >and say U+0.

In the case of the U+ notation, always use at least four digits, as defined
in the Unicode spec. But for the case above, #x0 - #x10FFFF is best.
The XML spec itself uses #x0, as in:

 >>>>
The characters to be escaped are the control characters #x0 to #x1F
and #x7F (most of which cannot appear in XML),...
 >>>>
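
The two notations can be sketched in Python as follows. This is only
an illustration; the helper names are invented here and do not come
from either spec.

```python
def u_plus(cp: int) -> str:
    """U+ notation: uppercase hex, zero-padded to at least four digits."""
    return f"U+{cp:04X}"


def hash_x(cp: int) -> str:
    """#x notation as in the XML spec text quoted above: no zero padding."""
    return f"#x{cp:X}"


# The range from the draft, written both ways.
print(u_plus(0x0), "-", u_plus(0x10FFFF))   # U+0000 - U+10FFFF
print(hash_x(0x0), "-", hash_x(0x10FFFF))   # #x0 - #x10FFFF
```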

Regards,    Martin.

Received on Wednesday, 15 March 2006 04:56:00 UTC