Re: On citing Unicode from Eric Prud'hommeaux on 2006-03-15 (public-i18n-core@w3.org from January to March 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 15 Mar 2006 07:46:48 -0500
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: Richard Ishida <ishida@w3.org>, 'Felix Sasaki' <fsasaki@w3.org>, public-i18n-core@w3.org
Message-ID: <20060315124648.GH20832@w3.org>
On Wed, Mar 15, 2006 at 01:31:00PM +0900, Martin Duerst wrote:
> At 23:29 06/03/14, Richard Ishida wrote:
> >
> >This is from a very quick scan...
> >
> >> A SPARQL query string is a Unicode character string (c.f.
> >> section 6.1 String concepts of [CHARMOD]) in the language
> >> defined by the following grammar, starting with the Query
> >> production. For compatibility with future versions of
> >> Unicode, the characters in this string may include unassigned
> >
> >s/include unassigned/in future include currently unassigned/
> 
> The full sentence would now read:
> 
> >>>>
> For compatibility with future versions of
> Unicode, the characters in this string may in future include
> currently unassigned Unicode codepoints.
> >>>>
> 
> This doesn't clearly distinguish between what the grammar requires
> for conformance (any unassigned codepoint is okay, NOW), and what
> may be desirable (you had better not include unassigned stuff, because
> it just doesn't make any sense).

Hoping to avoid telling users what queries they should write while
telling implementors what parsers they should write, I propose
[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
Unicode codepoints that are unassigned as of the date of this
publication (see Identifier and Pattern Syntax [UNIID] section 4
Pattern Syntax). For productions with excluded character classes (for
example "[^<>'{}|^`]"), the characters are excluded from the range
#x0 - #x10FFFF.
]]
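
As a rough illustration of the last sentence of the proposed text (a
sketch only, not part of the proposal): a parser can treat an excluded
character class as the complement of the listed characters within
#x0 - #x10FFFF, without consulting any table of assigned codepoints.
The helper name in_excluded_class is hypothetical, and U+0378 in the
usage note is only assumed to be unassigned at the time of writing.

```python
import re

# The excluded character class used as the example above:
# any single character except < > ' { } | ^ `
EXCLUDED_CLASS = re.compile(r"[^<>'{}|^`]")


def in_excluded_class(codepoint: int) -> bool:
    """Return True if the codepoint is matched by the excluded class.

    Hypothetical helper: the complement is taken over #x0 - #x10FFFF,
    and no check is made of whether the codepoint is assigned.
    """
    if not 0x0 <= codepoint <= 0x10FFFF:
        return False
    return EXCLUDED_CLASS.fullmatch(chr(codepoint)) is not None
```

Under these assumptions, in_excluded_class(0x0378) is accepted in the
same way as any assigned codepoint, while in_excluded_class(ord("<"))
is rejected because "<" is listed in the class.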

> Some implementers might think that they have to check that currently
> unassigned codepoints are currently not used, but that's exactly
> what we want to avoid, in order to stay open and be able to use
> future Unicode versions without having to upgrade the infrastructure.

My feeling is that "unassigned as of the date of this publication"
addresses this (without manipulating the approved text beyond my
editorial latitude).
The only place that range is not directly specified in the grammar is
in the excluded character classes. The second sentence describes the
motivations; the third prescribes the implementation.

good enough for you folks? and does it set a good example for future
specs?

> >> Unicode codepoints (see Identifier and Pattern Syntax [UNIID]
> >> section 4 Pattern Syntax). For productions with excluded
> >> character classes (for example "[^<>'{}|^`]"), the characters
> >> are excluded from the range #x00 - #x10FFFF.
> >
> >If you are going to reduce U+0000 to U+00, maybe we should go the whole hog
> >and say U+0.
> 
> In case of the U+ notation, always use at least four digits, as defined
> in the Unicode spec. But for the case above, #x0 - #x10FFFF is best.
> The XML spec itself uses #x0, as in:

done.

> >>>>
> The characters to be escaped are the control characters #x0 to #x1F
> and #x7F (most of which cannot appear in XML),...
> >>>>

Our language actually allows all of those in strings:
  http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL_LONG1
The RDF abstract syntax does not eliminate any characters:
  http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-lexical-form
RDF data can come from other places than RDF/XML and there aren't a
lot of legacy SPARQL engines out there.

I think XML allows everything except #x0 in CharData:
  [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
  http://www.w3.org/TR/2004/REC-xml11-20040204/#NT-CharData
  Due to potential problems with APIs, #x0 is still forbidden both
  directly and as a character reference.
  http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-xml11
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Wednesday, 15 March 2006 12:47:02 UTC