
Re: On citing Unicode

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 16 Mar 2006 14:41:54 +0900
Message-Id: <6.0.0.20.2.20060316143745.08fc6950@localhost>
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: Richard Ishida <ishida@w3.org>, "'Felix Sasaki'" <fsasaki@w3.org>, public-i18n-core@w3.org

At 21:46 06/03/15, Eric Prud'hommeaux wrote:
 >On Wed, Mar 15, 2006 at 01:31:00PM +0900, Martin Duerst wrote:
 >> At 23:29 06/03/14, Richard Ishida wrote:
 >> >
 >> >This is from a very quick scan...
 >> >
 >> >> A SPARQL query string is a Unicode character string (c.f.
 >> >> section 6.1 String concepts of [CHARMOD]) in the language
 >> >> defined by the following grammar, starting with the Query
 >> >> production. For compatibility with future versions of
 >> >> Unicode, the characters in this string may include unassigned
 >> >
 >> >s/include unassigned/in future include currently unassigned/
 >>
 >> The full sentence would now read:
 >>
 >> >>>>
 >> For compatibility with future versions of
 >> Unicode, the characters in this string may in future include
 >> currently unassigned Unicode codepoints.
 >> >>>>
 >>
 >> This doesn't clearly distinguish between what the grammar requires
 >> for conformance (any unassigned codepoint is okay, NOW), and what
 >> may be desirable (you had better not include unassigned stuff, because
 >> it just doesn't make any sense).
 >
 >Hoping to avoid telling users what queries they should write while
 >telling implementors what parsers they should write, I propose
 >[[
 >A SPARQL query string is a Unicode character string (c.f. section 6.1
 >String concepts of [CHARMOD]) in the language defined by the following
 >grammar, starting with the Query production. For compatibility with
 >future versions of Unicode, the characters in this string may include
 >Unicode codepoints that are unassigned as of the date of this
 >publication (see Identifier and Pattern Syntax [UNIID] section 4
 >Pattern Syntax). For productions with excluded character classes (for
 >example "[^<>'{}|^`]"), the characters are excluded from the range
 >#x0 - #x10FFFF.
 >]]

That text looks good to me.



 >The only place that range is not directly specified in the grammar is
 >in the excluded character classes.

Well, yes, but it would be very weird if an excluded character class
suddenly included characters that are not allowed in the grammar itself,
or excluded (without them being literally excluded) characters
that are otherwise allowed. (The latter sometimes happens with
things like newline characters, but then that's explicitly mentioned.)
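
As a rough illustration of that reading (a minimal sketch, not text from the
SPARQL draft; the names EXCLUDED and matches_negated_class are made up for
this example), a negated class can be treated as the complement of its listed
characters within #x0 - #x10FFFF, so unassigned code points pass like any
other character:

```python
# Hypothetical helper, for illustration only: the names are not from any spec.
MAX_CODE_POINT = 0x10FFFF

# The characters listed in the example class "[^<>'{}|^`]" from the proposed text.
EXCLUDED = frozenset("<>'{}|^`")


def matches_negated_class(ch: str, excluded: frozenset = EXCLUDED) -> bool:
    """Return True if ch is one code point in #x0-#x10FFFF that is not excluded."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # No assignment check is made: only the range and the explicit list matter.
    return 0 <= ord(ch) <= MAX_CODE_POINT and ch not in excluded


print(matches_negated_class("a"))           # ordinary character
print(matches_negated_class("<"))           # literally excluded
print(matches_negated_class("\U0010FFFF"))  # upper end of the range
```

The point of the sketch is that the parser never consults a table of assigned
characters; only the range and the literal exclusions decide the match.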

 >The second sentence describes the
 >motivations; the third prescribes the implementation.
 >
 >good enough for you folks? and does it set a good example for future
 >specs?

Yes.


 >> >>>>
 >> The characters to be escaped are the control characters #x0 to #x1F
 >> and #x7F (most of which cannot appear in XML),...
 >> >>>>
 >
 >Our language actually allows all of those in strings:
 >  http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL_LONG1
 >The RDF abstract syntax does not eliminate any characters:
 >  http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-lexical-form
 >RDF data can come from other places than RDF/XML and there aren't a
 >lot of legacy SPARQL engines out there.
 >
 >I think XML allows everything except #x0 in CharData:

Yes for XML 1.1. No for XML 1.0.     Regards,   Martin.
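
For reference, a minimal sketch of the difference, assuming the Char
productions of XML 1.0 and XML 1.1 as I recall them (the function names are
made up for this example). Note that XML 1.1 additionally requires most
control characters to appear as character references rather than literally
(its RestrictedChar production), which this sketch does not model:

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


# The control characters discussed above: #x0 to #x1F plus #x7F.
controls = list(range(0x0, 0x20)) + [0x7F]

xml10_rejected = [cp for cp in controls if not is_xml10_char(cp)]
xml11_rejected = [cp for cp in controls if not is_xml11_char(cp)]

print("rejected by XML 1.0:", [hex(cp) for cp in xml10_rejected])
print("rejected by XML 1.1:", [hex(cp) for cp in xml11_rejected])
```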

 >office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
 >                        Shonan Fujisawa Campus, Keio University,
 >                        5322 Endo, Fujisawa, Kanagawa 252-8520
 >                        JAPAN

P.S.: Are you back at Keio :-?
Received on Thursday, 16 March 2006 06:59:09 GMT
