- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Wed, 15 Mar 2006 07:46:48 -0500
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: Richard Ishida <ishida@w3.org>, 'Felix Sasaki' <fsasaki@w3.org>, public-i18n-core@w3.org
- Message-ID: <20060315124648.GH20832@w3.org>
On Wed, Mar 15, 2006 at 01:31:00PM +0900, Martin Duerst wrote:
> At 23:29 06/03/14, Richard Ishida wrote:
> >
> >This is from a very quick scan...
> >
> >> A SPARQL query string is a Unicode character string (c.f.
> >> section 6.1 String concepts of [CHARMOD]) in the language
> >> defined by the following grammar, starting with the Query
> >> production. For compatibility with future versions of
> >> Unicode, the characters in this string may include unassigned
> >
> >s/include unassigned/in future include currently unassigned/
>
> The full sentence would now read:
>
> >>>>
> For compatibility with future versions of
> Unicode, the characters in this string may in future include
> currently unassigned Unicode codepoints.
> >>>>
>
> This doesn't clearly distinguish between what the grammar requires
> for conformance (any unassigned codepoint is okay, NOW), and what
> may be desirable (you better don't include unassigned stuff, because
> it just doesn't make any sense).

Hoping to avoid telling users what queries they should write while
telling implementors what parsers they should write, I propose

[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
Unicode codepoints that are unassigned as of the date of this
publication (see Identifier and Pattern Syntax [UNIID] section 4
Pattern Syntax). For productions with excluded character classes (for
example "[^<>'{}|^`]"), the characters are excluded from the range
#x0 - #x10FFFF.
]]

> Some implementers might think that they have to check that currently
> unassigned codepoints are currently not used, but that's exactly
> what we want to avoid, in order to stay open and be able to use
> future Unicode versions without having to upgrade the infrastructure.
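As a rough illustration of the point about excluded character classes, here is a minimal Python sketch. It assumes the class is read as "any codepoint in #x0 - #x10FFFF except the listed characters", with no lookup of Unicode assignment data, so currently unassigned codepoints are accepted. The names `NOT_EXCLUDED` and `matches_class` are hypothetical, and Python's `re` module only stands in for whatever lexer a SPARQL implementation would actually use.

```python
import re

# The example excluded class from the proposed text. Assumption: it is read
# as "any codepoint in #x0 - #x10FFFF except these eight characters".
# Inside the brackets, only the leading ^ negates; the later ^ is a literal.
NOT_EXCLUDED = re.compile(r"[^<>'{}|^`]")


def matches_class(ch: str) -> bool:
    """Return True if ch is exactly one character allowed by the class.

    No Unicode assignment data is consulted, so a codepoint that is
    unassigned today is treated the same as any other.
    """
    return NOT_EXCLUDED.fullmatch(ch) is not None
```

For instance, `matches_class(chr(0x0))` and `matches_class(chr(0x10FFFF))` both hold, while `matches_class("<")` does not.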

My feeling is that "unassigned as of the date of this publication"
(without manipulating the approved text beyond my editorial latitude).
The only place that range is not directly specified in the grammar is
in the excluded character classes. The second sentence describes the
motivations; the third prescribes the implementation.

Good enough for you folks? And does it set a good example for future
specs?

> >> Unicode codepoints (see Identifier and Pattern Syntax [UNIID]
> >> section 4 Pattern Syntax). For productions with excluded
> >> character classes (for example "[^<>'{}|^`]"), the characters
> >> are excluded from the range #x00 - #x10FFFF.
> >
> >If you are going to reduce U+0000 to U+00, maybe we should go the whole hog
> >and say U+0.
>
> In case of the U+ notation, always use at least four digits, as defined
> in the Unicode spec. But for the case above, #x0 - #x10FFFF is best.
> The XML spec itself uses #x0, as in:

done.

> >>>>
> The characters to be escaped are the control characters #x0 to #x1F
> and #x7F (most of which cannot appear in XML),...
> >>>>

Our language actually allows all of those in strings:
http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL_LONG1

The RDF abstract syntax does not eliminate any characters:
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-lexical-form

RDF data can come from other places than RDF/XML and there aren't a
lot of legacy SPARQL engines out there.

I think XML allows everything except #x0 in CharData:
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
http://www.w3.org/TR/2004/REC-xml11-20040204/#NT-CharData

Due to potential problems with APIs, #x0 is still forbidden both
directly and as a character reference.
http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-xml11

--
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Wednesday, 15 March 2006 12:47:02 UTC