- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Wed, 25 Jan 2006 21:14:44 -0500
- To: Dan Connolly <connolly@w3.org>
- Cc: Dave Beckett <dave@dajobe.org>, public-rdf-dawg-comments@w3.org
- Message-ID: <20060126021444.GZ17752@w3.org>
On Sun, Jan 08, 2006 at 08:54:15AM -0600, Dan Connolly wrote:
>
> On Sat, 2006-01-07 at 20:01 -0800, Dave Beckett wrote:
> > Dan Connolly wrote:
> > > On Sat, 2006-01-07 at 12:38 -0800, Dave Beckett wrote:
> > >
> > >>SPARQL refers to:
> > >>
> > >>[[
> > >>  [UNICODE]
> > >>    The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from
> > >>    time to time by the publication of new versions. The latest version of
> > >>    Unicode and additional information on versions of the standard and of
> > >>    the Unicode Character Database is available at
> > >>    http://www.unicode.org/unicode/standard/versions/.
> > >>
> > >>]]
> > >>
> > >>which cites a moving target. Please define SPARQL in terms of a
> > >>particular version of Unicode only, and no other. Otherwise if or when
> > >>this Unicode consortium makes some incompatible changes, all existing
> > >>implementations become invalid.
> > >
> > >
> > > How so? How is conformance to SPARQL sensitive to changes in Unicode?
> >
> > The SPARQL query syntax is defined on Unicode characters:
> >
> > [[
> >   A. SPARQL Grammar
> >
> >   A SPARQL query string is a Unicode character string (c.f. section 6.1
> >   String concepts of [CHARMOD])
> >   ...
> > ]]
> >
> > although the grammar defines precise ranges of codepoints for particular
> > things such as names of variables (based on XML 1.1 I think).
> >
> > If the definition of a Unicode character string changes in some future
> > Unicode revision, such as for example by allowing additional codepoints,
> > then there will be additional codepoints allowed in a SPARQL query
> > string, following the sentence above.
>
> I believe that's by design, following...
>
> "C063 [S] A generic reference to the Unicode Standard MUST be made if
> it is desired that characters allocated after a specification is
> published are usable with that specification".
> http://www.w3.org/TR/2005/REC-charmod-20050215/#C063
>
> I suppose I should check with the WG.
>
> > Any part of the grammar that uses an negated range such as with '[^...]'
> > will allow such codepoints. Examples include:
> >   http://www.w3.org/TR/rdf-sparql-query/#rQ_IRI_REF
> > and all string literals.
> >
> > These codepoints may be refused by something implementing Unicode 4.0
> > and no more.
>
> I suppose we need a test case that uses a codepoint that isn't currently
> allocated in Unicode 4.0.
>
> I still can't think of any reason why changes in Unicode specs would
> make any difference to SPARQL producers/consumers. It's not like
> they need to reference the Unicode tables to check the grammar or
> anything.

Due to lineage and good intentions, the SPARQL grammar mirrors the
XML 1.1 spec. For instance, our name chars
  http://www.w3.org/2001/sw/DataAccess/rq23/#rNCCHAR1p
are slight liberalizations of XML name chars
  http://www.w3.org/TR/xml11/#NT-NameStartChar
Strings
  http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL1
are analogous to CharData
  http://www.w3.org/TR/xml11/#NT-CharData

Basically, our grammar follows XML's lead and maps out the use of
Unicode chars from #x00 to #xEFFFF. All Unicode chars are in this
range, but there are lots of holes (currently undefined chars). My
reading of the XML spec is that the grammar stays fixed as Unicode
grows and fills these holes. However, if Unicode extends beyond
#xEFFFF, XML 1.1 apps will not handle these new chars.

To clarify this, and to address Björn's comments, I will propose the
following text at the top of the grammar definition:

[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. The EBNF format is the
same as that used in the XML 1.1 specification [XML11]. Please see the
"Notation" section of that specification for specific information
about the notation.

[ Informative: this specification maps out the usage of Unicode
  characters between #x00 and #xEFFFF. Excluded character sets, for
  example "[^<>'{}|^`]", indicate the range of [#x00-#xEFFFF] minus
  the listed characters. This specification does not include any
  future Unicode characters outside of the range [#x00-#xEFFFF]. ]

The following sections list all additional constraints on a valid
SPARQL query:
...
A.5 Escape sequences in strings

Escaped characters in strings (STRING_LITERAL1, STRING_LITERAL2,
STRING_LITERAL_LONG1, STRING_LITERAL_LONG2) must be in the character
ranges defined by those rules.
]]

Dave, Björn, what do you think?
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
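
Illustration of the informative note in the proposed text: one possible
reading of an excluded character set such as "[^<>'{}|^`]" is "any code
point in [#x00-#xEFFFF] except the listed characters". The Python sketch
below only illustrates that reading. The names UPPER_BOUND, EXCLUDED,
in_negated_class and all_allowed are hypothetical and come from neither
the SPARQL nor the XML 1.1 specification.

```python
# Illustrative sketch only: every name here is hypothetical, not spec text.
UPPER_BOUND = 0xEFFFF   # upper limit stated in the proposed informative note
EXCLUDED = "<>'{}|^`"   # the characters listed inside "[^<>'{}|^`]"


def in_negated_class(ch: str, excluded: str = EXCLUDED) -> bool:
    """Return True if ch lies in [#x00-#xEFFFF] and is not an excluded char."""
    return 0x00 <= ord(ch) <= UPPER_BOUND and ch not in excluded


def all_allowed(text: str, excluded: str = EXCLUDED) -> bool:
    """Return True if every character of text passes in_negated_class."""
    return all(in_negated_class(c, excluded) for c in text)
```

Under this reading a code point that is unallocated today but lies inside
the range is accepted without consulting any Unicode tables, while
anything above #xEFFFF is rejected regardless of the Unicode version.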
Received on Thursday, 26 January 2006 02:14:48 UTC