Re: [OK?] Re: SPARQL and Unicode versions

Eric Prud'hommeaux wrote:
> On Sun, Jan 08, 2006 at 08:54:15AM -0600, Dan Connolly wrote:
> 
>>On Sat, 2006-01-07 at 20:01 -0800, Dave Beckett wrote:
>>
>>>Dan Connolly wrote:
>>>
>>>>On Sat, 2006-01-07 at 12:38 -0800, Dave Beckett wrote:
>>>>
>>>>
>>>>>SPARQL refers to:
>>>>>
>>>>>[[
>>>>> [UNICODE]
>>>>>   The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from
>>>>> time to time by the publication of new versions. The latest version of
>>>>> Unicode and additional information on versions of the standard and of
>>>>> the Unicode Character Database is available at
>>>>> http://www.unicode.org/unicode/standard/versions/.
>>>>>
>>>>>]]
>>>>>
>>>>>which cites a moving target.  Please define SPARQL in terms of a
>>>>>particular version of Unicode only, and no other.  Otherwise if or when
>>>>>the Unicode Consortium makes some incompatible changes, all existing
>>>>>implementations become invalid.
>>>>
>>>>
>>>>How so? How is conformance to SPARQL sensitive to changes in Unicode?
>>>
>>>The SPARQL query syntax is defined on Unicode characters:
>>>
>>>[[
>>>A. SPARQL Grammar
>>>
>>>A SPARQL query string is a Unicode character string (c.f. section 6.1
>>>String concepts of [CHARMOD])
>>>...
>>>]]
>>>
>>>although the grammar defines precise ranges of codepoints for particular
>>>things such as names of variables (based on XML 1.1 I think).
>>>
>>>If the definition of a Unicode character string changes in some future
>>>Unicode revision, such as for example by allowing additional codepoints,
>>>then there will be additional codepoints allowed in a SPARQL query
>>>string, following the sentence above.
>>
>>I believe that's by design, following...
>>
>>"C063  [S]  A generic reference to the Unicode Standard MUST be made if
>>it is desired that characters allocated after a specification is
>>published are usable with that specification".
>>  http://www.w3.org/TR/2005/REC-charmod-20050215/#C063
>>
>>I suppose I should check with the WG.
>>
>>
>>>Any part of the grammar that uses a negated range such as with '[^...]'
>>>will allow such codepoints.  Examples include:
>>>  http://www.w3.org/TR/rdf-sparql-query/#rQ_IRI_REF
>>>and all string literals.
>>>
>>>These codepoints may be refused by something implementing Unicode 4.0
>>>and no more.
>>
>>I suppose we need a test case that uses a codepoint that isn't currently
>>allocated in Unicode 4.0.
>>
>>I still can't think of any reason why changes in Unicode specs would
>>make any difference to SPARQL producers/consumers. It's not like
>>they need to reference the Unicode tables to check the grammar or
>>anything.
> 
> 
> Due to lineage and good intentions, the SPARQL grammar mirrors the
> XML1.1 spec. For instance, our name chars
>   http://www.w3.org/2001/sw/DataAccess/rq23/#rNCCHAR1p
> are slight liberalizations of XML name chars
>   http://www.w3.org/TR/xml11/#NT-NameStartChar
> Strings
>   http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL1
> are analogous to CharData
>   http://www.w3.org/TR/xml11/#NT-CharData
> 
> Basically, our grammar follows XML's lead and maps out the use of Unicode
> chars from #x00 to #xEFFFF . All Unicode chars are in this range, but
> there are lots of holes (currently undefined chars). My reading of the
> XML spec is that the grammar is fixed as Unicode grows and fills these
> holes. However, if Unicode extends beyond #xEFFFF, XML1.1 apps will
> not handle these new chars. To clarify this, and to address
> Björn's comments, I will propose the following text at the top of the
> grammar definition:
> 
> [[
> A SPARQL query string is a Unicode character string (c.f. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production.  The EBNF format is the
> same as that used in the XML 1.1 specification [XML11]. Please see the
> "Notation" section of that specification for specific information
> about the notation.
> 
> [ Informative: this specification maps out the usage of Unicode
> characters between #x00 and #xEFFFF. Excluded character sets,
> for example "[^<>'{}|^`]", indicate the range of [#x00-#xEFFFF] minus
> the listed characters. This specification does not include any
> future Unicode characters outside of the range [#x00-#xEFFFF]. ]
> 
> The following sections list all additional constraints on a valid
> SPARQL query:
> ...
> A.5 Escape sequences in strings
> 
> Escaped characters in strings (STRING_LITERAL1, STRING_LITERAL2,
> STRING_LITERAL_LONG1, STRING_LITERAL_LONG2) must be in the character
> ranges defined by those rules.
> ]]
> 
> Dave, Björn, what do you think?

That change is OK with me.  I guess having found out from your
description more about how future Unicode changes will occur, I would be
quite happy with no change to the text if that suits you.  The
informative addition helps this understanding.
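
As an illustration of the negated-range point discussed above, here is a
minimal sketch, assuming Python's re module as a stand-in for a SPARQL
tokenizer. The pattern is built from the excluded set "[^<>'{}|^`]" in the
informative text, with #x00-#x20 also excluded; it is an approximation for
illustration, not the normative IRI_REF production. The helper names
(first_unassigned_codepoint, accepts) are hypothetical. It suggests that a
range-based check accepts a code point that the local Unicode database
reports as unassigned, without consulting the Unicode tables at all.

```python
import re
import unicodedata

# Assumption: excluded characters taken from the informative note above,
# plus #x00-#x20. Not the normative SPARQL grammar rule.
IRI_REF = re.compile(r"<[^<>'{}|^`\x00-\x20]*>")


def first_unassigned_codepoint():
    """Return the first code point above ASCII that this Python's
    Unicode database reports as unassigned (general category Cn)."""
    for cp in range(0x80, 0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise LookupError("no unassigned code point found")


def accepts(token):
    """True if the whole token matches the range-based pattern."""
    return IRI_REF.fullmatch(token) is not None


cp = first_unassigned_codepoint()
token = "<http://example.org/" + chr(cp) + ">"
print(f"U+{cp:04X} accepted: {accepts(token)}")
```

The pattern never looks up whether a code point is assigned, so a later
Unicode version that fills such a hole would not change what it accepts.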

[ Your message didn't seem to be addressed to Björn ]

Thanks

Dave

Received on Thursday, 26 January 2006 03:29:57 UTC