[CLOSED] Re: SPARQL and Unicode versions

replying with a [CLOSED] subject to indicate comment resolution.

On Wed, Jan 25, 2006 at 07:29:50PM -0800, Dave Beckett wrote:
> 
> Eric Prud'hommeaux wrote:
> > On Sun, Jan 08, 2006 at 08:54:15AM -0600, Dan Connolly wrote:
> > 
> >>On Sat, 2006-01-07 at 20:01 -0800, Dave Beckett wrote:
> >>
> >>>Dan Connolly wrote:
> >>>
> >>>>On Sat, 2006-01-07 at 12:38 -0800, Dave Beckett wrote:
> >>>>
> >>>>
> >>>>>SPARQL refers to:
> >>>>>
> >>>>>[[
> >>>>> [UNICODE]
> >>>>>   The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from
> >>>>> time to time by the publication of new versions. The latest version of
> >>>>> Unicode and additional information on versions of the standard and of
> >>>>> the Unicode Character Database is available at
> >>>>> http://www.unicode.org/unicode/standard/versions/.
> >>>>>
> >>>>>]]
> >>>>>
> >>>>>which cites a moving target.  Please define SPARQL in terms of a
> >>>>>particular version of Unicode only, and no other.  Otherwise if or when
> >>>>>this Unicode consortium makes some incompatible changes, all existing
> >>>>>implementations become invalid.
> >>>>
> >>>>
> >>>>How so? How is conformance to SPARQL sensitive to changes in Unicode?
> >>>
> >>>The SPARQL query syntax is defined on Unicode characters:
> >>>
> >>>[[
> >>>A. SPARQL Grammar
> >>>
> >>>A SPARQL query string is a Unicode character string (c.f. section 6.1
> >>>String concepts of [CHARMOD])
> >>>...
> >>>]]
> >>>
> >>>although the grammar defines precise ranges of codepoints for particular
> >>>things such as names of variables (based on XML 1.1 I think).
> >>>
> >>>If the definition of a Unicode character string changes in some future
> >>>Unicode revision, such as for example by allowing additional codepoints,
> >>>then there will be additional codepoints allowed in a SPARQL query
> >>>string, following the sentence above.
> >>
> >>I believe that's by design, following...
> >>
> >>"C063  [S]  A generic reference to the Unicode Standard MUST be made if
> >>it is desired that characters allocated after a specification is
> >>published are usable with that specification".
> >>  http://www.w3.org/TR/2005/REC-charmod-20050215/#C063
> >>
> >>I suppose I should check with the WG.
> >>
> >>
> >>>Any part of the grammar that uses a negated range such as with '[^...]'
> >>>will allow such codepoints.  Examples include:
> >>>  http://www.w3.org/TR/rdf-sparql-query/#rQ_IRI_REF
> >>>and all string literals.
> >>>
> >>>These codepoints may be refused by something implementing Unicode 4.0
> >>>and no more.
> >>
> >>I suppose we need a test case that uses a codepoint that isn't currently
> >>allocated in Unicode 4.0.
> >>
> >>I still can't think of any reason why changes in Unicode specs would
> >>make any difference to SPARQL producers/consumers. It's not like
> >>they need to reference the Unicode tables to check the grammar or
> >>anything.
> > 
> > 
> > Due to lineage and good intentions, the SPARQL grammar mirrors the
> > XML1.1 spec. For instance, our name chars
> >   http://www.w3.org/2001/sw/DataAccess/rq23/#rNCCHAR1p
> > are slight liberalizations of XML name chars
> >   http://www.w3.org/TR/xml11/#NT-NameStartChar
> > Strings
> >   http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL1
> > are analogous to CharData
> >   http://www.w3.org/TR/xml11/#NT-CharData
> > 
> > Basically, our grammar follows XML's lead and maps out the use of Unicode
> > chars from #x00 to #xEFFFF . All Unicode chars are in this range, but
> > there are lots of holes (currently undefined chars). My reading of the
> > XML spec is that the grammar is fixed as Unicode grows and fills these
> > holes. However, if Unicode extends beyond #xEFFFF, XML1.1 apps will
> > not handle these new chars. To clarify this, and to address
> > Björn's comments, I will propose the following text at the top of the
> > grammar definition:
> > 
> > [[
> > A SPARQL query string is a Unicode character string (c.f. section 6.1
> > String concepts of [CHARMOD]) in the language defined by the following
> > grammar, starting with the Query production.  The EBNF format is the
> > same as that used in the XML 1.1 specification[XML11]. Please see the
> > "Notation" section of that specification for specific information
> > about the notation.
> > 
> > [ Informative: this specification maps out the usage of Unicode
> > characters between #x00 and #xEFFFF. Excluded character sets,
> > for example "[^<>'{}|^`]", indicate the range of [#x00-#xEFFFF] minus
> > the listed characters. This specification does not include any
> > future Unicode characters outside of the range [#x00-#xEFFFF]. ]
> > 
> > The following sections list all additional constraints on a valid
> > SPARQL query:
> > ...
> > A.5 Escape sequences in strings
> > 
> > Escaped characters in strings (STRING_LITERAL1, STRING_LITERAL2,
> > STRING_LITERAL_LONG1, STRING_LITERAL_LONG2) must be in the character
> > ranges defined by those rules.
> > ]]
> > 
> > Dave, Björn, what do you think?
> 
> That change is OK with me.  I guess having found out from your
> description more about how future Unicode changes will occur, I would be
> quite happy with no change to the text if that suits you.  The
> informative addition helps this understanding.

My preference is to have that text in the spec so that other folks
will gain the same understanding. I assume from your response that
that is also acceptable. The XML1.1 spec does not explain this and
I think it leaves people wondering.
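
A minimal sketch of how the informative note can be read, in Python.
The names MAX_CODEPOINT, EXCLUDED and allowed are made up for this
illustration, and the excluded set is only the example class from the
note, not an actual production of the grammar. The point it tries to
show is that membership depends on the codepoint value alone, so no
Unicode tables are consulted:

```python
# Upper bound taken from the informative note in this thread (an
# assumption of this sketch, not a statement about Unicode itself).
MAX_CODEPOINT = 0xEFFFF

# The example excluded set from the note: [^<>'{}|^`]
EXCLUDED = set("<>'{}|^`")


def allowed(ch: str) -> bool:
    """Return True if ch is in [#x00-#xEFFFF] minus the excluded characters.

    Only the numeric codepoint is inspected; whether a given Unicode
    version has assigned a character to it is never checked.
    """
    return ord(ch) <= MAX_CODEPOINT and ch not in EXCLUDED


def all_allowed(text: str) -> bool:
    """Return True if every character of text passes allowed()."""
    return all(allowed(ch) for ch in text)
```

Under this reading, a codepoint inside the range is accepted whether or
not any Unicode version has assigned it, while anything above #xEFFFF
is rejected regardless of what a future Unicode version does with it.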

> [ Your message didn't seem to be addressed to Björn ]

Björn see all

actually, in response to that point, I mailed Björn directly with a
link to this thread. tx.

> Thanks
> 
> Dave

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Thursday, 26 January 2006 15:11:05 UTC