Re: are SPARQL queries unicode?

Eric Prud'hommeaux wrote:
> Björn commented [CMNT] that productions like:
>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
> and even
>   WS ::= #x20 | #x9 | #xD | #xA
> need to specify a codepoint convention for those numbers to mean
> anything.
> 
> We've since visited this text, but in the interest of clarity, I am
> considering changing our current text from:
> [[
> A SPARQL query string is a Unicode character string (c.f. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production.  The EBNF format is the
> same as that used in the XML 1.1 specification[XML11]. Please see the
> "Notation" section of that specification for specific information about
> the notation.
> ]]
> 
> to:
> [[
> A SPARQL query is a string (c.f. section 6.1 String concepts of
> [CHARMOD]) in the language defined by the following grammar, starting
> with the Query production.  The EBNF format is the same as that used in
> the XML 1.1 specification[XML11]. Numeric references,
> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
> Unicode codepoint. Please see the "Notation" section of that
> specification for specific information about the notation.
> ]]
> 
> This says that the grammar is read as unicode codepoints (editorial)

Fine - this needs to be said and is said in A (the grammar section). Adding 
the "Numeric references ..." is a good idea.

> and says that SPARQL Queries are independent of encoding (substantive).

"Encoding" means UTF-8 etc., not Unicode, as I understand it.  Unicode characters 
are the abstraction and there are various ways to encode them.
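
To illustrate the distinction, here is a small Python sketch. The query text is 
a made-up example, not taken from the spec. It shows one character string 
producing different byte sequences under two encodings, while decoding either 
one recovers the same Unicode code points:

```python
# Illustrative only: the query text is a made-up example.
query = "SELECT ?é WHERE { ?é ?p ?o }"

# Two different encodings of the same character string.
utf8_bytes = query.encode("utf-8")
utf16_bytes = query.encode("utf-16-le")

# The byte sequences differ...
bytes_differ = utf8_bytes != utf16_bytes

# ...but decoding either one yields the same sequence of code points.
code_points_from_utf8 = [ord(c) for c in utf8_bytes.decode("utf-8")]
code_points_from_utf16 = [ord(c) for c in utf16_bytes.decode("utf-16-le")]
same_code_points = code_points_from_utf8 == code_points_from_utf16
```

The grammar only ever needs the decoded code points; which encoding carried 
them is a separate question.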

Charmod says:

"""
C012   [S]  The 'character string' definition SHOULD be used by most 
specifications.
"""

which suggests that plain "string" is unclear.  Using "character string", we 
then need to say what the space of characters is, and Unicode seems like a good 
choice.  Even though the new text says only that a SPARQL query is a string, it 
still has to be parsed against a Unicode grammar.
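
As a rough sketch of what "parsed against a Unicode grammar" means in practice, 
the two productions quoted above can be read as tests on code points.  The 
function names are my own, and NCCHAR1 is left out because its definition is 
not given in this message:

```python
# Sketch only: function names are hypothetical, not from the SPARQL spec.

# WS ::= #x20 | #x9 | #xD | #xA
WS_CODE_POINTS = {0x20, 0x9, 0xD, 0xA}


def is_ws(ch: str) -> bool:
    """True if the single character ch matches the WS production."""
    return ord(ch) in WS_CODE_POINTS


def is_ncchar_without_ncchar1(ch: str) -> bool:
    """True if ch matches an NCCHAR alternative other than NCCHAR1.

    NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7
             | [#x0300-#x036F] | [#x203F-#x2040]
    NCCHAR1 is omitted because it is not defined in this message.
    """
    cp = ord(ch)
    return (
        ch == "-"
        or 0x30 <= cp <= 0x39        # [0-9]
        or cp == 0x00B7
        or 0x0300 <= cp <= 0x036F
        or 0x203F <= cp <= 0x2040
    )
```

Both tests operate on characters that have already been decoded, so they give 
the same answer whatever encoding the query arrived in.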

Is there a specific use case you have in mind that the new wording allows but 
the old wording does not?

 Andy

> 
> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de

Received on Thursday, 27 October 2005 11:05:21 UTC