Re: proposed clarifications to the SPARQL grammar from Seaborne, Andy on 2006-03-10 (public-rdf-dawg@w3.org from January to March 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Fri, 10 Mar 2006 16:04:46 +0000
To: Eric Prud'hommeaux <eric@w3.org>
CC: public-rdf-dawg@w3.org
Message-ID: <4411A39E.6050901@hp.com>

Eric Prud'hommeaux wrote:
> I addressed the "SPARQL and Unicode versions" comment with some text
> proposed in
>   http://www.w3.org/mid/20060126021444.GZ17752@w3.org
> Bjoern Hoehrmann pointed out several remaining shortcomings in
>   http://www.w3.org/mid/90vnt1dqjg0d74lfe4j21f69bpofniafea@hive.bjoern.hoehrmann.de
> To address these issues, I propose the following change to
>   http://www.w3.org/2001/sw/DataAccess/rq23/#grammar
> 
> I would like to change A. SPARQL Grammar from
> [[
> A SPARQL query string is a Unicode character string (c.f. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production.  The EBNF format is the same
> as that used in the XML 1.1 specification[XML11]. Please see the
> "Notation" section of that specification for specific information about
> the notation.
> 
> In addition, the following sections apply.
> ]]
> to
> [[
> A SPARQL query string is a Unicode character string (c.f. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production. For compatibility with future
> versions of Unicode, the characters in this string may include unassigned
> Unicode codepoints (see Identifier and Pattern Syntax [UNIID] section 4
> Pattern Syntax). For productions with excluded character classes (for
> example "[^<>'{}|^`]"), the characters are excluded from the range #x00 -
> #xEFFFFF.
> 
> The EBNF notation used in the grammar is defined in Extensible Markup
> Language (XML) 1.1 [XML11] section 6 Notation.
> 
> In addition, rules A.1 to A.5 apply.
> ]]

Content-wise that seems like a good change.

Editorially, I wonder if it would be clearer to:
+ add a Unicode section as A.1 (bumping the remaining sections up by one), and
+ move the EBNF text to A.7.

Alternatively, just move the EBNF text and put the Unicode text in a separate
paragraph.

> 
> and add an informative reference to
> 
> [UNIID] Identifier and Pattern Syntax 4.1.0, Mark Davis, Unicode Standard
> Annex #31, 25 March 2005, http://www.unicode.org/reports/tr31/tr31-5.html .
> Latest version available at http://www.unicode.org/reports/tr31/ .
> 
> 
> 
> Further, I would like to address Bjoern's comments on escape sequences by
> modifying
> [[
> A.5 Escape sequences in strings
> 
> Strings are used for the lexical form of RDF terms and in expressions.
> Within a string, the following escape sequences apply. The escape
> character is backslash "\" (#x5C). No other escape sequences are defined
> for strings.  Names for characters given are the common names.
> 
> These escape sequences apply to all rules making up the rule for string
> (rules: STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1,
> STRING_LITERAL_LONG2).
> 
> <table>
> 
> where HEX  is a hexadecimal character
> 
>     HEX ::= [0-9] | [A-F] | [a-f]
> 
> Examples:
> ...
> ]]
> to
> [[
> A.5 Escape sequences in strings
> 
> The following escape sequences may be used in any string production
> (e.g. STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1,
> STRING_LITERAL_LONG2):
> 
> <table>

What about the HEX definition? It does not appear in the new text.

> 
> Any escaped character in the range #x00 - #xEFFFFF may appear in any
> string production. For instance, "\n" may appear in a STRING_LITERAL1 even
> though the unescaped form is not valid in that production.
> ]]
> 

I think this is the right direction for string literals.  The \n illustration 
is good.
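
To make the \n point concrete, here is a minimal Python sketch. It is only an
illustration, not text from the draft: the escape table in the proposal is
elided above, so the SIMPLE_ESCAPES mapping is an assumed set of backslash
escapes, and the names _BODY and unescape_literal1 are hypothetical. It shows
an escaped newline being accepted in a single-quoted string body while a raw
newline is rejected.

```python
import re

# Assumed escape table (the <table> in the proposal is elided, so this
# mapping is illustrative only).
SIMPLE_ESCAPES = {
    "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}

# Body of a single-quoted string: any character except ', backslash, LF or CR,
# or one of the assumed backslash escapes.
_BODY = re.compile(r"(?:[^'\\\n\r]|\\[tnrbf\"'\\])*")


def unescape_literal1(body):
    """Return the string value for the text between the single quotes.

    Raises ValueError if the body contains a raw quote, a raw line break,
    or a backslash that does not start one of the assumed escapes.
    """
    if _BODY.fullmatch(body) is None:
        raise ValueError("not a valid single-quoted string body")
    # Replace each backslash pair with the character it stands for.
    return re.sub(r"\\(.)", lambda m: SIMPLE_ESCAPES[m.group(1)], body)
```

With this sketch, unescape_literal1(r"a\nb") yields a three-character string
containing a real line feed, whereas passing a body that already contains a
raw line feed raises ValueError.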

> This clarifies n points:
>   - parsers must be able to process currently unassigned Unicode characters.
>   - SPARQL strings include the character #x00.
>   - which codepoints can be produced through \uU escape sequences.
>   - there *is* a difference between escaped characters in strings and
>     escaped characters in variable names and IRI references.
> 
> I specify the range to be #x00 - #xEFFFFF while XML 1.1 uses #x01 -
> #xEFFFFF, citing "Due to potential problems with APIs, #x0 is still
> forbidden both directly and as a character reference." I read our LC
> document as allowing #x00 - #xEFFFFF and am trying to avoid any
> changes to the language at this late date. I don't think the
> liberalization will hurt us.

The only part I can't judge is #x00.  XML left it out for a reason; I'm
happy to include it in SPARQL but would prefer a positive reason for doing so.
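
As one possible reading of the "problems with APIs" that the XML 1.1 text
alludes to, the following Python sketch shows a query string being cut short
when it passes through a NUL-terminated C buffer. The query text and variable
names are made up for illustration, and this is my guess at the kind of
problem meant, not something the XML specification spells out.

```python
import ctypes

# Hypothetical query whose string literal contains #x00.
query = b'SELECT * WHERE { ?s ?p "a\x00b" }'

# Copy the bytes into a C character buffer (one extra byte is added for the
# terminating NUL).
buf = ctypes.create_string_buffer(query)

# .value reads the buffer as a NUL-terminated C string, so it stops at the
# embedded #x00; .raw returns every byte in the buffer.
c_view = buf.value
full_view = buf.raw
```

Here c_view ends just before the embedded #x00, so any API that treats the
query as a C string would see a truncated, syntactically broken query.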

 Andy
Received on Friday, 10 March 2006 16:05:08 UTC