[OK?] Re: SPARQL: format based on Unicode?

On Sun, Jan 29, 2006 at 02:20:19AM +0100, Bjoern Hoehrmann wrote:
> 
> * Eric Prud'hommeaux wrote:
> >>   In http://www.w3.org/mid/43254eca.231195140@smtp.bjoern.hoehrmann.de
> >> I noticed that http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/
> >> does not seem to state that the format is based on Unicode; this makes
> >> character classes in the EBNF like [^#xD#xA] ambiguous. Please change
> >> the draft to clearly indicate that the format is based on Unicode and
> >> which characters expressions like [^#xD#xA] refer to.  See also:
> >> <http://www.w3.org/TR/charmod/#sec-RefProcModel>, specifically C070,
> >> C077, C079, and C078.
> >> 
> >> (Reference to Unicode has been added since, but it seems the current
> >> editor's draft is still unclear about whether e.g. U+0000 may appear
> >> in a query literally or escaped. As there are portability issues for
> >> some of these characters, this needs to be defined more explicitly.)
> 
> >After a discussion on IRC, I have hope that the textual changes
> >proposed in http://www.w3.org/mid/20060126021444.GZ17752@w3.org
> >will address your concerns. I would like to add that I prefer
> >to define SPARQL characters in terms of Unicode rather than in
> >terms of XML, which are, in turn, defined in terms of Unicode.
> 
> 
> Well, the draft should say which code points can be used directly or
> through \u or \U escape sequences. It currently says, through a
> normative reference, that e.g. [^x] is a shorthand for
> 
>   [#x1-y] | [z-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> 
> And it does not constrain escape sequences in any way. So I can e.g.
> use \udb40\udc7f. Does this refer to the code points U+DB40,U+DC7F or
> to the single character U+E007F? In XML this question does not come
> up because &#xdb40;&#xdc7f; is not well-formed. (The two code points
> are surrogate code points, virtual code points that allow use of
> characters with a scalar value > 0xFFFF in UTF-16, for example). What
> the draft probably should say, when referring to XML 1.1 for the
> normative definition of the EBNF format, is
> 
>   Note: XML 1.1 defines the EBNF notation in terms of the Char
>         production; SPARQL inherits this definition and use of
>         code points like U+0000 is thus not allowed in SPARQL
>         queries.
> 
> or something like that, and later for the \u escapes
> 
>   ... Characters referred to using escape sequences MUST
>   match the production for Char as defined in XML 1.1.
> 
> This is the 'Legal Character' well-formedness constraint in XML 1.1
> which prohibits &#xdb40;&#xdc7f; and its kind. Changes to this effect
> would address my concern. I don't know whether this is desired, though;
> as I understand it, Java \uXXXX escapes may refer to surrogate pairs,
> so there might be a mismatch with this.
> 
> I also note that the draft says the \u and \U escape sequences are
> included in the grammar through ECHAR but ECHAR does not actually allow
> [uU] to follow the backslash.
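
To make the surrogate question above concrete, here is a minimal sketch
in Python. It assumes the standard UTF-16 pairing arithmetic and the
Char production from XML 1.1; the function names combine_surrogates and
is_xml11_char are hypothetical and appear in neither draft.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_xml11_char(cp: int) -> bool:
    """True if cp matches the XML 1.1 Char production:
    [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The pair from the message: \udb40\udc7f
pair_value = combine_surrogates(0xDB40, 0xDC7F)
print(hex(pair_value))             # 0xe007f
print(is_xml11_char(0xDB40))       # False: a lone surrogate is not a Char
print(is_xml11_char(pair_value))   # True: U+E007F is a Char
```

Under a "MUST match Char" rule, each of \udb40 and \udc7f would be
rejected on its own, while the single escape \U000E007F would be
accepted.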

I have positive feedback [I18N] from i18n on:

[[
A. SPARQL Grammar

A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
Unicode codepoints that are unassigned as of the date of this
publication (see Identifier and Pattern Syntax [UNIID] section 4
Pattern Syntax). For productions with excluded character classes (for
example [^<>'{}|^`]), the characters are excluded from the range #x0 -
#x10FFFF.

The EBNF notation used in the grammar is defined in Extensible Markup
Language (XML) 1.1 [XML11] section 6 Notation.

In addition, rules A.1 to A.5 apply.

...

A.5 Escape sequences in strings

The following escape sequences may be used in any string production
(e.g. STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1,
STRING_LITERAL_LONG2): 

<table elided but consistent with earlier editions/>

where HEX is a hexadecimal character:

    HEX ::= [0-9] | [A-F] | [a-f]

Any escaped character in the range #x0 - #x10FFFF may appear in any
string production. For instance, "\n" may appear in a STRING_LITERAL1
even though the unescaped form is not valid in that production.

Examples:

...
]]
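
As a rough illustration of how an implementation might read A.5, here
is a sketch in Python that decodes only the numeric escapes. It assumes
that \u takes exactly four HEX characters and \U exactly eight, which
the elided table would have to confirm; the name decode_numeric_escapes
is hypothetical, and every other backslash escape is deliberately left
untouched.

```python
import re

# A backslash followed by u + 4 HEX, U + 8 HEX, or any other single character.
# Matching "any other character" keeps an escaped backslash (\\) from being
# mistaken for the start of a numeric escape.
_ESCAPE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))", re.DOTALL
)


def decode_numeric_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX with the code points they name."""

    def replace(match: "re.Match[str]") -> str:
        if match.group(3) is not None:
            # Not a well-formed numeric escape: leave it for a later stage.
            return match.group(0)
        cp = int(match.group(1) or match.group(2), 16)
        if cp > 0x10FFFF:
            # The proposed text limits escaped characters to #x0 - #x10FFFF.
            raise ValueError("escape outside #x0 - #x10FFFF: %X" % cp)
        return chr(cp)

    return _ESCAPE.sub(replace, text)


print(decode_numeric_escapes(r"caf\u00E9"))           # café
print(decode_numeric_escapes(r"\U000E007F") == chr(0xE007F))  # True
```

Note that this sketch accepts \udb40 as a lone surrogate code point,
because the proposed wording allows any escaped character in the range
#x0 - #x10FFFF; a stricter reading would add a check at that point.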

As always, please indicate if this satisfies your comments.

[I18N] http://www.w3.org/mid/6.0.0.20.2.20060316143745.08fc6950@localhost
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Thursday, 23 March 2006 20:25:56 UTC