Re: [OK?] Re: SPARQL: format based on Unicode? from Bjoern Hoehrmann on 2006-01-29 (public-rdf-dawg-comments@w3.org from January 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 29 Jan 2006 02:20:19 +0100
To: Eric Prud'hommeaux <eric@w3.org>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <90vnt1dqjg0d74lfe4j21f69bpofniafea@hive.bjoern.hoehrmann.de>

* Eric Prud'hommeaux wrote:
>>   In http://www.w3.org/mid/43254eca.231195140@smtp.bjoern.hoehrmann.de
>> I noticed that http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/
>> does not seem to state that the format is based on Unicode; this makes
>> character classes in the EBNF like [^#xD#xA] ambiguous. Please change
>> the draft to clearly indicate that the format is based on Unicode and
>> which characters expressions like [^#xD#xA] refer to.  See also:
>> <http://www.w3.org/TR/charmod/#sec-RefProcModel>, specifically C070,
>> C077, C079, and C078.
>> 
>> (A reference to Unicode has been added since, but the current
>> editor's draft still seems unclear about whether e.g. U+0000 may
>> appear in a query literally or escaped. As there are portability issues
>> for some of these characters, this needs to be defined more explicitly.)

>After a discussion on IRC, I have hope that the textual changes
>proposed in http://www.w3.org/mid/20060126021444.GZ17752@w3.org
>will address your concerns. I would like to add that I prefer
>to define SPARQL characters in terms of Unicode rather than in
>terms of XML, which are, in turn, defined in terms of Unicode.

Well, the draft should say which code points can be used directly or
through \u and \U escape sequences. It currently says, through a
normative reference, that e.g. [^x] is shorthand for

  [#x1-y] | [z-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

And it does not constrain escape sequences in any way. So I can e.g.
use \udb40\udc7f. Does this refer to the two code points U+DB40 and
U+DC7F, or to the single character U+E007F? In XML this question does not come
up because &#xdb40;&#xdc7f; is not well-formed. (The two code points
are surrogate code points, virtual code points that allow use of
characters with a scalar value > 0xFFFF in UTF-16, for example.) What
the draft probably should say, when referring to XML 1.1 for the
normative definition of the EBNF notation, is

  Note: XML 1.1 defines the EBNF notation in terms of the Char
        production; SPARQL inherits this definition and use of
        code points like U+0000 is thus not allowed in SPARQL
        queries.

or something like that, and later for the \u escapes

  ... Characters referred to using escape sequences MUST
  match the production for Char as defined in XML 1.1.

This is the 'Legal Character' well-formedness constraint in XML 1.1
which prohibits &#xdb40;&#xdc7f; and its kind. Changes to this effect
would address my concern. I don't know whether this is desired, though:
as I understand it, Java \uXXXX escapes may refer to surrogate pairs,
so there might be a mismatch with this.
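
To make the two readings concrete, here is a rough Python sketch, not
anything taken from the draft. The helper names are made up for
illustration, and it assumes \u takes four hex digits and \U takes
eight. It checks each escaped code point against the XML 1.1 Char
production, and separately shows what a UTF-16 style reading of the
same pair would give.

```python
import re

# XML 1.1 Char production:
#   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML11_CHAR_RANGES = ((0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))

# Assumed escape syntax: \u + 4 hex digits, \U + 8 hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def is_xml11_char(cp):
    # True if the code point matches Char as defined in XML 1.1.
    return any(lo <= cp <= hi for lo, hi in XML11_CHAR_RANGES)


def escaped_code_points(text):
    # Code points named by the escapes in text, taken one escape at a time.
    return [int(m.group(1) or m.group(2), 16) for m in ESCAPE.finditer(text)]


def combine_surrogates(high, low):
    # UTF-16 style reading: a high/low surrogate pair names one character.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


fragment = r"\udb40\udc7f"
code_points = escaped_code_points(fragment)

# Reading 1: each escape is a code point of its own, and must match Char.
for cp in code_points:
    print("U+%04X matches Char: %s" % (cp, is_xml11_char(cp)))

# Reading 2: the two escapes together form a surrogate pair.
combined = combine_surrogates(*code_points)
print("U+%04X matches Char: %s" % (combined, is_xml11_char(combined)))
```

Under the first reading both escapes fail the Char test, which is the
effect the proposed MUST would have; under the second the pair is
accepted as U+E007F, which is how I understand the Java-style reading.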

I also note that the draft says the \u and \U escape sequences are
included in the grammar through ECHAR but ECHAR does not actually allow
[uU] to follow the backslash.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Sunday, 29 January 2006 01:19:30 UTC