- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sun, 29 Jan 2006 02:20:19 +0100
- To: Eric Prud'hommeaux <eric@w3.org>
- Cc: public-rdf-dawg-comments@w3.org
* Eric Prud'hommeaux wrote:
>> In http://www.w3.org/mid/43254eca.231195140@smtp.bjoern.hoehrmann.de
>> I noticed that http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/
>> does not seem to state that the format is based on Unicode; this makes
>> character classes in the EBNF like [^#xD#xA] ambiguous. Please change
>> the draft to clearly indicate that the format is based on Unicode and
>> which characters expressions like [^#xD#xA] refer to. See also:
>> <http://www.w3.org/TR/charmod/#sec-RefProcModel>, specifically C070,
>> C077, C079, and C078.
>>
>> (Reference to Unicode has been added since, but it seems the current
>> editor's draft is still unclear about whether e.g. U+0000 may appear
>> in a query literally or escaped, as there are portability issues for
>> some of these characters, this needs to be defined more explicitly.)
>
>After a discussion on IRC, I have hope that the textual changes
>proposed in http://www.w3.org/mid/20060126021444.GZ17752@w3.org
>will address your concerns. I would like to add that I prefer
>to define SPARQL characters in terms of Unicode rather than in
>terms of XML, which are, in turn, defined in terms of Unicode.

Well, the draft should say which code points can be used directly or
through \u and \U escape sequences. It currently says, through
normative reference, that e.g. [^x] is a shorthand for

  [#x1-y] | [z-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

and it does not constrain escape sequences in any way. So I can e.g.
use \udb40\udc7f. Does this refer to the code points U+DB40,U+DC7F or
to the single character U+E007F? In XML this question does not come
up because &#xDB40;&#xDC7F; is not well-formed. (The two code points
are surrogate code points, virtual code points that allow use of
characters with a scalar value > 0xFFFF in UTF-16, for example.)
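To make the two readings of \udb40\udc7f concrete, here is a minimal Python sketch. It assumes the XML 1.1 Char ranges as they appear in the expansion above and the standard UTF-16 surrogate arithmetic; the helper names is_xml11_char and combine_surrogates are made up for illustration and come from neither draft.

```python
# Char production of XML 1.1, using the ranges quoted in the message:
# [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML11_CHAR_RANGES = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def is_xml11_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.1 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML11_CHAR_RANGES)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = 0xDB40, 0xDC7F

# Reading 1: two separate code points, neither of which matches Char.
print(is_xml11_char(high), is_xml11_char(low))   # prints: False False

# Reading 2: one supplementary character, which does match Char.
combined = combine_surrogates(high, low)
print(f"U+{combined:04X}", is_xml11_char(combined))  # prints: U+E007F True

# U+0000 is excluded by Char as well.
print(is_xml11_char(0x0))                        # prints: False
```

Under the first reading the escape names two code points outside Char; under the second it names a single character inside Char, so the draft has to pick one.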
What the draft probably should say, when referring to XML 1.1 for the
normative definition of the EBNF format, is

  Note: XML 1.1 defines the EBNF notation in terms of the Char
  production; SPARQL inherits this definition and use of code
  points like U+0000 is thus not allowed in SPARQL queries.

or something like that, and later for the \u escapes

  ... Characters referred to using escape sequences MUST match
  the production for Char as defined in XML 1.1.

This is the 'Legal Character' well-formedness constraint in XML 1.1
which prohibits &#xDB40;&#xDC7F; and its kind. Changes to this effect
would address my concern. I don't know whether this is desired though;
as I understand it, Java \uXXXX escapes may refer to surrogate pairs,
so there might be a mismatch with this.

I also note that the draft says the \u and \U escape sequences are
included in the grammar through ECHAR, but ECHAR does not actually
allow [uU] to follow the backslash.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 29 January 2006 01:19:30 UTC