Re: are SPARQL queries unicode? from Martin Duerst on 2005-11-03 (public-i18n-core@w3.org from October to December 2005)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 03 Nov 2005 17:13:06 +0900
To: "Felix Sasaki" <fsasaki@w3.org>, "Eric Prud'hommeaux" <eric@w3.org>, public-i18n-core@w3.org
Cc: public-rdf-dawg@w3.org
Message-Id: <6.0.0.20.2.20051103170333.06bf08f0@localhost>
Hello Felix, Eric,

At 13:02 05/11/02, Felix Sasaki wrote:
 >
 >Hi Eric,
 >
 >Sorry for the late reply. I would propose the following text for the media
 >type registration:
 >
 >The syntax of SPARQL is expressed in Unicode but may be written with any
 >Unicode-compatible character encoding, including UTF-8 or UTF-16, or
 >transported as US-ASCII or Latin-1 with Unicode characters outside the
 >range of the given encoding represented using an XML-style &#xddd; syntax.

This is wrong, because SPARQL doesn't use &#xddd;, and in XML, it's
usually &#xhhhh; and &#dddd; (h and d standing for hexadecimal and decimal).
SPARQL uses \uhhhh or \Uhhhhhhhh (four or eight hexadecimal digits).
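
To make the difference concrete, here is a minimal Python sketch that writes one character both ways. The function names are hypothetical and not taken from any specification; the sketch assumes the long SPARQL form takes eight hexadecimal digits after \U, as in the SPARQL grammar.

```python
def xml_char_ref(ch: str, hexadecimal: bool = True) -> str:
    """Return an XML numeric character reference: &#xhhhh; or &#dddd;."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


def sparql_escape(ch: str) -> str:
    """Return a SPARQL-style codepoint escape.

    Assumption: 4 hex digits after \\u for code points up to U+FFFF,
    8 hex digits after \\U for anything above that.
    """
    cp = ord(ch)
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


if __name__ == "__main__":
    e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
    print(xml_char_ref(e_acute))                     # hexadecimal XML form
    print(xml_char_ref(e_acute, hexadecimal=False))  # decimal XML form
    print(sparql_escape(e_acute))                    # SPARQL form
```

The two notations carry the same code point; only the surrounding syntax differs, which is why the XML form cannot simply be borrowed for SPARQL.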

Also, the proposed text restricts the use of escapes to cases where the
character cannot be expressed in the encoding used. That is good practice,
but not necessary from a spec viewpoint. It is unclear whether escaping
conventions need to be mentioned in the MIME type registration.
http://www.ietf.org/rfc/rfc3023.txt doesn't say anything about
escaping within XML at all. It's on a different layer.

Also, lots of newer protocols move from 'any encoding goes' to
having some encodings that need to be understood by all receivers
(e.g. UTF-8 and UTF-16 for XML) to using only a single encoding
(such as UTF-8). Given that SPARQL is totally new, the latest
approach should be taken in my view. (Internationalization is
a lot about 'let a thousand flowers bloom', but it is important
to do that at the right place).
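
As a rough illustration of what the single-encoding approach means for a receiver, the following Python sketch accepts a query body only if it is valid UTF-8. The function name and the error handling are assumptions for illustration, not something the SPARQL drafts prescribe.

```python
def decode_query(body: bytes) -> str:
    """Decode a received query body, accepting UTF-8 only.

    Hypothetical helper: anything that is not well-formed UTF-8 is
    rejected rather than guessed at.
    """
    try:
        return body.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("query body is not valid UTF-8") from exc


if __name__ == "__main__":
    query = "SELECT ?p WHERE { ?s ?p ?o }"
    # Pure ASCII is also valid UTF-8, so this round-trips unchanged.
    print(decode_query(query.encode("utf-8")))
```

With a single mandated encoding the receiver needs no charset parameter and no encoding detection; bytes that fail to decode are simply an error.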

 >This is based on the XQuery media type registration. It should also answer
 >your question with respect to the SPARQL grammar.

It wasn't clear from your mail whether this sentence is
part of the text you proposed, or an explanation for the
text. I assume it's the latter.

Regards,   Martin.

 >As for citing Unicode, please have a look at
 >http://www.w3.org/TR/charmod/#sec-RefUnicode
 >An example citation would be
 >The Unicode Consortium. The Unicode Standard, Version 4. ISBN
 >0-321-18578-1, as updated from time to time by the publication of new
 >versions. The latest version of Unicode and additional information on
 >versions of the standard and of the Unicode Character Database is
 >available at http://www.unicode.org/unicode/standard/versions/.
 >
 >
 >Best,
 >
 >Felix
 >
 >On Thu, 27 Oct 2005 21:46:38 +0900, Eric Prud'hommeaux <eric@w3.org> wrote:
 >
 >> I'm involving the I18N folks in this 'cause they must have an opinion.
 >>
 >> Summary for I18N folks:
 >>   1. SPARQL has a grammar that's specified in terms of the XML's "EBNF
 >>      format".
 >>   2. SPARQL says that a SPARQL Query is a unicode string that follows
 >>      the grammar.
 >>   3. SPARQL has a media-type registration (on deck) with no charset
 >>      parameter. UTF-8 is hardcoded as the only way to express SPARQL
 >>      queries in that media type.
 >>
 >> Below, I propose text that makes it more clear that we are using
 >> unicode codepoints in our grammar.
 >>
 >> Is it better to say that the grammar is specified in Unicode
 >> codepoints than to say that the language is a Unicode string? For
 >> instance, I've attached some text
 >>   SELECT ?p
 >>    WHERE { ?s ?p ?o }
 >>
 >> in a shift-jis attachment. This is how my Japanese cell phone sends
 >> text. Is it a SPARQL query? It's written in an encoding that is not
 >> defined in terms of unicode, but does map to unicode (trivially, in
 >> fact, for the ascii subset). My thesis is that it is better to say
 >> that the grammar is Unicode than that all expressions of the language
 >> are in Unicode.
 >>
 >> From the candidate media type registration [REG]
 >> [[
 >> Encoding considerations:
 >>     The syntax of the SPARQL Query Language is expressed over code
 >>     points in Unicode[UNICODE 3.0]. The encoding is always UTF-8.
 >> ]]
 >>
 >> Is it a good idea to have a conservative media type? The protocol
 >> document [PROT] includes two "bindings" (a WSDL term) and says that
 >> both use UTF-8 for their encoding. When the input comes from a SOAP
 >> request, it can rely upon (but does not currently dictate) RFC3023
 >> "XML Media Types" for media type declaration. As the input is not
 >> defined in terms of the media type, I don't think any text would
 >> have to change even if the media type allowed alternate encodings.
 >>
 >> [REG] http://www.w3.org/2001/sw/DataAccess/rq23/#mediaType
 >> [PROT] http://www.w3.org/TR/rdf-sparql-protocol/
 >>
 >> On Tue, Oct 25, 2005 at 01:32:07PM -0400, Eric Prud'hommeaux wrote:
 >>> Björn commented [CMNT] that productions like:
 >>>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] |
 >>> [#x203F-#x2040]
 >>> and even
 >>>   WS ::= #x20 | #x9 | #xD | #xA
 >>> need to specify a codepoint convention for those numbers to mean
 >>> anything.
 >>>
 >>> We've since visited this text, but in the interest of clarity, I am
 >>> considering changing our current text from:
 >>> [[
 >>> A SPARQL query string is a Unicode character string (c.f. section 6.1
 >>> String concepts of [CHARMOD]) in the language defined by the following
 >>> grammar, starting with the Query production.  The EBNF format is the
 >>> same as that used in the XML 1.1 specification[XML11]. Please see the
 >>> "Notation" section of that specification for specific information about
 >>> the notation.
 >>> ]]
 >>>
 >>> to:
 >>> [[
 >>> A SPARQL query is a string (c.f. section 6.1 String concepts of
 >>> [CHARMOD]) in the language defined by the following grammar, starting
 >>> with the Query production.  The EBNF format is the same as that used in
 >>> the XML 1.1 specification[XML11]. Numeric references,
 >>> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
 >>> unicode codepoint. Please see the "Notation" section of that
 >>> specification for specific information about the notation.
 >>> ]]
 >>>
 >>> This says that the grammar is read as unicode codepoints (editorial)
 >>> and says that SPARQL Queries are independent of encoding (substantive).
 >>>
 >>> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de
 >>
 >>
 >>
 >
 >
 >
Received on Thursday, 3 November 2005 08:43:51 UTC