Re: are SPARQL queries unicode?

At 21:46 05/10/27, Eric Prud'hommeaux wrote:
 >I'm involving the I18N folks in this 'cause they must have an opinion.
 >
 >Summary for I18N folks:
 >  1. SPARQL has a grammar that's specified in terms of XML's "EBNF
 >     format".
 >  2. SPARQL says that a SPARQL Query is a Unicode string that follows
 >     the grammar.
 >  3. SPARQL has a media-type registration (on deck) with no charset
 >     parameter. UTF-8 is hardcoded as the only way to express SPARQL
 >     queries in that media type.

I like this, especially the last point.
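
As a rough illustration of point 3: because the media type carries no
charset parameter, a receiver can simply decode the body as UTF-8 and
reject anything that is not well-formed. The sketch below assumes Python;
the helper name decode_sparql_body is hypothetical and does not come from
the registration.

```python
def decode_sparql_body(body: bytes) -> str:
    """Decode a query body as strict UTF-8 (hypothetical helper).

    Raises UnicodeDecodeError if the bytes are not well-formed UTF-8,
    since the media type allows no other encoding.
    """
    return body.decode("utf-8", errors="strict")


# An ASCII-only query is also valid UTF-8, so it decodes unchanged.
query = decode_sparql_body(b"SELECT ?p\n WHERE { ?s ?p ?o }")
```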

 >Below, I propose text that makes it more clear that we are using
 >unicode codepoints in our grammar.
 >
 >Is it better to say that the grammar is specified in Unicode
 >codepoints than to say that the language is a Unicode string? For
 >instance, I've attached some text
 >  SELECT ?p
 >   WHERE { ?s ?p ?o }
 >
 >in a shift-jis attachment. This is how my Japanese cell phone sends
 >text. Is it a SPARQL query? It's written in an encoding that is not
 >defined in terms of unicode, but does map to unicode (trivially, in
 >fact, for the ascii subset). My thesis is that it is better to say
 >that the grammar is Unicode than that all expressions of the language
 >are in Unicode.

Why the Shift_JIS example? Didn't you say that all queries are
in UTF-8? Or is it only the queries that are sent over the net
with the mime type you define? This would definitely be most
important for interoperability. One could imagine other encodings
e.g. for queries that get passed through some API that somehow
knows the encoding.
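
To make the distinction concrete, here is a small sketch (Python, with
illustrative names) showing that the example query, whether serialized
as Shift_JIS, ISO-2022-JP, or UTF-8, decodes to the same sequence of
Unicode code points once the receiver knows which encoding was used. It
assumes the encoding label arrives out of band, for example as an API
argument.

```python
QUERY = "SELECT ?p\n WHERE { ?s ?p ?o }"


def code_points(data: bytes, encoding: str) -> list:
    """Decode bytes with a known encoding and return the code points."""
    return [ord(ch) for ch in data.decode(encoding)]


# The same text serialized three ways; the encoding is assumed to be known.
utf8_points = code_points(QUERY.encode("utf-8"), "utf-8")
sjis_points = code_points(QUERY.encode("shift_jis"), "shift_jis")
jis_points = code_points(QUERY.encode("iso2022_jp"), "iso2022_jp")

# All three lists are equal: the grammar only ever sees code points.
same = utf8_points == sjis_points == jis_points
```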

As for terminology, saying that the language is a Unicode string
is definitely wrong. The language is not a single string, but
a set of strings (often called 'words' in formal language theory).
You fixed that above when you said 'all expressions of the language
are in Unicode'.

For what to write in the spec itself, I suggest you have a good
look at http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets.
That section, as far as I know, has passed the test of time.


 >From the candidate media type registration [REG]
 >[[
 >Encoding considerations:
 >    The syntax of the SPARQL Query Language is expressed over code
 >    points in Unicode[UNICODE 3.0]. The encoding is always UTF-8.
 >]]

Please follow Felix's suggestions for how to cite Unicode.

 >Is it a good idea to have a conservative media type?

What do you mean by 'conservative' here?

 >The protocol
 >document [PROT] includes two "bindings" (a WSDL term) and says that
 >both use UTF-8 for their encoding. When the input comes from a SOAP
 >request, it can rely upon (but does not currently dictate) RFC3023
 >"XML Media Types" for media type declaration. As the input is not
 >defined in terms of the media type, I don't think any text would
 >have to change even if the media type allowed alternate encodings.
 >
 >[REG] http://www.w3.org/2001/sw/DataAccess/rq23/#mediaType
 >[PROT] http://www.w3.org/TR/rdf-sparql-protocol/

I'm not sure I understand this. Reading the protocol document
requires a lot of knowledge about WSDL. I have only looked
at the examples. With respect to the HTTP examples, I think
it is very important to not just use "EncodedQuery", because
it is crucial for interoperability that implementations get
this encoding correct, both with respect to what characters
to escape and with respect to how to treat non-ASCII characters.
Of course, the right thing is to first use UTF-8, and then
%HH encoding, so that this is compatible with the IRI spec,
but this has to be specified. If it already follows from the
WSDL bindings, which I hope, it should nevertheless be
mentioned and used explicitly in a few examples.
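
The kind of explicit example I have in mind could look roughly like the
following sketch. It assumes Python's urllib.parse; the function name
encode_query_value is made up for illustration and is not taken from
the protocol document. The two steps are kept separate on purpose:
UTF-8 first, then %HH escaping of every byte that is not an unreserved
character.

```python
from urllib.parse import quote, unquote


def encode_query_value(query: str) -> str:
    """Encode a query string for use in a URI (hypothetical helper)."""
    utf8_bytes = query.encode("utf-8")  # step 1: characters -> UTF-8 bytes
    return quote(utf8_bytes, safe="")   # step 2: bytes -> %HH escapes


original = 'SELECT ?s WHERE { ?s ?p "caf\u00e9" }'
encoded = encode_query_value(original)

# Decoding reverses both steps: %HH -> bytes -> UTF-8 -> characters.
round_tripped = unquote(encoded, encoding="utf-8")
```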

 >On Tue, Oct 25, 2005 at 01:32:07PM -0400, Eric Prud'hommeaux wrote:
 >> Björn commented [CMNT] that productions like:
 >>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
 >> and even
 >>   WS ::= #x20 | #x9 | #xD | #xA
 >> need to specify a codepoint convention for those numbers to mean
 >> anything.
 >>
 >> We've since visited this text, but in the interest of clarity, I am
 >> considering changing our current text from:
 >> [[
 >> A SPARQL query string is a Unicode character string (c.f. section 6.1
 >> String concepts of [CHARMOD]) in the language defined by the following
 >> grammar, starting with the Query production.  The EBNF format is the
 >> same as that used in the XML 1.1 specification[XML11]. Please see the
 >> "Notation" section of that specification for specific information about
 >> the notation.
 >> ]]

I don't know how the text looked before, but the above text is
perfectly fine, because XML 1.1 explicitly links the #x notation
to ISO 10646 (which is equivalent codepoint-by-codepoint with
Unicode). Of course, there is no problem with explaining part
of the XML 1.1 notation, but in that case, it should be made
clear that this is just an explanation for convenience, not the
real thing.
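
For convenience only, here is how the two quoted productions read when
the #x numbers are taken as Unicode code points. This is a Python
sketch with made-up function names. NCCHAR1 is not shown in this
thread, so that alternative is deliberately left out and the second
function covers only the remaining alternatives of NCCHAR.

```python
# WS ::= #x20 | #x9 | #xD | #xA
WS_CODE_POINTS = {0x20, 0x9, 0xD, 0xA}


def is_ws(ch: str) -> bool:
    """True if the single character ch matches the WS production."""
    return ord(ch) in WS_CODE_POINTS


def is_ncchar_without_ncchar1(ch: str) -> bool:
    """True if ch matches NCCHAR, ignoring the NCCHAR1 alternative.

    Covers: '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
    """
    cp = ord(ch)
    return (
        ch == "-"
        or 0x30 <= cp <= 0x39        # [0-9]
        or cp == 0x00B7
        or 0x0300 <= cp <= 0x036F
        or 0x203F <= cp <= 0x2040
    )
```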

Regards,   Martin.

 >> to:
 >> [[
 >> A SPARQL query is a string (c.f. section 6.1 String concepts of
 >> [CHARMOD]) in the language defined by the following grammar, starting
 >> with the Query production.  The EBNF format is the same as that used in
 >> the XML 1.1 specification[XML11]. Numeric references,
 >> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
 >> Unicode codepoint. Please see the "Notation" section of that
 >> specification for specific information about the notation.
 >> ]]
 >>
 >> This says that the grammar is read as unicode codepoints (editorial)
 >> and says that SPARQL Queries are independent of encoding (substantive).
 >>
 >> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de
 >
 >
 >
 >--
 >-eric
 >
 >office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
 >                        Shonan Fujisawa Campus, Keio University,
 >                        5322 Endo, Fujisawa, Kanagawa 252-8520
 >                        JAPAN
 >        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
 >cell:   +81.90.6533.3882
 >
 >(eric@w3.org)
 >Feel free to forward this message to any list for any purpose other than
 >email address distribution.
 >
 >From: <ericP@t.vodafone.ne.jp>
 >To: <eric@w3.org>
 >Subject: Is this a SPARQL Query?
 >X-Priority: 3
 >MIME-Version: 1.0
 >Content-Type: text/plain; charset="iso-2022-jp"
 >Content-Transfer-Encoding: 7bit
 >Message-Id: <20051027122408617.CNZU.518014@tgms3mtts02sc1.t.vodafone.ne.jp>
 >Date: Thu, 27 Oct 2005 21:24:10 +0900
 >Reply-To: <ericP@t.vodafone.ne.jp>
 >
 >SELECT ?p
 > WHERE { ?s ?p ?o }
 >
 >

Received on Thursday, 3 November 2005 08:43:52 UTC