Re: are SPARQL queries unicode? from Eric Prud'hommeaux on 2005-10-27 (public-rdf-dawg@w3.org from October to December 2005)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 27 Oct 2005 08:46:38 -0400
To: public-i18n-core@w3.org
Cc: public-rdf-dawg@w3.org
Message-ID: <20051027124638.GW412@w3.org>

I'm involving the I18N folks in this 'cause they must have an opinion.

Summary for I18N folks:
  1. SPARQL has a grammar that's specified in terms of XML's "EBNF
     format".
  2. SPARQL says that a SPARQL Query is a Unicode string that follows
     the grammar.
  3. SPARQL has a media-type registration (on deck) with no charset
     parameter. UTF-8 is hardcoded as the only way to express SPARQL
     queries in that media type.

Below, I propose text that makes it more clear that we are using
unicode codepoints in our grammar.

Is it better to say that the grammar is specified in Unicode
codepoints than to say that the language is a Unicode string? For
instance, I've attached some text
  SELECT ?p
   WHERE { ?s ?p ?o }

in a shift-jis attachment. This is how my Japanese cell phone sends
text. Is it a SPARQL query? It's written in an encoding that is not
defined in terms of unicode, but does map to unicode (trivially, in
fact, for the ascii subset). My thesis is that it is better to say
that the grammar is Unicode than that all expressions of the language
are in Unicode.
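
To illustrate the point above, here is a minimal Python sketch (not part
of the original message). It assumes Python's built-in shift_jis and
utf-8 codecs, and the second query, with a Japanese literal, is a
hypothetical example invented for illustration. It decodes the same
query from both encodings and compares the resulting Unicode codepoints.

```python
# Query text from the message (ASCII only).
ascii_query = "SELECT ?p\n WHERE { ?s ?p ?o }"

# Hypothetical query with a Japanese literal, made up for illustration.
literal_query = 'SELECT ?s WHERE { ?s ?p "日本語" }'


def code_points(data: bytes, encoding: str) -> list:
    """Decode bytes with the given encoding and return Unicode codepoints."""
    return [ord(ch) for ch in data.decode(encoding)]


def same_query(text: str) -> bool:
    """True if the Shift_JIS and UTF-8 forms decode to the same codepoints."""
    sjis = text.encode("shift_jis")
    utf8 = text.encode("utf-8")
    return code_points(sjis, "shift_jis") == code_points(utf8, "utf-8")


# For the ASCII subset the two byte sequences are identical.
ascii_bytes_match = ascii_query.encode("shift_jis") == ascii_query.encode("utf-8")

# For the Japanese literal the bytes differ between the two encodings.
literal_bytes_match = literal_query.encode("shift_jis") == literal_query.encode("utf-8")

print(ascii_bytes_match, same_query(ascii_query))
print(literal_bytes_match, same_query(literal_query))
```

In this sketch the byte sequences only coincide for the ASCII query, yet
both queries decode to the same codepoint sequence from either encoding,
which is the level at which the grammar would apply.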

From the candidate media type registration [REG]
[[
Encoding considerations:
    The syntax of the SPARQL Query Language is expressed over code
    points in Unicode[UNICODE 3.0]. The encoding is always UTF-8.
]]

Is it a good idea to have a conservative media type? The protocol
document [PROT] includes two "bindings" (a WSDL term) and says that
both use UTF-8 for their encoding. When the input comes from a SOAP
request, it can rely upon (but does not currently dictate) RFC3023
"XML Media Types" for media type declaration. As the input is not
defined in terms of the media type, I don't think any text would
have to change even if the media type allowed alternate encodings.

[REG] http://www.w3.org/2001/sw/DataAccess/rq23/#mediaType
[PROT] http://www.w3.org/TR/rdf-sparql-protocol/

On Tue, Oct 25, 2005 at 01:32:07PM -0400, Eric Prud'hommeaux wrote:
> Björn commented [CMNT] that productions like:
>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
> and even
>   WS ::= #x20 | #x9 | #xD | #xA
> need to specify a codepoint convention for those numbers to mean
> anything.
> 
> We've since visited this text, but in the interest of clarity, I am
> considering changing our current text from:
> [[
> A SPARQL query string is a Unicode character string (c.f. section 6.1
> String concepts of [CHARMOD]) in the language defined by the following
> grammar, starting with the Query production.  The EBNF format is the
> same as that used in the XML 1.1 specification[XML11]. Please see the
> "Notation" section of that specification for specific information about
> the notation.
> ]]
> 
> to:
> [[
> A SPARQL query is a string (c.f. section 6.1 String concepts of
> [CHARMOD]) in the language defined by the following grammar, starting
> with the Query production.  The EBNF format is the same as that used in
> the XML 1.1 specification[XML11]. Numeric references,
> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
> unicode codepoint. Please see the "Notation" section of that
> specification for specific information about the notation.
> ]]
> 
> This says that the grammar is read as unicode codepoints (editorial)
> and says that SPARQL Queries are independent of encoding (substantive).
> 
> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de
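
As a rough illustration of reading the quoted productions as Unicode
codepoints, here is a minimal Python sketch (not part of the original
thread). The function names is_ws and is_ncchar_extra are hypothetical,
and NCCHAR1 is left out because its definition is not quoted above, so
only the remaining NCCHAR alternatives are checked.

```python
def is_ws(ch: str) -> bool:
    """WS ::= #x20 | #x9 | #xD | #xA, with the numbers read as codepoints."""
    return ord(ch) in (0x20, 0x9, 0xD, 0xA)


def is_ncchar_extra(ch: str) -> bool:
    """The NCCHAR alternatives other than NCCHAR1:
    '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
    """
    cp = ord(ch)
    return (
        ch == "-"
        or 0x30 <= cp <= 0x39          # [0-9]
        or cp == 0x00B7                # MIDDLE DOT
        or 0x0300 <= cp <= 0x036F      # combining diacritical marks
        or 0x203F <= cp <= 0x2040      # UNDERTIE, CHARACTER TIE
    )


# The test is made on codepoints, so the byte encoding the character
# arrived in does not matter once it has been decoded.
space_from_sjis = b"\x20".decode("shift_jis")
space_from_utf8 = b"\x20".decode("utf-8")
print(is_ws(space_from_sjis), is_ws(space_from_utf8))
print(is_ncchar_extra("\u00b7"), is_ncchar_extra("a"))
```

Here "a" is rejected only because it would be matched by NCCHAR1, which
this sketch deliberately does not model.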

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Forwarded message 1

From: ericP@t.vodafone.ne.jp <ericP@t.vodafone.ne.jp>
Date: Thu, 27 Oct 2005 21:24:10 +0900
Subject: Is this a SPARQL Query?
To: <eric@w3.org>
Message-Id: <20051027122408617.CNZU.518014@tgms3mtts02sc1.t.vodafone.ne.jp>

SELECT ?p
 WHERE { ?s ?p ?o }

Received on Thursday, 27 October 2005 12:46:48 UTC