Re: are SPARQL queries unicode?

On Thu, Nov 03, 2005 at 05:37:56PM +0900, Martin Duerst wrote:
> 
> At 21:46 05/10/27, Eric Prud'hommeaux wrote:
> >I'm involving the I18N folks in this 'cause they must have an opinion.
> >
> >Summary for I18N folks:
> >  1. SPARQL has a grammar that's specified in terms of XML's "EBNF
> >     format".
> >  2. SPARQL says that a SPARQL Query is a Unicode string that follows
> >     the grammar.
> >  3. SPARQL has a media-type registration (on deck) with no charset
> >     parameter. UTF-8 is hardcoded as the only way to express SPARQL
> >     queries in that media type.
> 
> I like this, especially the last point.

Too bad no one uses it...

> >Below, I propose text that makes it clearer that we are using
> >Unicode codepoints in our grammar.
> >
> >Is it better to say that the grammar is specified in Unicode
> >codepoints than to say that the language is a Unicode string? For
> >instance, I've attached some text
> >  SELECT ?p
> >   WHERE { ?s ?p ?o }
> >
> >in a Shift_JIS attachment. This is how my Japanese cell phone sends
> >text. Is it a SPARQL query? It's written in an encoding that is not
> >defined in terms of Unicode, but does map to Unicode (trivially, in
> >fact, for the ASCII subset). My thesis is that it is better to say
> >that the grammar is Unicode than that all expressions of the language
> >are in Unicode.
> 
> Why the Shift_JIS example? Didn't you say that all queries are
> in UTF-8? Or is it only the queries that are sent over the net
> with the mime type you define? This would definitely be most
> important for interoperability. One could imagine other encodings
> e.g. for queries that get passed through some API that somehow
> knows the encoding.

I chatted with Martin and explained that the SPARQL Query Language
does not require a specific encoding, and he explained to me that XML
is still the state of the art for charmod compliance.

XML uses ISO 10646
  http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
[[
Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646.
]]
which is interchangeable with Unicode.

The Shift_JIS example was to see if the language could include
non-Unicode charsets that intersect with Unicode for at least all
the characters used in a given query (my Shift_JIS example used only
[A-Za-z\?\{\}\.]). I gather that the answer is "no": the only
way I can know where they intersect is if I use a Shift_JIS that's
defined in terms of Unicode.
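
A minimal sketch of that conclusion, assuming Python's built-in
`shift_jis` codec stands in for "a Shift_JIS that's defined in terms of
Unicode" (the variable names are illustrative only). The bytes of the
example query can be compared with the grammar only after a
Unicode-defined mapping has turned them into code points:

```python
# The example query from the attachment; it uses only ASCII-range characters.
query = "SELECT ?p\n WHERE { ?s ?p ?o }"

# Encode with a Shift_JIS codec that is defined as a mapping to and from Unicode.
sjis_bytes = query.encode("shift_jis")
utf8_bytes = query.encode("utf-8")

# Decoding through the codec recovers the Unicode code points the grammar uses.
code_points = [ord(ch) for ch in sjis_bytes.decode("shift_jis")]

# For the characters in this query the two encodings produce identical bytes.
same_bytes = sjis_bytes == utf8_bytes
```

The identical bytes are a property of this particular query, not of
Shift_JIS in general; without the codec's mapping there is no way to
tell where the two character sets agree.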

> As for terminology, saying that the language is a Unicode string
> is definitely wrong. The language is not a single string, but
> a set of strings (often called 'words' in formal language theory).
> You fixed that above when you said 'all expressions of the language
> are in Unicode'.
> 
> For what to write in the spec itself, I suggest you have a good
> look at http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets.
> That section, as far as I know, has passed the test of time.
> 
> 
> >From the candidate media type registration [REG]
> >[[
> >Encoding considerations:
> >    The syntax of the SPARQL Query Language is expressed over code
> >    points in Unicode[UNICODE 3.0]. The encoding is always UTF-8.
> >]]
> 
> Please follow Felix's suggestions for how to cite Unicode.

done -- http://www.w3.org/mid/20051119202228.GB17026@w3.org

> >Is it a good idea to have a conservative media type?
> 
> What do you mean by 'conservative' here?

"conservative" = Allowing only one encoding instead of having a
default encoding and an optional charset parameter.

I guess that's fine. This will encourage APIs and other emergent
protocols to use UTF-8, which will simplify life for implementors
and users.

> >The protocol
> >document [PROT] includes two "bindings" (a WSDL term) and says that
> >both use UTF-8 for their encoding. When the input comes from a SOAP
> >request, it can rely upon (but does not currently dictate) RFC3023
> >"XML Media Types" for media type declaration. As the input is not
> >defined in terms of the media type, I don't think any text would
> >have to change even if the media type allowed alternate encodings.
> >
> >[REG] http://www.w3.org/2001/sw/DataAccess/rq23/#mediaType
> >[PROT] http://www.w3.org/TR/rdf-sparql-protocol/
> 
> I'm not sure I understand this. Reading the protocol document
> requires a lot of knowledge about WSDL. I have only looked
> at the examples. With respect to the HTTP examples, I think
> it is very important to not just use "EncodedQuery", because
> it is crucial for interoperability that implementations get
> this encoding correct, both with respect to what characters
> to escape and with respect to how to treat non-ASCII characters
> (of course, the right thing is to first use UTF-8, and then
> %HH encoding, so that this is compatible with the IRI spec,
> but this has to be specified (unless it follows from the
> WSDL bindings, which I hope, but in which case it should
> nevertheless be mentioned and used in a few examples
> explicitly)).

This is my current to-do list for the Protocol document:

  1 Describe and cite the mechanics to create an EncodedQuery.

  2 Propose an example, probably
    http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/kanji-02.rq
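
A minimal sketch of item 1, assuming the EncodedQuery is built the way
Martin describes (UTF-8 first, then %HH escaping) and using Python's
standard `urllib.parse.quote`. The helper name `encode_query` and the
sample query are hypothetical, not text from the Protocol document:

```python
from urllib.parse import quote, unquote

def encode_query(query: str) -> str:
    """Encode the query as UTF-8, then %HH-escape every byte outside the unreserved set."""
    # safe="" also escapes "/" so the result can be used as a single parameter value.
    return quote(query, safe="", encoding="utf-8")

# Hypothetical query containing non-ASCII characters.
sample = 'SELECT ?s WHERE { ?s ?p "日本" }'
encoded = encode_query(sample)

# Decoding reverses both steps and returns the original Unicode string.
decoded = unquote(encoded, encoding="utf-8")
```

Escaping the UTF-8 bytes rather than the bytes of some other encoding is
what keeps the result compatible with the IRI spec.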

> >On Tue, Oct 25, 2005 at 01:32:07PM -0400, Eric Prud'hommeaux wrote:
> >> Björn commented [CMNT] that productions like:
> >>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
> >> and even
> >>   WS ::= #x20 | #x9 | #xD | #xA
> >> need to specify a codepoint convention for those numbers to mean
> >> anything.
> >>
> >> We've since visited this text, but in the interest of clarity, I am
> >> considering changing our current text from:
> >> [[
> >> A SPARQL query string is a Unicode character string (c.f. section 6.1
> >> String concepts of [CHARMOD]) in the language defined by the following
> >> grammar, starting with the Query production.  The EBNF format is the
> >> same as that used in the XML 1.1 specification[XML11]. Please see the
> >> "Notation" section of that specification for specific information about
> >> the notation.
> >> ]]
> 
> I don't know how the text looked before, but the above text is
> perfectly fine, because XML 1.1 explicitly links the #x notation
> to ISO 10646 (which is equivalent codepoint-by-codepoint with
> Unicode). Of course, there is no problem with explaining part
> of the XML 1.1 notation, but in that case, it should be made
> clear that this is just an explanation for convenience, not the
> real thing.

Excellent. This is already done.
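
To illustrate that reading, here is a minimal sketch that treats the #x
numbers in the two quoted productions as Unicode code points. The
function names are hypothetical, and the NCCHAR1 alternative is left out
because its definition is not quoted in this thread:

```python
# WS ::= #x20 | #x9 | #xD | #xA, read as Unicode code points.
WS_CODE_POINTS = {0x20, 0x9, 0xD, 0xA}

def is_ws(ch: str) -> bool:
    """True if the single character ch matches the WS production."""
    return ord(ch) in WS_CODE_POINTS

def is_ncchar_extra(ch: str) -> bool:
    """True if ch matches an NCCHAR alternative other than NCCHAR1 (not covered here)."""
    cp = ord(ch)
    return (
        ch == "-"
        or 0x30 <= cp <= 0x39          # [0-9]
        or cp == 0x00B7                # #x00B7
        or 0x0300 <= cp <= 0x036F      # [#x0300-#x036F]
        or 0x203F <= cp <= 0x2040      # [#x203F-#x2040]
    )
```

Both checks operate on decoded characters, so they give the same answer
whatever encoding the query arrived in.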

> >> to:
> >> [[
> >> A SPARQL query is a string (c.f. section 6.1 String concepts of
> >> [CHARMOD]) in the language defined by the following grammar, starting
> >> with the Query production.  The EBNF format is the same as that used in
> >> the XML 1.1 specification[XML11]. Numeric references,
> >> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
> >> Unicode codepoint. Please see the "Notation" section of that
> >> specification for specific information about the notation.
> >> ]]
> >>
> >> This says that the grammar is read as Unicode codepoints (editorial)
> >> and says that SPARQL Queries are independent of encoding (substantive).
> >>
> >> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de
> >
> >
> >
> >SELECT ?p
> > WHERE { ?s ?p ?o }
> >
> >
> 

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Monday, 21 November 2005 23:35:45 UTC