- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Mon, 21 Nov 2005 18:35:39 -0500
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: public-i18n-core@w3.org, public-rdf-dawg@w3.org
- Message-ID: <20051121233539.GF17026@w3.org>
On Thu, Nov 03, 2005 at 05:37:56PM +0900, Martin Duerst wrote:
>
> At 21:46 05/10/27, Eric Prud'hommeaux wrote:
> >I'm involving the I18N folks in this 'cause they must have an opinion.
> >
> >Summary for I18N folks:
> >  1. SPARQL has a grammar that's specified in terms of the XML's "EBNF
> >     format".
> >  2. SPARQL says that a SPARQL Query is a unicode string that follows
> >     the grammar.
> >  3. SPARQL has a media-type registration (on deck) with no charset
> >     parameter. UTF-8 is hardcoded as the only way to express SPARQL
> >     queries in that media type.
>
> I like this, especially the last point.

too bad no one uses it...

> >Below, I propose text that makes it more clear that we are using
> >unicode codepoints in our grammar.
> >
> >Is it better to say that the grammar is specified in Unicode
> >codepoints than to say that the language is a Unicode string? For
> >instance, I've attached some text
> >  SELECT ?p
> >   WHERE { ?s ?p ?o }
> >
> >in a shift-jis attachment. This is how my Japanese cell phone sends
> >text. Is it a SPARQL query? It's written in an encoding that is not
> >defined in terms of unicode, but does map to unicode (trivially, in
> >fact, for the ascii subset). My thesis is that it is better to say
> >that the grammar is Unicode than that all expressions of the language
> >are in Unicode.
>
> Why the Shift_JIS example? Didn't you say that all queries are
> in UTF-8? Or is it only the queries that are sent over the net
> with the mime type you define? This would definitely be most
> important for interoperability. One could imagine other encodings
> e.g. for queries that get passed through some API that somehow
> knows the encoding.

I chatted with Martin and explained that the SPARQL Query Language
does not require a specific encoding, and he explained to me that XML
is still the state of the art for charmod compliance.
XML uses ISO 10646
  http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
[[
Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646.
]]
which is interchangeable with Unicode. The Shift_JIS example was to see
if the language could include non-unicode charsets that intersected
with Unicode for at least all the characters used in a given query (my
Shift_JIS example used only [A-Za-z\?\{\}\.]). I gather that the answer
is "no"; that the only way I can know where they intersect is if I use
a Shift_JIS that's defined in terms of Unicode.

> As for terminology, saying that the language is a Unicode string
> is definitely wrong. The language is not a single string, but
> a set of strings (often called 'words' in formal language theory).
> You fixed that above when you said 'all expressions of the language
> are in Unicode'.
>
> For what to write in the spec itself, I suggest you have a good
> look at http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets.
> That section, as far as I know, has passed the test of time.
>
>
> >From the candidate media type registration [REG]
> >[[
> >Encoding considerations:
> >    The syntax of the SPARQL Query Language is expressed over code
> >    points in Unicode[UNICODE 3.0]. The encoding is always UTF-8.
> >]]
>
> Please follow Felix's suggestions for how to cite Unicode.

done -- http://www.w3.org/mid/20051119202228.GB17026@w3.org

> >Is it a good idea to have a conservative media type?
>
> What do you mean by 'conservative' here?

"conservative" = Allowing only one encoding instead of having a
default encoding and an optional charset parameter. I guess that's
fine. This will encourage APIs and other emergent protocols to use
utf-8, which will simplify life for implementors and users.

> >The protocol
> >document [PROT] includes two "bindings" (a WSDL term) and says that
> >both use UTF-8 for their encoding.
> >When the input comes from a SOAP
> >request, it can rely upon (but does not currently dictate) RFC3023
> >"XML Media Types" for media type declaration. As the input is not
> >defined in terms of the media type, I don't think any text would
> >have to change even if the media type allowed alternate encodings.
> >
> >[REG] http://www.w3.org/2001/sw/DataAccess/rq23/#mediaType
> >[PROT] http://www.w3.org/TR/rdf-sparql-protocol/
>
> I'm not sure I understand this. Reading the protocol document
> requires a lot of knowledge about WSDL. I have only looked
> at the examples. With respect to the HTTP examples, I think
> it is very important to not just use "EncodedQuery", because
> it is crucial for interoperability that implementations get
> this encoding correct, both with respect to what characters
> to escape and with respect to how to treat non-ASCII characters
> (of course, the right thing is to first use UTF-8, and then
> %HH encoding, so that this is compatible with the IRI spec,
> but this has to be specified (unless it follows from the
> WSDL bindings, which I hope, but in which case it should
> nevertheless be mentioned and used in a few examples
> explicitly)).

This is my current Protocol doc todo (to address) list:
  1 Describe and cite the mechanics to create an EncodedQuery.
  2 Propose an example, probably
    http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/kanji-02.rq

> >On Tue, Oct 25, 2005 at 01:32:07PM -0400, Eric Prud'hommeaux wrote:
> >> Björn commented [CMNT] that productions like:
> >>   NCCHAR ::= NCCHAR1 | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
> >> and even
> >>   WS ::= #x20 | #x9 | #xD | #xA
> >> need to specify a codepoint convention for those numbers to mean
> >> anything.
> >>
> >> We've since visited this text, but in the interest of clarity, I am
> >> considering changing our current text from:
> >> [[
> >> A SPARQL query string is a Unicode character string (c.f. section 6.1
> >> String concepts of [CHARMOD]) in the language defined by the following
> >> grammar, starting with the Query production. The EBNF format is the
> >> same as that used in the XML 1.1 specification[XML11]. Please see the
> >> "Notation" section of that specification for specific information about
> >> the notation.
> >> ]]
>
> I don't know how the text looked before, but the above text is
> perfectly fine, because XML 1.1 explicitly links the #x notation
> to ISO 10646 (which is equivalent codepoint-by-codepoint with
> Unicode). Of course, there is no problem with explaining part
> of the XML 1.1 notation, but in that case, it should be made
> clear that this is just an explanation for convenience, not the
> real thing.

excellent. this is already done.

> >> to:
> >> [[
> >> A SPARQL query is a string (c.f. section 6.1 String concepts of
> >> [CHARMOD]) in the language defined by the following grammar, starting
> >> with the Query production. The EBNF format is the same as that used in
> >> the XML 1.1 specification[XML11]. Numeric references,
> >> e.g. <code>#x27</code> or <code>#xD7FF</code>, identify characters by
> >> unicode codepoint. Please see the "Notation" section of that
> >> specification for specific information about the notation.
> >> ]]
> >>
> >> This says that the grammar is read as unicode codepoints (editorial)
> >> and says that SPARQL Queries are independent of encoding (substantive).
> >>
> >> [CMNT] http://www.w3.org/mid/43046b29.399234875@smtp.bjoern.hoehrmann.de
> >
> >SELECT ?p
> > WHERE { ?s ?p ?o }

--
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Monday, 21 November 2005 23:35:45 UTC
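
For readers following the EncodedQuery item in the thread above, here is a minimal sketch of the "first UTF-8, then %HH" rule Martin describes. It is an illustration only, not text from the protocol document: the function name `encode_query` and the kanji query string are hypothetical (the latter is not the content of kanji-02.rq), and the choice to escape everything outside the RFC 3986 unreserved set is an assumption.

```python
from urllib.parse import quote, unquote


def encode_query(query: str) -> str:
    """Percent-encode a SPARQL query for use as a URI query-string value.

    Assumption: the query is serialized as UTF-8 first, and every byte
    outside the unreserved set (letters, digits, "-", ".", "_", "~")
    is then written as %HH.
    """
    return quote(query, safe="", encoding="utf-8")


# The ASCII-only query from the thread.
ascii_query = "SELECT ?p WHERE { ?s ?p ?o }"

# Hypothetical query with a non-ASCII literal, for illustration only.
kanji_query = 'SELECT ?s WHERE { ?s ?p "食べる" }'

print(encode_query(ascii_query))
print(encode_query(kanji_query))

# Decoding with UTF-8 recovers the original Unicode string.
assert unquote(encode_query(kanji_query), encoding="utf-8") == kanji_query
```

Each non-ASCII character becomes several %HH triplets, one per UTF-8 byte, which is what keeps the result compatible with the IRI-to-URI mapping mentioned in the thread.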