- From: Gavin Carothers <gavin@carothers.name>
- Date: Tue, 22 Nov 2011 13:26:12 -0800
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- Cc: Richard Cyganiak <richard@cyganiak.de>, RDF-WG <public-rdf-wg@w3.org>
On Tue, Nov 22, 2011 at 1:04 PM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
> Richard,
>
> With a goal of maximising compatibility between Turtle and SPARQL,
> maximising compatibility from both heritages is important.
>
> SPARQL 1.0 allows \u in prefix names (and in fact uniformly)
>
> SPARQL is already changing to accommodate Turtle in a major way for
> implementers
>
> Turtle can make a smaller change to accommodate SPARQL.
> (smaller because it does not change the design of a Turtle parser as it does
> to a SPARQL one)
>
> I'm open to adding %-support to prefix names (Turtle and SPARQL) but that is
> really a separate issue.
>
> Andy
>
> More inline - some of your examples are about %-encoding in prefixed names
> and not about unicode escapes.
>
> On 22/11/11 19:48, Richard Cyganiak wrote:
> ...
>>>
>>> If you have them at all anywhere, having them where the real
>>> characters are already legal seems consistent.
>>
>> 1. Turtle as existing and as implemented disagrees, and that
>> outweighs the consistency argument in my mind.
>
> If this were the argument, then SPARQL (a standard) existing practice would
> be the better choice!
>
>> 2. The proposal allows escaping of *all* characters,
>
> in strings, IRIs and prefixed names.
>
>> but you don't
>> seem to be arguing that unicode escapes should be allowed in the
>> strings “@prefix” or “@base” or in the rdf:type shortcut “a”. So it's
>> not even that consistent.
>
> I suggested some time ago that Turtle adopt the current SPARQL approach of
> putting unicode escape processing into the input stream.
>

With both my implementer and editor hats on, I support using the SPARQL
model for input stream processing of escapes for Turtle. The only issues I
see are for N-Triples, which may resolve escaping another way.
There are also some funky issues with error reporting (as the parser won't
see the escape sequence, only the character it produced).

> That got entangled with the proposal to expand the range of characters that
> can go in a prefix name.
>

Yes, and it shouldn't have been. They can easily be resolved separately.
Likely easier to resolve separately.

> That distorted the discussion - the input stream processing approach is
> content-neutral (yes - it's more consistent).
>
> Now we have two different discussions interacting and holding everything up.
>
>> ... querying DBpedia ..
>
> SPARQL right?
>
> What parts of what specs are you invoking here?
>
>> <http://dbpedia.org/resource/Éire> – Doesn't work :-(
>
> Looks like an IRI to me.
>
> RFC 3987:
>
> iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
>
> ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>         / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>         / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>         / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>         / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>         / %xD0000-DFFFD / %xE1000-EFFFD
>
> SPARQL:
> [70] IRI_REF ::= '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
>
> Both include 00C9
>
> For the reader
> Code point C9 is É
> %C3%89 is the URI encoding of the UTF-8 bytes for É
>
> (the fact the *URI* will be %C3%89 is due to RFC 3986/7)
>
>> <http://dbpedia.org/resource/\u00C9ire> – Doesn't work :-(
>
> That's legal in SPARQL 1.0.
> In fact it's defined to be exactly
> <http://dbpedia.org/resource/Éire>
>
>> <http://dbpedia.org/resource/%C3%89ire> – Works!
>
> Encoding and escaping are different.
>
> My conclusion: DBpedia has internally stored the data in %-encoding form.
> Unicode escapes are a different issue.
>
> RFC 3986 makes it complicated because *some* encodings can be reversed
> (optionally) and some must not. %20 is not a space. it's %-2-0
>
> See
> 2.4. When to Encode or Decode
>
> "Once produced, a URI is always in its percent-encoded form.
> ...
> The only exception is for
> percent-encoded octets corresponding to characters in the unreserved
> set, which can be decoded at any time.
> "
>
> (the "can be" is actually a nuisance because some systems do and some don't)
>
> There is also a debate as to whether RDF is "producing" URIs:
>
> """
> Under normal circumstances, the only time when octets within a URI
> are percent-encoded is during the process of producing the URI from
> its component parts.
> """
>
>
>
>> // Strange… So what about prefixed names?
>>
>> dbpedia:%C3%89ire – Doesn't work :-(
>
> Encoding and %xx issue.
>
> Should we add %xx to prefix local names?
>
>> dbpedia:Éire – Doesn't work :-(
>
> SPARQL 1.0:
>
> [100] PN_LOCAL ::= PN_CHARS_U ...
> [96] PN_CHARS_U ::= PN_CHARS_BASE | '_'
> [95] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | ...
>
> Turtle:
>
> [100s] <PN_LOCAL> ::=
>     ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
> [96s] <PN_CHARS_U> ::=
>     PN_CHARS_BASE | "_"
>
> [95s] <PN_CHARS_BASE> ::=
>     [A-Z] | [a-z] | [#00C0-#00D6] | [#00D8-#00F6] | ....
>
>
> Turtle submission:
>
> [27] qname ::= prefixName? ':' name?
> [32] name ::= nameStartChar nameChar*
> [30] nameStartChar ::= [A-Z] | "_" | [a-z] | [#x00C0-#x00D6] | ...
>
> Looks legal to me.
>
>> dbpedia:\u00C9ire – Doesn't work :-(
>
> Legal SPARQL 1.0
>
>> dbpedia:\u00C3\u0089ire – Doesn't work :-(
>
> Correct \u00C3 is % - not legal.
>
> Confuses UTF-8 and codepoint: \u is a codepoint.
>
>>
>> // Oh well, back to IRIs I guess.
>>
>> Now the proposal adds to that mess by adding *another* way of writing
>> things differently with *no* increase in expressivity. (The results
>> for all the cases above are unaffected by the proposal – the DBpedia
>> IRI simply cannot be written as a prefixed name.)
>
> I was careful not including the change in expressivity in order to seek a
> compromise.
>
> I'm trying to remove a block of SPARQL publishing because currently there is
> a WG note in the spec and, because it's "either-or" it can't be handled so
> easily as an at-risk feature in CR.
>
> (You make an excellent argument for the SPARQL approach. Leaves all current
> valid Turtle data as valid.)
>
>> As it stands, none of the following IRIs can be written as prefixed
>> names – they all have to be written as full IRIs:
>>
>> 1.<%C3%89ire>
>
> This isn't about encoding.
>
>> 2.<search?q=eire>
>> 3.<Galway,_Ireland>
>> 4.<Éire> if you don't know how to type É but know that you can use \u00C9
>> instead
>
> Aside from the fact it's relative, why not?
>
>> 5.<U.S.>
>
> What have trailing dots got to do with unicode escapes?
>
> [99s] <PN_PREFIX> ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
>
>> 6.<United%20Kingdom>
>
> use of % - not about unicode escapes.
>
> My suggestion is not expanding the range of characters that are, or are not,
> allowed in a prefix name but I'm open to adding %xx.
>
>> The proposal adds a whole bunch of complexity to the story that one
>> needs to tell to explain how the hell prefixed names work, and what
>> we get in return is a solution for the case that matters least –
>> number 4 – while all the others still don't work and require falling
>> back to full IRIs.
>
> What about compatibility?
>
>> Escaping in IRIs and literals is necessary for backwards
>> compatibility and for Oracle's ASCII-Triples. Adding escaping to
>> prefixed names is *not* necessary as there is already a way of
>> escaping them: expand to a full IRI and use unicode escapes there.
>>
>> Best, Richard
>
>
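
For anyone following along, here is a rough Python sketch of the two things being conflated in the examples above: unicode escapes (which name a code point) and %-encoding (which spells out UTF-8 bytes). It also sketches what an input-stream pre-pass for \u escapes might look like. The helper name `expand_unicode_escapes` and the regular expression are my own assumptions for illustration; they are not taken from any spec text or from an existing parser.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical pre-pass (an assumption, not spec text): expand \uXXXX and
# \UXXXXXXXX before any tokenizing, so the grammar only ever sees characters.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(text):
    """Replace each unicode escape with the single code point it names."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(replace, text)


# An escape names a code point: \u00C9 is U+00C9, i.e. É.
iri = expand_unicode_escapes(r"<http://dbpedia.org/resource/\u00C9ire>")

# %-encoding spells out the UTF-8 bytes of É, which are C3 89.
utf8_bytes = "É".encode("utf-8")
encoded = quote("Éire")
decoded = unquote("%C3%89ire")

# \u00C3\u0089 is two code points (U+00C3, U+0089), not the UTF-8 bytes of É.
two_code_points = expand_unicode_escapes(r"\u00C3\u0089ire")

# Error reporting: after the pre-pass the tokenizer sees a plain space here,
# with no trace of the escape that produced it.
with_space = expand_unicode_escapes(r"dbpedia:\u0020x")
```

Because the pre-pass runs before tokenizing, it is content-neutral: the same expansion applies in strings, IRIs and prefixed names. The cost is the one noted above, namely that a later error message can only point at the produced character.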
Received on Tuesday, 22 November 2011 21:26:49 UTC