Re: unicode escapes in prefix names from Andy Seaborne on 2011-11-22 (public-rdf-wg@w3.org from November 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Tue, 22 Nov 2011 21:04:42 +0000
To: Richard Cyganiak <richard@cyganiak.de>
CC: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <4ECC0E6A.8070902@epimorphics.com>

Richard,

With a goal of maximising compatibility between Turtle and SPARQL, 
maximising compatibility from both heritages is important.

SPARQL 1.0 allows \u escapes in prefix names (and in fact uniformly).

SPARQL is already changing to accommodate Turtle in a major way for 
implementers.

Turtle can make a smaller change to accommodate SPARQL.
(smaller because it does not change the design of a Turtle parser as it 
does to a SPARQL one)

I'm open to adding %-support to prefix names (Turtle and SPARQL) but 
that is really a separate issue.

 Andy

More inline - some of your examples are about %-encoding in prefixed 
names and not about unicode escapes.

On 22/11/11 19:48, Richard Cyganiak wrote:
...
>> If you have them at all anywhere, having them where the real
>> characters are already legal seems consistent.
>
> 1. Turtle as existing and as implemented disagrees, and that
> outweighs the consistency argument in my mind.

If this were the argument, then the existing practice of SPARQL (a 
standard) would be the better choice!

> 2. The proposal allows escaping of *all* characters,

in strings, IRIs and prefixed names.

> but you don't
> seem to be arguing that unicode escapes should be allowed in the
> strings “@prefix” or “@base” or in the rdf:type shortcut “a”. So it's
> not even that consistent.

I suggested some time ago that Turtle adopt the current SPARQL approach 
of putting unicode escape processing into the input stream.

That got entangled with the proposal to expand the range of characters 
that can go in a prefix name.

That distorted the discussion - the input stream processing approach is 
content-neutral (yes - it's more consistent).
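
A minimal sketch of what input-stream escape processing could look like, assuming the two SPARQL 1.0 codepoint escape forms (\uXXXX and \UXXXXXXXX). The name `unescape_stream` is hypothetical and not taken from any spec or implementation. The point is that the pass runs before tokenizing and does not care which token an escape lands in:

```python
import re

# Matches the two codepoint escape forms: \uXXXX and \UXXXXXXXX.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_stream(text: str) -> str:
    """Replace codepoint escapes with the characters they denote.

    Hypothetical helper: it runs over the raw input before any tokenizing,
    so it treats IRIs, strings, prefixed names and keywords alike.
    """
    def repl(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(repl, text)


print(unescape_stream(r"<http://dbpedia.org/resource/\u00C9ire>"))
print(unescape_stream(r"dbpedia:\u00C9ire"))
print(unescape_stream(r"@pre\u0066ix"))
```

Because the replacement happens on the character stream, the parser proper never sees an escape at all; that is what makes the approach uniform.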

Now we have two different discussions interacting and holding everything up.

 > ... querying DBpedia ..

SPARQL right?

What parts of what specs are you invoking here?

> <http://dbpedia.org/resource/Éire>       – Doesn't work :-(

Looks like an IRI to me.

RFC 3987:

  iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                   / %xD0000-DFFFD / %xE1000-EFFFD

SPARQL:
[70] IRI_REF ::= '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'

Both include 00C9
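
As a rough check of that claim, here is a sketch that transcribes the ucschar ranges quoted above and the SPARQL IRI_REF production into Python. The names `is_ucschar` and `IRI_REF` are just for illustration, and the regex is my transcription of rule [70], not something from a spec or library:

```python
import re

# ucschar ranges as quoted from RFC 3987 above.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
UCSCHAR_RANGES += [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """True if the single character ch falls in one of the ucschar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


# Transcription of: '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')

print(is_ucschar("\u00C9"))
print(bool(IRI_REF.fullmatch("<http://dbpedia.org/resource/\u00C9ire>")))
```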

For the reader:
Code point C9 is É.
%C3%89 is the URI encoding of the UTF-8 bytes for É.

(the fact the *URI* will be %C3%89 is due to RFC 3986/7)

> <http://dbpedia.org/resource/\u00C9ire>  – Doesn't work :-(

That's legal in SPARQL 1.0.
In fact it's defined to be exactly
<http://dbpedia.org/resource/Éire>

> <http://dbpedia.org/resource/%C3%89ire>  – Works!

Encoding and escaping are different.

My conclusion: DBpedia has internally stored the data in %-encoding 
form.  Unicode escapes are a different issue.
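
To make the distinction concrete, a small sketch using Python's standard `urllib.parse`. Python happens to resolve \u escapes in string literals itself, which stands in here for what a SPARQL 1.0 parser does; the variable names are only for illustration. RDF compares IRIs character by character, so the %-encoded form is a different IRI:

```python
from urllib.parse import quote, unquote

BASE = "http://dbpedia.org/resource/"

literal = BASE + "Éire"          # the character typed directly
escaped = BASE + "\u00C9ire"     # unicode escape: resolved to the same character
encoded = BASE + quote("Éire")   # %-encoding of the UTF-8 bytes

print(escaped == literal)            # escaping: same string
print(encoded)                       # ends in %C3%89ire
print(encoded == literal)            # encoding: a different string
print(unquote(encoded) == literal)   # only equal after explicit decoding
```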

RFC 3986 makes it complicated because *some* encodings can be reversed 
(optionally) and some must not.  %20 is not a space; it's %-2-0.

See
2.4.  When to Encode or Decode

"Once produced, a URI is always in its percent-encoded form.
...
The only exception is for
    percent-encoded octets corresponding to characters in the unreserved
    set, which can be decoded at any time.
"

(the "can be" is actually a nuisance because some systems do and some don't)
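
A sketch of the one decoding RFC 3986 permits at any time, assuming the unreserved set ALPHA / DIGIT / "-" / "." / "_" / "~". The function name `decode_unreserved` is hypothetical. Everything else, %20 included, is left exactly as written:

```python
import re
import string

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only those %XX octets that denote unreserved characters."""
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Anything outside the unreserved set keeps its %XX form.
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT.sub(repl, uri)


print(decode_unreserved("%7Euser"))
print(decode_unreserved("United%20Kingdom"))
print(decode_unreserved("%C3%89ire"))
```

A system that applies this step and one that does not will disagree on whether `%7Euser` and `~user` are the same string, which is the nuisance.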

There is also a debate as to whether RDF is "producing" URIs:

"""
    Under normal circumstances, the only time when octets within a URI
    are percent-encoded is during the process of producing the URI from
    its component parts.
"""



> // Strange… So what about prefixed names?
>
> dbpedia:%C3%89ire       – Doesn't work :-(

Encoding and %xx issue.

Should we add %xx to prefix local names?

 > dbpedia:Éire            – Doesn't work :-(

SPARQL 1.0:

[100] PN_LOCAL ::= PN_CHARS_U ...
[96]  PN_CHARS_U ::= PN_CHARS_BASE | '_'
[95]  PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | ...

Turtle:

[100s] <PN_LOCAL>  ::=
          ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
[96s] <PN_CHARS_U> ::=
       PN_CHARS_BASE | "_"

[95s] <PN_CHARS_BASE> ::=
     [A-Z] | [a-z] | [#00C0-#00D6] | [#00D8-#00F6] | ....


Turtle submission:

[27] qname  ::= prefixName? ':' name?
[32] name  ::= nameStartChar nameChar*
[30] nameStartChar ::= [A-Z] | "_" | [a-z] | [#x00C0-#x00D6] | ...

Looks legal to me.

 > dbpedia:\u00C9ire       – Doesn't work :-(

Legal SPARQL 1.0

> dbpedia:\u00C3\u0089ire – Doesn't work :-(

Correct that it doesn't work: \u00C3 is the code point for Ã, not a 
%-encoded byte, and \u0089 is a control character - not legal.

Confuses UTF-8 and codepoint: \u is a codepoint.
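
A short illustration of the confusion, using only the Python standard library (variable names are for illustration): C3 and 89 are the two UTF-8 *bytes* of É, whereas \u00C3 and \u0089 name two separate *code points*.

```python
import unicodedata

e_acute = "\u00C9"                 # one code point: É
utf8_bytes = e_acute.encode("utf-8")
confused = "\u00C3\u0089"          # two code points, not the UTF-8 bytes of É

print(utf8_bytes.hex())                    # the byte values c3 and 89
print(len(confused))                       # two characters
print(unicodedata.name(confused[0]))       # the first one is A with tilde
print(unicodedata.category(confused[1]))   # the second one is a control character
print(confused == e_acute)

# Reinterpreting those two code points as bytes and decoding as UTF-8
# recovers É, which shows where the confusion comes from.
print(confused.encode("latin-1").decode("utf-8") == e_acute)
```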

>
> // Oh well, back to IRIs I guess.
>
> Now the proposal adds to that mess by adding *another* way of writing
> things differently with *no* increase in expressivity. (The results
> for all the cases above are unaffected by the proposal – the DBpedia
> IRI simply cannot be written as a prefixed name.)

I was careful not to include the change in expressivity in order to seek 
a compromise.

I'm trying to remove a blocker to SPARQL publishing: currently there is 
a WG note in the spec and, because it's "either-or", it can't be 
handled as easily as an at-risk feature in CR.

(You make an excellent argument for the SPARQL approach.  Leaves all 
current valid Turtle data as valid.)

> As it stands, none of the following IRIs can be written as prefixed
> names – they all have to be written as full IRIs:
>
> 1.<%C3%89ire>

This isn't about encoding.

 > 2.<search?q=eire>
 > 3.<Galway,_Ireland>
 > 4.<Éire>  if you don't know how to type É but know that you can use
 > \u00C9 instead

Aside from the fact it's relative, why not?

> 5.<U.S.>

What have trailing dots got to do with unicode escapes?

[99s] <PN_PREFIX>  ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?

 > 6.<United%20Kingdom>

use of % - not about unicode escapes.

My suggestion does not expand the range of characters that are, or are 
not, allowed in a prefix name, but I'm open to adding %xx.

> The proposal adds a whole bunch of complexity to the story that one
> needs to tell to explain how the hell prefixed names work, and what
> we get in return is a solution for the case that matters least –
> number 4 – while all the others still don't work and require falling
> back to full IRIs.

What about compatibility?

> Escaping in IRIs and literals is necessary for backwards
> compatibility and for Oracle's ASCII-Triples. Adding escaping to
> prefixed names is *not* necessary as there is already a way of
> escaping them: expand to a full IRI and use unicode escapes there.
>
> Best, Richard
Received on Tuesday, 22 November 2011 21:05:25 UTC