Re: unicode escapes in prefix names

On Tue, Nov 22, 2011 at 1:04 PM, Andy Seaborne
<andy.seaborne@epimorphics.com> wrote:
> Richard,
>
> With a goal of maximising compatibility between Turtle and SPARQL,
> maximising compatibility from both heritages is important.
>
> SPARQL 1.0 allows \u in prefix names (and in fact uniformly)
>
> SPARQL is already changing to accommodate Turtle in a major way for
> implementers
>
> Turtle can make a smaller change to accommodate SPARQL.
> (smaller because it does not change the design of a Turtle parser as it does
> to a SPARQL one)
>
> I'm open to adding %-support to prefix names (Turtle and SPARQL) but that is
> really a separate issue.
>
>        Andy
>
> More inline - some of your examples are about %-encoding in prefixed names
> and not about unicode escapes.
>
> On 22/11/11 19:48, Richard Cyganiak wrote:
> ...
>>>
>>> If you have them at all anywhere, having them where the real
>>> characters are already legal seems consistent.
>>
>> 1. Turtle as existing and as implemented disagrees, and that
>> outweighs the consistency argument in my mind.
>
> If this were the argument, then SPARQL (a standard) existing practice would
> be the better choice!
>
>> 2. The proposal allows escaping of *all* characters,
>
> in strings, IRIs and prefixed names.
>
>> but you don't
>> seem to be arguing that unicode escapes should be allowed in the
>> strings “@prefix” or “@base” or in the rdf:type shortcut “a”. So it's
>> not even that consistent.
>
> I suggested some time ago that Turtle adopt the current SPARQL approach of
> putting unicode escape processing into the input stream.
>

With both my implementer and editor hats on, I support using the SPARQL
model of input-stream processing of escapes for Turtle. The only
issues I see are for N-Triples, which may resolve escaping another
way. There are also some funky issues with error reporting (the
parser won't see the escape sequence, only the character it
produced).
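
As a rough illustration of the input-stream model (a sketch only, not
taken from either spec; the name unescape_stream is made up here, and
it assumes only the \uXXXX and \UXXXXXXXX forms are handled):

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_stream(text: str) -> str:
    """Replace unicode escapes anywhere in the input, before tokenizing.

    Hypothetical helper: the tokenizer that runs afterwards only ever
    sees the resulting characters, never the escape sequences.
    """
    def repl(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(repl, text)


# The escape is resolved the same way wherever it appears.
print(unescape_stream(r"dbpedia:\u00C9ire"))
print(unescape_stream(r"<http://dbpedia.org/resource/\u00C9ire>"))
print(unescape_stream(r"\u0040prefix"))
```

Because the substitution happens before tokenizing, it is
content-neutral: it applies to keywords as much as to IRIs and prefixed
names. It also shows the error-reporting problem, since the original
character offsets are gone unless the pre-processor records them.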

> That got entangled with the proposal to expand the range of characters that
> can go in a prefix name.
>

Yes, and it shouldn't have been. The two issues can be resolved
separately, and are likely easier to resolve that way.

> That distorted the discussion - the input stream processing approach is
> content-neutral (yes - it's more consistent).
>
> Now we have two different discussions interacting and holding everything up.
>
>> ... querying DBpedia ..
>
> SPARQL right?
>
> What parts of what specs are you invoking here?
>
>> <http://dbpedia.org/resource/Éire>       – Doesn't work :-(
>
> Looks like an IRI to me.
>
> RFC 3987:
>
>  iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
>
>   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                  / %xD0000-DFFFD / %xE1000-EFFFD
>
> SPARQL:
> [70] IRI_REF ::= '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
>
> Both include 00C9
>
> For the reader
> Code point C9 is É
> %C3%89 is the URI encoding of the UTF-8 bytes for É
>
> (the fact the *URI* will be %C3%89 is due to RFC 3986/7)
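
For anyone who wants to see the three representations side by side,
a small sketch using Python's standard library (the variable names
are just for illustration):

```python
from urllib.parse import quote, unquote

ch = "\u00c9"                    # the character É, written via its code point

code_point = ord(ch)             # the Unicode code point
utf8_bytes = ch.encode("utf-8")  # the UTF-8 byte sequence
pct_encoded = quote(ch)          # %-encoding of those UTF-8 bytes

print(hex(code_point))
print(utf8_bytes)
print(pct_encoded)
print(unquote(pct_encoded) == ch)
```

The code point is one number (C9), the UTF-8 form is two bytes
(C3 89), and the %-encoded form spells out those two bytes.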
>
>> <http://dbpedia.org/resource/\u00C9ire>  – Doesn't work :-(
>
> That's legal in SPARQL 1.0.
> In fact it's defined to be exactly
> <http://dbpedia.org/resource/Éire>
>
>> <http://dbpedia.org/resource/%C3%89ire>  – Works!
>
> Encoding and escaping are different.
>
> My conclusion: DBpedia has internally stored the data in %-encoding form.
>  Unicode escapes are a different issue.
>
> RFC 3986 makes it complicated because *some* encodings can be reversed
> (optionally) and some must not.  %20 is not a space; it's %-2-0.
>
> See
> 2.4.  When to Encode or Decode
>
> "Once produced, a URI is always in its percent-encoded form.
> ...
> The only exception is for
>   percent-encoded octets corresponding to characters in the unreserved
>   set, which can be decoded at any time.
> "
>
> (the "can be" is actually a nuisance because some systems do and some don't)
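
A sketch of what "decode only the unreserved set" could look like,
assuming the RFC 3986 unreserved characters (ALPHA / DIGIT / "-" / "."
/ "_" / "~"); the function name decode_unreserved is hypothetical:

```python
import re
import string

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX only when the octet is an unreserved ASCII character."""
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Leave every other percent-encoded octet exactly as written.
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT.sub(repl, uri)


print(decode_unreserved("%7Euser"))           # "~" is unreserved
print(decode_unreserved("United%20Kingdom"))  # space is not unreserved
print(decode_unreserved("%C3%89ire"))         # non-ASCII octets stay encoded
```

A system that applies this step and one that doesn't will disagree on
whether two such IRIs are the same string, which is the nuisance
mentioned above.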
>
> There is also a debate as to whether RDF is "producing" URIs:
>
> """
>   Under normal circumstances, the only time when octets within a URI
>   are percent-encoded is during the process of producing the URI from
>   its component parts.
> """
>
>
>
>> // Strange… So what about prefixed names?
>>
>> dbpedia:%C3%89ire       – Doesn't work :-(
>
> Encoding and %xx issue.
>
> Should we add %xx to prefix local names?
>
>> dbpedia:Éire            – Doesn't work :-(
>
> SPARQL 1.0:
>
> [100] PN_LOCAL ::= PN_CHARS_U ...
> [96]  PN_CHARS_U ::= PN_CHARS_BASE | '_'
> [95]  PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | ...
>
> Turtle:
>
> [100s] <PN_LOCAL>  ::=
>         ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
> [96s] <PN_CHARS_U> ::=
>        PN_CHARS_BASE | "_"
>
> [95s] <PN_CHARS_BASE> ::=
>        [A-Z] | [a-z] | [#00C0-#00D6] | [#00D8-#00F6] | ....
>
>
> Turtle submission:
>
> [27]    qname   ::= prefixName? ':' name?
> [32]    name    ::= nameStartChar nameChar*
> [30]    nameStartChar ::= [A-Z] | "_" | [a-z] | [#x00C0-#x00D6] | ...
>
> Looks legal to me.
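
To make the "looks legal" check concrete, a sketch that tests a first
character against only the ranges quoted above. The grammars elide the
remaining ranges with "...", so this table is deliberately incomplete,
and the names QUOTED_RANGES and is_name_start are made up here:

```python
# Only the ranges quoted in this message; the grammars elide the rest
# with "...", so this is NOT the full PN_CHARS_BASE production.
QUOTED_RANGES = [
    (0x0041, 0x005A),  # [A-Z]
    (0x0061, 0x007A),  # [a-z]
    (0x00C0, 0x00D6),  # [#x00C0-#x00D6]
    (0x00D8, 0x00F6),  # [#x00D8-#x00F6]
]


def is_name_start(ch: str) -> bool:
    """True if ch is '_' or falls in one of the quoted ranges."""
    cp = ord(ch)
    return ch == "_" or any(lo <= cp <= hi for lo, hi in QUOTED_RANGES)


print(is_name_start("\u00c9"))  # É, code point C9
print(is_name_start("%"))       # the percent sign
```

É (C9) falls inside [#x00C0-#x00D6], while "%" is in none of the
quoted ranges, which matches the two results reported for
dbpedia:Éire and dbpedia:%C3%89ire being different issues.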
>
>> dbpedia:\u00C9ire       – Doesn't work :-(
>
> Legal SPARQL 1.0
>
>> dbpedia:\u00C3\u0089ire – Doesn't work :-(
>
> Correct: \u00C3 is Ã and \u0089 is a control character - not legal.
>
> Confuses UTF-8 and codepoint: \u is a codepoint.
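
The confusion is easy to reproduce. A small sketch (variable names
are illustrative only) comparing the two readings of C3 89:

```python
import unicodedata

# Reading C3 and 89 as two code points, which is what \u00C3\u0089 means.
as_code_points = "\u00C3\u0089"

# Reading C3 89 as UTF-8 bytes, which is what %C3%89 means.
as_utf8 = bytes([0xC3, 0x89]).decode("utf-8")

print(len(as_code_points), len(as_utf8))
print(as_code_points == as_utf8)
print([unicodedata.category(c) for c in as_code_points])
```

The \u form yields two characters, the second of which is in the
control category (Cc), while the byte form yields the single
character É.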
>
>>
>> // Oh well, back to IRIs I guess.
>>
>> Now the proposal adds to that mess by adding *another* way of writing
>> things differently with *no* increase in expressivity. (The results
>> for all the cases above are unaffected by the proposal – the DBpedia
>> IRI simply cannot be written as a prefixed name.)
>
> I was careful not to include the change in expressivity, in order to seek a
> compromise.
>
> I'm trying to remove a block on SPARQL publishing because currently there is
> a WG note in the spec and, because it's "either-or", it can't be handled as
> easily as an at-risk feature in CR.
>
> (You make an excellent argument for the SPARQL approach.  Leaves all current
> valid Turtle data as valid.)
>
>> As it stands, none of the following IRIs can be written as prefixed
>> names – they all have to be written as full IRIs:
>>
>> 1.<%C3%89ire>
>
> This isn't about encoding.
>
>> 2.<search?q=eire>
>> 3.<Galway,_Ireland>
>> 4.<Éire>  if you don't know how to type É but know that you can use \u00C9
>> instead
>
> Aside from the fact it's relative, why not?
>
>> 5.<U.S.>
>
> What have trailing dots got to do with unicode escapes?
>
> [99s] <PN_PREFIX>  ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
>
>> 6.<United%20Kingdom>
>
> use of % - not about unicode escapes.
>
> My suggestion is not expanding the range of characters that are, or are not,
> allowed in a prefix name but I'm open to adding %xx.
>
>> The proposal adds a whole bunch of complexity to the story that one
>> needs to tell to explain how the hell prefixed names work, and what
>> we get in return is a solution for the case that matters least –
>> number 4 – while all the others still don't work and require falling
>> back to full IRIs.
>
> What about compatibility?
>
>> Escaping in IRIs and literals is necessary for backwards
>> compatibility and for Oracle's ASCII-Triples. Adding escaping to
>> prefixed names is *not* necessary as there is already a way of
>> escaping them: expand to a full IRI and use unicode escapes there.
>>
>> Best, Richard
>
>

Received on Tuesday, 22 November 2011 21:26:49 UTC