Re: Colon symbol in URI?

On Fri, Aug 12, 2011 at 12:07 PM, Melanie Courtot <mcourtot@gmail.com> wrote:
> Hi,
>
> The part after the : is a QName, and the relevant spec is at [1]. It does forbid to use a digit as first character after the colon.
>
> While we were working on the OBO ID policy [2], Jonathan Rees (cc) mentioned there was a proposal to relax those constraints by using CURIEs [3] instead of QNames, but I didn't check it and don't know its status; he may be able to add more information.
>
> Cheers,
> Melanie
>
> [1] http://www.w3.org/TR/REC-xml-names/#NT-QName
> [2] http://www.obofoundry.org/id-policy.shtml
> [3] http://www.w3.org/2001/sw/BestPractices/HTML/2005-10-27-CURIE

Two questions here: whether a digit can immediately follow the
namespace prefix, and whether a second : can follow it. I'll take them
in order.

I think the theory of abbreviated URIs has played out as follows:

- The prefix : suffix pattern is generically called compact URI or
'curie', see http://www.w3.org/TR/curie/
- There are three instantiations of the Curie pattern in current specs:
    - XML QNames are the Curies of RDF/XML.  They require the suffix
      to be an 'NCName', which has to start with a letter or _
    - SPARQL and the newly revised Turtle draft specification
      http://www.w3.org/TR/turtle/ have more liberal Curies. Their
      Curies allow an empty suffix, a digit after the colon, and
      possibly other goodies.
    - RDFa also has Curies which I believe to be a superset of
      SPARQL/Turtle Curies.  I don't know the details but you should
      be able to find them in http://www.w3.org/TR/xhtml-rdfa/ (I just
      spent 2 minutes and failed though).
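
To make the digit question concrete, here is a minimal Python sketch of
the QName restriction. It assumes an ASCII-only approximation of the
NCName rule (the real production in the XML Namespaces spec also admits
many non-ASCII characters), and the function names are made up for
illustration; they do not come from any of the specs above.

```python
import re

# ASCII-only approximation of NCName: starts with a letter or '_',
# then letters, digits, '.', '-' or '_'. The real production is wider.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def split_curie(curie):
    """Split a prefix:suffix string at the FIRST colon."""
    prefix, sep, suffix = curie.partition(":")
    if not sep:
        raise ValueError("no colon in %r" % (curie,))
    return prefix, suffix


def suffix_ok_for_qname(curie):
    """True if the suffix could be the local part of an XML QName
    (under the ASCII-only approximation above)."""
    _, suffix = split_curie(curie)
    return _NCNAME_ASCII.match(suffix) is not None
```

Under these assumptions a suffix that starts with a digit, such as the
one in "GO:0008150", is rejected, and so is a suffix that contains a
second colon, such as the one in "a:b:c". The more liberal Turtle and
SPARQL rules are not modelled here.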

When Curies occur in RDF/XML, they do so as element names, not inside
strings. XML is not going to change, and RDF/XML, even if it does get
reissued, will still be tied to XML, so it's not going to change
either. If I said anything about syntax liberalization, it was
probably in reference to Turtle, which has indeed changed in the way I
expected.

But URI syntax, in particular whether you can put two colons in a URI
(i.e. RDF URI Reference and/or IRI), is not up to any of these
specifications. That is up to the URI RFCs: RFC 3986, which obsoletes
RFC 2396 (the one RFC 2616 refers to). These say that : is reserved
and has to be %-escaped wherever it would conflict with its use as a
delimiter. In practice I suspect this is not always done, and perhaps
the new IRI spec (in progress) will say something about that.

One could imagine Turtle or RDFa allowing : in their Curies, which
would then be %-escaped when the Curie is converted to a URI, but I
doubt this will be done.
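
For illustration only, such a %-escaping expansion could look like the
following Python sketch. The function name and the example.org
namespace are hypothetical; urllib.parse.quote is the standard-library
call that does the escaping.

```python
from urllib.parse import quote


def expand_with_escaping(namespace, suffix):
    """Append a Curie suffix to a namespace URI, %-escaping the suffix.

    safe="" tells quote() to escape every reserved character,
    including ':' and '/', which it would otherwise partly leave alone.
    """
    return namespace + quote(suffix, safe="")


# Hypothetical namespace; the second colon ends up as %3A in the URI.
uri = expand_with_escaping("http://example.org/ns/", "a:b")
```

Here uri is "http://example.org/ns/a%3Ab". No current Curie spec is
known to work this way; this only shows what the conversion would
involve.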

The subject of URI syntax is so complicated, the quoting rules so
impenetrable, and specification compliance so poor, that I recommend
using URIs in the most syntactically conservative way possible, so as
to stay out of trouble. I'd say eliminate the : one way or another in
the process of converting any internal name into a URI. This appears
to be what the Foundry approach does, but conversion of : to _ is
specific to Foundry identifiers; there's no reason to think
:-containing identifiers coming from another identifier space should
convert to _ as opposed to - or + or / or %3A. The RFCs suggest that
%-escaping would be the right way to put a second : in a URI, but
there is no reason to be uniform or slavish about this.

If a second colon occurs in a would-be URI, and the introduction of _
is being left up to Protege, then that's too late in the process. How
would Protege know to use the Foundry rule for converting :, as
opposed to (say) %-escaping, which might be correct for some different
identifier source? Get rid of the : before putting the id into Protege
in the first place.
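
A minimal Python sketch of that advice: choose the colon rule per
identifier source, and apply it before the id reaches any editing tool.
The rule names, the source labels and the example.org base URIs are all
assumptions made for illustration; only the : to _ convention for
Foundry identifiers comes from the discussion above.

```python
from urllib.parse import quote


def foundry_rule(local_id):
    # Foundry convention discussed above: the colon becomes an underscore.
    return local_id.replace(":", "_")


def escape_rule(local_id):
    # Alternative for some other identifier source: %-escape the colon.
    return quote(local_id, safe="")


# Hypothetical registry: each identifier source names its own rule.
RULES = {"foundry": foundry_rule, "other": escape_rule}


def to_uri(base, source, local_id):
    """Build a URI from an internal id using the rule for its source.

    An unknown source raises KeyError instead of guessing a rule.
    """
    return base + RULES[source](local_id)
```

For example, to_uri("http://example.org/obo/", "foundry", "GO:0008150")
gives "http://example.org/obo/GO_0008150", while the same call with
source "other" gives "http://example.org/obo/GO%3A0008150". The tool
downstream then never sees a second colon and never has to pick a rule.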

Best
Jonathan

Received on Friday, 12 August 2011 17:51:30 UTC