URI Reference conversion to URI

Jimmy Cerra (“James Cerra”) wrote to the Semantic-Web-Interest-Group 
list (<mailto:semantic-web@w3.org>) on 23 June 2005 in “URI Reference 
questions” (<mid:20050624035130.64868.qmail@web42201.mail.yahoo.com>, 
<http://www.w3.org/mid/20050624035130.64868.qmail@web42201.mail.yahoo.com>):

> URIRefs are always encoded in UTF-8 too.  Correct?

I’m not sure that I understand the intended question. Taking your 
phrasing literally, the answer is “no”. URIRefs are sequences of 
characters. Characters are abstractions that have many possible 
encodings. Thus URIRefs have many possible encodings.
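
To illustrate, here is a small Python sketch that encodes one character string in three ways. The sample string and the choice of encodings are mine, picked only for illustration.

```python
# One character string, three different octet sequences.
# The sample URIRef is a made-up example containing a non-ASCII character.
uriref = "data:,caf\u00e9"

as_utf8 = uriref.encode("utf-8")         # "é" becomes two octets: C3 A9
as_latin1 = uriref.encode("iso-8859-1")  # "é" becomes one octet: E9
as_utf16 = uriref.encode("utf-16-be")    # every character here takes two octets

for name, octets in [("UTF-8", as_utf8), ("ISO-8859-1", as_latin1), ("UTF-16BE", as_utf16)]:
    print(name, " ".join(f"{b:02X}" for b in octets))
```

The characters are the same in each case; only the octets differ.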

> Say we have the URIRef:
> 
> <data:,Hello, World>
> 
> Is that legal?

It is a legal URIRef.

> Would that be converted into the URI:
> 
> <data:,Hello%2C%20World>

No, that is not the URI that RDF’s rules produce. Let’s follow the 
procedure.

The first step is “encoding the Unicode string as UTF-8, giving a 
sequence of octet values.” The octet sequence (in hexadecimal notation) is:

<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>.
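
The first step can be reproduced with a few lines of Python. This is only a sketch of that step; the variable names are my own.

```python
# Step 1 only: encode the URIRef's characters as UTF-8 octets.
uriref = "data:,Hello, World"
octets = uriref.encode("utf-8")

# Render the octets in the same hexadecimal notation used in this message.
hex_octets = " ".join(f"{b:02X}" for b in octets)
print(f"<{hex_octets}>")
```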

The second step is “[percent]-escaping octets that do not correspond to 
permitted US-ASCII characters.” Furthermore, “The disallowed octets that 
must be [percent]-escaped include all those that do not correspond to 
US-ASCII characters, and the excluded characters listed in Section 2.4 
of [URI], except for the number sign (#), percent sign (%), and the 
square bracket characters re-allowed in [RFC-2732].”

The [URI] citation refers to RFC 2396. Section 2.4.3 of RFC 2396, 
“Excluded US-ASCII Characters”, gives the following grammar rules.

    control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
    space       = <US-ASCII coded character 20 hexadecimal>
    delims      = "<" | ">" | "#" | "%" | <">
    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Mapping those characters to their US-ASCII octets, leaving out the 
characters that RDF specifically re-allows, and sorting by ascending 
octet value, we have the following.

(00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 
12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 22, 3C, 3E, 
5C, 5E, 60, 7B, 7C, 7D, 7F)
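
The same list can be derived mechanically from the four grammar rules. The following Python sketch does so; the set names simply mirror the rule names and are otherwise my own choice.

```python
# The four "excluded" classes from RFC 2396, section 2.4.3, as octet values.
control = set(range(0x00, 0x20)) | {0x7F}
space = {0x20}
delims = {ord(c) for c in '<>#%"'}
unwise = {ord(c) for c in '{}|\\^[]`'}

# Characters that RDF re-allows: number sign, percent sign, square brackets.
reallowed = {ord(c) for c in "#%[]"}

disallowed_ascii = sorted((control | space | delims | unwise) - reallowed)
listing = ", ".join(f"{b:02X}" for b in disallowed_ascii)
print(f"({listing})")
```

This covers only the US-ASCII range. Octets 80 through FF, which arise from non-ASCII characters, must also be escaped.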

Again, the octets of the UTF-8 encoding of the URIRef in question:

<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>

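Inspecting octet by octet, one finds that exactly one of the encoded 
URIRef’s octets needs percent-escaping: 20, the space. The comma (2C) is 
not in the list and stays as it is. The resulting URI is:

<data:,Hello,%20World>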
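
Both steps together fit in a short Python function. This is a sketch of the procedure as quoted above, not a reference implementation; the names `DISALLOWED_ASCII` and `uriref_to_uri` are hypothetical.

```python
# Excluded US-ASCII octets, minus "#", "%", "[" and "]" (re-allowed by RDF):
# controls 00-1F, space 20, DEL 7F, and the remaining delims/unwise characters.
DISALLOWED_ASCII = frozenset(range(0x00, 0x21)) | {0x7F} | frozenset(b'"<>\\^`{|}')


def uriref_to_uri(uriref: str) -> str:
    """Convert an RDF URIRef to a URI: UTF-8 encode, then percent-escape."""
    out = []
    for octet in uriref.encode("utf-8"):           # step 1
        if octet >= 0x80 or octet in DISALLOWED_ASCII:
            out.append(f"%{octet:02X}")            # step 2: escape this octet
        else:
            out.append(chr(octet))                 # permitted: copy unchanged
    return "".join(out)


# 20 (space) is in the disallowed list; 2C (comma) is not.
print(uriref_to_uri("data:,Hello, World"))  # data:,Hello,%20World
```

The function never inspects the scheme, and it never touches an existing percent sign.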

> Since the comma is illegal in the URI after the first one?

Whether a second comma is legal is a matter for the “data” scheme. The 
RDF conversion has no knowledge of schemes.

> 3) If there was an ambigious situation, how would it be represented as an
> URIRef?

There may be situations that confuse humans, but the specification 
leaves no ambiguity.

> For example take the URI (yes, it is an unusual case where the name
> contains a slash - it is just an example):
> 
> <http://example.com/name%2Fslash/>
> 
> Would that be converted to the URIRef:
> 
> <http://example.com/name/slash/>

No. There is no provision for reversing percent-escapes.

> But wouldn't the URIRef:
> 
> <http://example.com/name%2Fslash/>
> 
> be converted to the URI:
> 
> <http://example.com/name%252Fslash/>

No. “Percent” signs in URIRefs must remain as they are in the conversion 
to URIs. Again: “The disallowed octets that must be [percent]-escaped 
include all those that do not correspond to US-ASCII characters, and the 
excluded characters listed in Section 2.4 of [URI], except for the 
number sign (#), percent sign (%), and the square bracket characters 
re-allowed in [RFC-2732].”
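
For contrast, a general-purpose quoting function does behave the way the question anticipates. The Python sketch below uses the standard library’s `urllib.parse.quote` and `unquote`; the `safe` arguments are my own choice for the illustration and are not part of any RDF rule.

```python
from urllib.parse import quote, unquote

uriref = "http://example.com/name%2Fslash/"

# quote() escapes "%" unless told otherwise, producing the double escape
# that the RDF conversion does NOT perform.
double_escaped = quote(uriref, safe="/:")

# Declaring "%" safe leaves existing escapes alone, as RDF's rule requires.
preserved = quote(uriref, safe="/:%")

# unquote() reverses escapes; RDF has no such step in either direction.
reversed_escape = unquote(uriref)

print(double_escaped)   # http://example.com/name%252Fslash/
print(preserved)        # http://example.com/name%2Fslash/
print(reversed_escape)  # http://example.com/name/slash/
```

Only the middle result matches what RDF’s conversion yields for this URIRef.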

> [1] http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

-- 
Etan Wexler.

Received on Saturday, 9 July 2005 20:03:50 UTC