- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sat, 09 Jul 2005 16:06:47 -0400
- To: Semantic-Web Interest Group <semantic-web@w3.org>, James Cerra <jfcst24_public@yahoo.com>
Jimmy Cerra (“James Cerra”) wrote to the Semantic-Web-Interest-Group
list (<mailto:semantic-web@w3.org>) on 23 June 2005 in “URI Reference
questions” (<mid:20050624035130.64868.qmail@web42201.mail.yahoo.com>,
<http://www.w3.org/mid/20050624035130.64868.qmail@web42201.mail.yahoo.com>):
> URIRefs are always encoded in UTF-8 too. Correct?
I’m not sure that I understand the intended question. Taking your
phrasing literally, the answer is “no”. URIRefs are sequences of
characters. Characters are abstractions that have many possible
encodings. Thus URIRefs have many possible encodings.
> Say we have the URIRef:
>
> <data:,Hello, World>
>
> Is that legal?
It is a legal URIRef.
> Would that be converted into the URI:
>
> <data:,Hello%2C%20World>
No, that URI is not the URI produced by RDF’s regulations. Let’s follow
procedure.
The first step is “encoding the Unicode string as UTF-8, giving a
sequence of octet values.” The octet sequence (in hexadecimal notation) is:
<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>.
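As a quick check of this first step, here is a minimal Python sketch. Python is my choice for illustration, not something the RDF specification prescribes, and the helper name utf8_octets_hex is hypothetical. It encodes the URIRef as UTF-8 and prints the octets in hexadecimal:

```python
def utf8_octets_hex(uriref: str) -> str:
    """Return the UTF-8 octets of a URIRef as space-separated uppercase hex."""
    return " ".join(f"{octet:02X}" for octet in uriref.encode("utf-8"))


# The URIRef from the question.
print(utf8_octets_hex("data:,Hello, World"))
```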
The second step is “[percent]-escaping octets that do not correspond to
permitted US-ASCII characters.” Furthermore, “The disallowed octets that
must be [percent]-escaped include all those that do not correspond to
US-ASCII characters, and the excluded characters listed in Section 2.4
of [URI], except for the number sign (#), percent sign (%), and the
square bracket characters re-allowed in [RFC-2732].”
The citation for URI is of RFC 2396. “Excluded US-ASCII Characters”,
section 2.4.3 of RFC 2396, gives the following grammar rules.
    control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
    space   = <US-ASCII coded character 20 hexadecimal>
    delims  = "<" | ">" | "#" | "%" | <">
    unwise  = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Mapping characters to the octets that US-ASCII employs to encode those
characters, but leaving out the characters specially allowed by RDF, we
have as follows, re-ordered to ascend by octet value.
(00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F, 20, 22, 3C, 3E,
5C, 5E, 60, 7B, 7C, 7D, 7F)
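The derivation of that list, and the escaping step itself, can be sketched in Python. This is only an illustration of the two quoted steps under my reading of them, not a reference implementation. The names CONTROL, SPACE, DELIMS, UNWISE, RDF_ALLOWED, DISALLOWED_OCTETS and uriref_to_uri are all hypothetical.

```python
# Excluded US-ASCII characters, following section 2.4.3 of RFC 2396.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")

# Characters that RDF specially allows to stay unescaped.
RDF_ALLOWED = set("#%[]")

DISALLOWED_OCTETS = {
    ord(ch) for ch in (CONTROL | SPACE | DELIMS | UNWISE) - RDF_ALLOWED
}


def uriref_to_uri(uriref: str) -> str:
    """Encode as UTF-8, then percent-escape non-ASCII and disallowed octets."""
    out = []
    for octet in uriref.encode("utf-8"):
        if octet >= 0x80 or octet in DISALLOWED_OCTETS:
            out.append(f"%{octet:02X}")
        else:
            out.append(chr(octet))
    return "".join(out)


# Print the disallowed octets in ascending order, as in the list above.
print(" ".join(f"{o:02X}" for o in sorted(DISALLOWED_OCTETS)))

# A non-ASCII character (U+00E9) becomes two escaped UTF-8 octets.
print(uriref_to_uri("http://example.com/\u00e9"))  # http://example.com/%C3%A9
```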
Again, the octets of the UTF-8 encoding of the URIRef in question:
<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>
Inspecting octet by octet, one finds that only the space (octet 20) is on
the disallowed list and needs percent-escaping; the commas (octet 2C) do
not. The resulting URI is <data:,Hello,%20World>.
> Since the comma is illegal in the URI after the first one?
The illegality of the comma is a matter of the “data” scheme. RDF has no
knowledge of schemes.
> 3) If there was an ambiguous situation, how would it be represented as a
> URIRef?
There may be situations that confuse humans, but the specification
leaves no ambiguity.
> For example take the URI (yes, it is an unusual case where the name
> contains a slash - it is just an example):
>
> <http://example.com/name%2Fslash/>
>
> Would that be converted to the URIRef:
>
> <http://example.com/name/slash/>
No. There is no provision for reversing percent-escapes.
> But wouldn't the URIRef:
>
> <http://example.com/name%2Fslash/>
>
> be converted to the URI:
>
> <http://example.com/name%252Fslash/>
No. “Percent” signs in URIRefs must remain as they are in the conversion
to URIs. Again: “The disallowed octets that must be [percent]-escaped
include all those that do not correspond to US-ASCII characters, and the
excluded characters listed in Section 2.4 of [URI], except for the
number sign (#), percent sign (%), and the square bracket characters
re-allowed in [RFC-2732].”
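To illustrate that an existing percent sign passes through untouched, here is a small self-contained Python sketch of the escaping rule as quoted. It is my own illustration, not code from the specification, and the names DISALLOWED and escape_uriref are hypothetical.

```python
# Octets to escape: controls 00-1F, space 20, DEL 7F, and the excluded
# characters of RFC 2396 section 2.4.3 minus "#", "%", "[" and "]".
DISALLOWED = set(range(0x00, 0x21)) | {0x7F} | {ord(c) for c in '<>"{}|\\^`'}


def escape_uriref(uriref: str) -> str:
    """Percent-escape non-ASCII and disallowed octets of the UTF-8 encoding."""
    return "".join(
        f"%{o:02X}" if o >= 0x80 or o in DISALLOWED else chr(o)
        for o in uriref.encode("utf-8")
    )


# The "%" (octet 25) is not in the disallowed set, so "%2F" is left alone
# and is not turned into "%252F".
print(escape_uriref("http://example.com/name%2Fslash/"))
# prints http://example.com/name%2Fslash/
```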
> [1] http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
--
Etan Wexler.
Received on Saturday, 9 July 2005 20:03:50 UTC