- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sat, 09 Jul 2005 16:06:47 -0400
- To: Semantic-Web Interest Group <semantic-web@w3.org>, James Cerra <jfcst24_public@yahoo.com>
Jimmy Cerra (“James Cerra”) wrote to the Semantic-Web-Interest-Group list (<mailto:semantic-web@w3.org>) on 23 June 2005 in “URI Reference questions” (<mid:20050624035130.64868.qmail@web42201.mail.yahoo.com>, <http://www.w3.org/mid/20050624035130.64868.qmail@web42201.mail.yahoo.com>):

> URIRefs are always encoded in UTF-8 too. Correct?

I’m not sure that I understand the intended question. Taking your phrasing literally, the answer is “no”. URIRefs are sequences of characters. Characters are abstractions that have many possible encodings. Thus URIRefs have many possible encodings.

> Say we have the URIRef:
>
> <data:,Hello, World>
>
> Is that legal?

It is a legal URIRef.

> Would that be converted into the URI:
>
> <data:,Hello%2C%20World>

No, that URI is not the URI produced by RDF’s regulations. Let’s follow procedure.

The first step is “encoding the Unicode string as UTF-8, giving a sequence of octet values.” The octet sequence (in hexadecimal notation) is:

<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>.

The second step is “[percent]-escaping octets that do not correspond to permitted US-ASCII characters.” Furthermore, “The disallowed octets that must be [percent]-escaped include all those that do not correspond to US-ASCII characters, and the excluded characters listed in Section 2.4 of [URI], except for the number sign (#), percent sign (%), and the square bracket characters re-allowed in [RFC-2732].”

The citation for URI is of RFC 2396. “Excluded US-ASCII Characters”, section 2.4.3 of RFC 2396, gives the following grammar rules.

   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
   space       = <US-ASCII coded character 20 hexadecimal>
   delims      = "<" | ">" | "#" | "%" | <">
   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Mapping characters to the octets that US-ASCII employs to encode those characters, but leaving out the characters specially allowed by RDF, we have as follows, re-ordered to ascend by octet value.

(00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F,
 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B, 1C, 1D, 1E, 1F,
 20, 22, 3C, 3E, 5C, 5E, 60, 7B, 7C, 7D, 7F)

Again, the octets of the UTF-8 encoding of the URIRef in question:

<64 61 74 61 3A 2C 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64>

Inspecting octet-by-octet, one finds that exactly one of the encoded URIRef’s octets needs percent-escaping: 20, the space, which appears in the list above. Neither comma (2C) is on the list. The resulting URI is therefore <data:,Hello,%20World>, with the space escaped and both commas left as they are.

> Since the comma is illegal in the URI after the first one?

The illegality of the comma is a matter of the “data” scheme. RDF has no knowledge of schemes.

> 3) If there was an ambigious situation, how would it be represented as an
> URIRef?

There may be situations that confuse humans, but the specification leaves no ambiguity.

> For example take the URI (yes, it is an unusual case where the name
> contains a slash - it is just an example):
>
> <http://example.com/name%2Fslash/>
>
> Would that be converted to the URIRef:
>
> <http://example.com/name/slash/>

No. There is no provision for reversing percent-escapes.

> But wouldn't the URIRef:
>
> <http://example.com/name%2Fslash/>
>
> be converted to the URI:
>
> <http://example.com/name%252Fslash/>

No. “Percent” signs in URIRefs must remain as they are in the conversion to URIs. Again: “The disallowed octets that must be [percent]-escaped include all those that do not correspond to US-ASCII characters, and the excluded characters listed in Section 2.4 of [URI], except for the number sign (#), percent sign (%), and the square bracket characters re-allowed in [RFC-2732].”

> [1] http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

--
Etan Wexler.
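
For readers who want to try the two-step procedure quoted in this message, here is a minimal sketch in Python. The names `DISALLOWED_OCTETS` and `uriref_to_uri` are illustrative and do not come from the RDF specification. The octet set is transcribed from the list given in the message, and the sketch assumes the input is already a well-formed URIRef.

```python
# Octets that must be percent-escaped: the RFC 2396 section 2.4.3 exclusions,
# minus "#", "%", "[" and "]", which RDF specially allows (per the message).
DISALLOWED_OCTETS = set(range(0x00, 0x21)) | {
    0x22, 0x3C, 0x3E, 0x5C, 0x5E, 0x60, 0x7B, 0x7C, 0x7D, 0x7F,
}


def uriref_to_uri(uriref: str) -> str:
    """Hypothetical helper: apply the two-step URIRef-to-URI conversion."""
    pieces = []
    # Step 1: encode the Unicode string as UTF-8, giving a sequence of octets.
    for octet in uriref.encode("utf-8"):
        # Step 2: percent-escape non-US-ASCII octets and the disallowed ones.
        if octet > 0x7F or octet in DISALLOWED_OCTETS:
            pieces.append(f"%{octet:02X}")
        else:
            pieces.append(chr(octet))
    return "".join(pieces)


# A percent sign already present in the URIRef is left alone.
print(uriref_to_uri("http://example.com/name%2Fslash/"))
# A non-ASCII character is escaped octet by octet after UTF-8 encoding.
print(uriref_to_uri("http://example.com/caf\u00e9"))
```

Because the percent sign is not in the escaped set, the function never double-escapes an existing escape, and because it only ever adds escapes, it never reverses one.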
Received on Saturday, 9 July 2005 20:03:50 UTC