- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 03 Feb 2003 14:08:16 -0500
- To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
- Cc: Michel Suignard <michelsu@microsoft.com>, www-international@w3.org
At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:
>Hello,
>
>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.

> 2.3 IRIEverywhere-27
>    [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27
>
> [Ian]
>    TB: Lots of software is already treating %7e and %7E as the
>    same.
>
> [Chris]
>    that they are equivalent to the *actual character* represented
>    uri spec is way fuzzy on this
>    actual practice is that they are the same
>
> [Ian]
>    TB: Per 2396, I think Web robots are in their rights; lots of
>    Web robots do this.
>    RF: Yes, that's my understanding.
>    PC: Chris, could you explain impact?
>
> [DanCon]
>    I think it's straightforward to read the URI spec as saying
>    that [26]http://a/%7E and [27]http://a/%7e are distinct URIs,
>    and may or may not refer to the same resource.
>    roy, you think otherwise?
>
>    [26] http://a/%7E
>    [27] http://a/%7e
>
> [Roy]
>    I think otherwise
>
> [Stuart]
>    I read it the same as Dan :-(
> [DanCon]
>    sigh; each of those URIs is a sequence of 12 characters. they
>    differ in their 12th character. hence they're different URIs.
>    RFC2396 says otherwise?
> [Chris]
>    it means that the *actual kanji* and the sequence of hexifyied
>    octets compare to the same
>    which helps in roundtripping a very great deal
> [TBray]
>    was ignoring IRC... yes, lots of software will decide those two
>    URIs are the same in their cache
>
> [Chris]
>    no, %7e is one octet
> [Zakim]
>    DanCon, you wanted to suggest the value of having %7E specified to be
>    equivalent to %7e is purely aesthetic, and not *nearly* worth
>    the cost.
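The two readings in the minutes above can be made concrete with a small sketch. It is only an illustration, not something from the minutes: the variable names are made up, and Python's urllib.parse.unquote_to_bytes stands in for "the octets that the escapes represent". Compared character by character, the two URIs differ in their 12th character, as DanCon says; compared after decoding the escapes, they give the same octets, as Chris says.

```python
from urllib.parse import unquote_to_bytes

# The two example URIs from the minutes (names here are illustrative only).
uri_upper = "http://a/%7E"
uri_lower = "http://a/%7e"

# Character-level view: list the 1-based positions where the strings differ.
differing = [
    i + 1
    for i, (x, y) in enumerate(zip(uri_upper, uri_lower))
    if x != y
]

# Octet-level view: decode every %XX escape into the octet it represents.
# Assumption: decoding the whole URI at once is acceptable here only
# because neither example escapes a reserved character.
octets_upper = unquote_to_bytes(uri_upper)
octets_lower = unquote_to_bytes(uri_lower)

print(len(uri_upper), len(uri_lower), differing)  # 12 12 [12]
print(uri_upper == uri_lower)                     # False
print(octets_upper == octets_lower, octets_upper) # True b'http://a/~'
```

Which of the two comparisons a piece of software should use is exactly the open question; the sketch only shows that both camps are describing the same pair of strings accurately.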
So let's look at RFC 2396 on this:

First, we find in '1.1 Overview of URI'

   URI are characterized by the following definitions:

   Uniform
      Uniformity provides several benefits: it allows different types
      of resource identifiers to be used in the same context, even
      when the mechanisms used to access those resources may differ;
      it allows uniform semantic interpretation of common syntactic
      conventions across different types of resource identifiers;

The %-escape convention can definitely be seen as a 'common syntactic
convention'.

Next, "2.1 URI and non-ASCII characters" says:

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited component of a URI, a sequence of characters may be
   used to represent a sequence of octets. For example, the character
   "a" represents the octet 97 (decimal), while the character sequence
   "%", "0", "a" represents the octet 10 (decimal).

This is rather confusing. It says 'a scheme may define', but then
goes on positively as 'the character "a" represents the octet 97
(decimal)'. There are several ways to read this, one of which being
that a scheme can define additional (or alternative??) ways to
represent octets; base64 as used in the data: uri scheme would then
be an example. I have requested that an issue be opened for
rfc2396bis for this.
(see http://lists.w3.org/Archives/Public/uri/2003Jan/0025.html)

More from this section:

   In the simplest case, the original character sequence contains only
   characters that are defined in US-ASCII, and the two levels of
   mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.
It is unclear whether the 'either/or' in the last sentence means
'you can always choose either one', or whether it means 'you use
the US-ASCII character if that's available, and otherwise escape'.
The 'you can always choose either one' reading would definitely only
apply within the restriction of unreserved characters. For reserved
characters, the escaped and unescaped forms have different meanings.

From "2.4.2. When to Escape and Unescape" we have:

   A URI is always in an "escaped" form, since escaping or unescaping a
   completed URI might change its semantics. Normally, the only time
   escape encodings can safely be made is when the URI is being created
   from its component parts; each component may have its own set of
   characters that are reserved, so only the mechanism responsible for
   generating or interpreting that component can determine whether or
   not escaping a character will change its semantics. Likewise, a URI
   must be separated into its components before the escaped characters
   within those components can be safely decoded.

This seems to say that escaping or unescaping can only be done with
detailed knowledge of the scheme. However, this could be read in
two ways:
1) it is unclear which characters exactly from the 'reserved'
   category will actually be reserved somewhere, so these
   considerations apply only to these characters, and not
   to the unreserved ones; or
2) this applies to all characters: despite the fact that there
   is a 'reserved' character category, *any* character could
   potentially be reserved in a new scheme.

From the same section:

   In some cases, data that could be represented by an unreserved
   character may appear escaped; for example, some of the unreserved
   "mark" characters are automatically escaped by some systems. If the
   given URI scheme defines a canonicalization algorithm, then
   unreserved characters may be unescaped according to that algorithm.
   For example, "%7e" is sometimes used instead of "~" in an http URL
   path, but the two are equivalent for an http URL.
This says 'treat "%7e" and "~" as equivalent only if you know about
the scheme'. This seems to point pretty strongly in one direction.

So my conclusion is that RFC 2396 contains very little text about
equivalence between "%7e" and "%7E", and is not very conclusive
about the equivalence between "%7e/E" and "~" either. The strongest
argument for these equivalences is probably that the escape syntax
appears in most parts of the generic uri syntax and also in the
opaque syntax, and that (with the exception of reserved characters)
the spec says that all three represent the same octet.

Regards,    Martin.
Received on Monday, 3 February 2003 14:13:05 UTC