- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 03 Feb 2003 14:13:21 -0500
- To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
- Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Dear TAG, others,

Please allow me a few comments (and hopefully clarifications) on one
of the issues below. This mail will mostly concentrate on one specific
issue, namely what RFC 2396 says about 'characters'. Other material
in separate mails.

At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:
>Hello,
>
>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.

> 2.3 IRIEverywhere-27

While IRIEverywhere-27 is related to URIEquivalence-15, and
URIEquivalence-15 is of high importance to IRIs, most of the
discussion below is actually about URIEquivalence-15.

>  1. [25]IRIEverywhere-27
>    1. Action MD and CL 2002/11/18: Write up text about
>       IRIEverywhere-27 for spec writers to include in their spec.

I sent some text to Chris quite a while ago. For the benefit of
everybody, I'll send this as a separate mail.

>    2. Action CL 2002/11/18: Write up finding for IRIEverywhere-27
>       (from TB and TBL, a/b/c), to include MD's text. Process with
>       IJ; awaiting comments from MD
>
> [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27
>
> [Ian]
>
> CL: IJ and I discussed this last week. We drew up some text.:
> Text based on input from Martin Duerst.
> DC: Value of treating e and E as equivalent is a huge cost.
> What actual value do you get?
> CL: Actual value is that both are equivalent to the same
> character.
> TB: Lots of software is already treating %7e and %7E as the
> same.
>
> [Chris]
> that they are equivalent to the *actual character* represented
> uri spec is way fuzzy on this
> actual practice is that they are the same
>
> [Ian]
> TB: Per 2396, I think Web robots are in their rights; lots of
> Web robots do this.
> RF: Yes, that's my understanding.
> PC: Chris, could you explain impact?
>
> [DanCon]
> I think it's straightforward to read the URI spec

I think in general, there is too much 'trying to read the coffee
grounds (aka spec)' in what follows.
While this is certainly very important (and I'll do some more of it
below), actual practice should also be considered carefully.

>as saying
> that [26]http://a/%7E and [27]http://a/%7e are distinct URIs,
> and may or may not refer to the same resource.
> roy, you think otherwise?
>
> [26] http://a/%7E
> [27] http://a/%7e
>
> [Roy]
> I think otherwise
>
> [Stuart]
> I read it the same as Dan :-(
>
> [Ian]
> CL: There is a bigger effect on IRI spec and suggestions for
> RFC2396.
>
> [DanCon]
> sigh; each of those URIs is a sequence of 12 characters. they
> differ in their 12th character. hence they're different URIs.
> RFC2396 says otherwise?
>
> [Chris]
> this has more effect on IRI comparison (which is done by
> transformation to URI and then comparing)
>
> [Ian]
> Action CL: Please propose text IJ and CL worked on to www-tag
> (flipping the ACL).
>
> [Roy]
> %7e is one character -- three octets
>
> [Chris]
> it means that the *actual kanji* and the sequence of hexifyied
> octets compare to the same
> which helps in roundtripping a very great deal
>
> [Roy]
> oops one octet -- char[$1\47]
>
> [DanCon]
> "%7e" is *one* character???
>
> [Roy]
> "character" is defined in spec
>
> [TBray]
> was ignoring IRC... yes, lots of software will decide those two
> URIs are the same in their cache
>
> [Chris]
> no, %7e is one octet

okay, Dan says %7e are three characters, Roy says it's one character
(or both actually say that RFC 2396 says so). Let's see:

In '1.1 Overview of URI'

"Identifier
   An identifier is an object that can act as a reference to
   something that has identity. In the case of URI, the object is
   a sequence of characters with a restricted syntax."

That text seems to leave possibilities wide open.

In '1.5. URI Transcribability', we have:

"A URI is a sequence of characters from a very limited set, i.e. the
letters of the basic Latin alphabet, digits, and a few special
characters.
A URI may be represented in a variety of ways: e.g., ink on paper,
pixels on a screen, or a sequence of octets in a coded character set.
The interpretation of a URI depends only on the characters used and
not how those characters are represented in a network protocol."

Although this is not totally conclusive, Dan's reading seems to match
here quite a bit better than Roy's.

From the same section, a bit later:

"* A URI may be transcribed from a non-network source, and thus
   should consist of characters that are most likely to be able to
   be typed into a computer, within the constraints imposed by
   keyboards (and related input devices) across languages and
   locales."

Again, although not completely conclusive, this text seems to work
much better if we can say that the user types '%7e' as three
characters, rather than as one. (not that keyboards can't be
configured to produce '%7e' with a single keystroke, but that's
very rarely done :-)

Again from the same section, a bit later:

"For example, it is often the case that the most meaningful name for
a URI component would require characters that cannot be typed into
some systems."

This is referring to characters that might be e.g. written as %FC,
or %C3%BC, or so, but by saying 'would require', it seems to somehow
imply that such a character is not what is talked about in RFC 2396.

From '1.6. Syntax Notation and Common Elements':

"Unlike many specifications that use a BNF-like grammar to define the
bytes (octets) allowed by a protocol, the URI grammar is defined in
terms of characters. Each literal in the grammar corresponds to the
character it represents, rather than to the octet encoding of that
character in any particular coded character set."

This, together with text from '2.4.1. Escaped Encoding':

   An escaped octet is encoded as a character triplet, consisting of
   the percent character "%" followed by the two hexadecimal digits
   representing the octet code.
   For example, "%20" is the escaped encoding for the US-ASCII space
   character.

      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

clearly makes it valid to talk about "%" (as well as "7" and "e")
as characters.

From "2. URI Characters and Escape Sequences":

"URI consist of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as
delimiters around URI were excluded. The restricted set of
characters consists of digits, letters, and a few graphic symbols
were chosen from those common to most of the character encodings
and input facilities available to Internet users.

      uric          = reserved | unreserved | escaped"

The heading lets us expect a definition of URI characters. If we
interpret 'uric' to stand for 'URI Character', then we have a very
clear definition that agrees with Roy. But the text never explicitly
says that 'uric' is 'URI Character'.

Just immediately following:

"Within a URI, characters are either used as delimiters, or to
represent strings of data (octets) within the delimited portions.
Octets are either represented directly by a character (using the
US-ASCII character for that octet [ASCII]) or by an escape encoding."

This explains that all characters in a URI are used to represent
octets. 'a' represents the octet <61>, '~' represents the octet <7e>,
%88 represents the octet <88>, and so on. In this sense, it might
even be possible to say that Roy got it backwards; it would be
possible to say
   "%7e is one octet -- three characters"
or at least
   "%7e represents one octet with three characters"
rather than
   "%7e is one character -- three octets"

So overall, my conclusion on the question of whether RFC 2396 would
talk about '%7e' as one character (Roy) or three (Dan) is that for
the most part, three is much more plausible.
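
The two readings can be made concrete with a small Python sketch. It is
only an illustration and not part of RFC 2396 or the minutes: the helper
name first_difference and the regular expression ESCAPED are assumptions
introduced here. The sketch looks at the two URIs from the minutes first
as sequences of characters, then as the octets those characters represent.

```python
import re
from urllib.parse import unquote_to_bytes

# Assumed regex form of the RFC 2396 rule: escaped = "%" hex hex
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")


def first_difference(a, b):
    """Return the 1-based position of the first differing character, or None."""
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    # One string may be a proper prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b)) + 1


upper = "http://a/%7E"
lower = "http://a/%7e"

# Reading 1: a URI is a sequence of characters.
print(len(upper), len(lower))          # 12 12
print(first_difference(upper, lower))  # 12
print(upper == lower)                  # False

# Reading 2: the characters represent octets; compare those instead.
print(unquote_to_bytes(upper) == unquote_to_bytes(lower))  # True

# "%7e" matches 'escaped' as three characters and represents one octet.
print(bool(ESCAPED.fullmatch("%7e")), len("%7e"), len(unquote_to_bytes("%7e")))
```

Compared as characters the two URIs differ in their 12th character;
compared as represented octets they coincide. Software that treats them
as the same is in effect comparing the second form.
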
RFC 2396 defines '%7e' to be one instance of the syntax rule 'uric',
but it doesn't explicitly say that 'uric' is a character.

Regards,    Martin.
Received on Monday, 3 February 2003 14:13:52 UTC