- From: Francois Yergeau <yergeau@alis.com>
- Date: Sun, 13 Apr 1997 22:59:37 -0400
- To: John C Klensin <klensin@mci.net>
- Cc: uri@bunyip.com
At 11:41 12-04-97 -0400, John C Klensin wrote:
>While I'm very anxious to see a real solution that
>addresses the underlying issues here, I'm forced to agree
>with Larry. We don't "make" things happen by standardizing
>untested ideas and arguments, however logical, that
>things are easy to do don't move the discussion forward
>much.

Yet this is exactly how HTTP/1.1 was made to happen. Untested things were
discussed and put into drafts. Some testing took place along the way, but at
some point the spec was declared a Proposed Standard, before there was a
single full implementation that embodied what you want here:

> ... a demonstration that it works
>well, that it won't cause significant problems with
>existing (unmodified) clients, servers, or users, etc.

By contrast, what we have now is a refusal to even take the first step: to
put things into the draft so that the issue can be addressed.

> I don't think that timing of standards are much of
>the issue here.

Indeed, it doesn't matter much whether URL syntax becomes a Draft Standard
now or 6 months later. But it does matter that an unsound spec doesn't make
it to DS.

URLs are written on paper (characters) and transmitted over the wire (bytes).
Thus an unambiguous mapping between characters and bytes is *required*. This
mapping currently exists for only a tiny fraction of possible characters,
namely ASCII. Since Web forms are submitted using URLs, and can contain
almost any text, it is neither desirable nor possible to restrict the
repertoire of characters.

The current spec does not recognize this and pretends that (section 2): "All
URLs consist of a restricted set of characters, primarily chosen to aid
transcribability and usability both in computer systems and in non-computer
communications." In other words, it places a purported transcribability
requirement ahead of the simple fact that current practice uses other
characters all the time. Oh, of course, these non-ASCII characters are
escaped to ASCII using %-encoding, but there is still no defined mapping from
characters to bytes. And there is no defined mapping from bytes to characters
for half the possible byte values, precluding any sensible display of URLs
representing non-ASCII characters (a short sketch at the end of this message
illustrates the ambiguity).

In short, the current spec is technically unsound and broken, and needs
fixing not to extend it with new capabilities, but to bring it in line with
widespread current practice.

This discussion has been going on for months in various circles, lists and
conferences, with no resolution. The reason, it seems to me, is the continued
failure to fully recognize that mapping only ASCII characters is not a
solution. While it may be acceptable to restrict bytes over the wire to 7
bits (but why?), it is not acceptable to limit the character repertoire to a
subset of ASCII. URLs are widely put to uses where there is no such limit.

>And, as I have said many times before, while I recognize
>and accept the enthusiasm for UTF-8, especially among users
>of languages with Latin-based alphabetic systems, I would
>prefer that, when we make protocol decisions that are
>expected to have very long lifetimes, we use systems that
>don't penalize non-Roman language groups as severely as
>UTF-8 tends to do.

This has also been discussed at length. The trade-off is compatibility with
all of current practice (ASCII-based) vs. an undeniable byte-count penalty
for non-Latin scripts. For short strings such as URLs, I'm afraid the
technical choice is clear.
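
To put rough numbers on that penalty, here is a small illustrative sketch (in
present-day Python, used only as a convenient calculator; the sample strings
are arbitrary, not drawn from any spec). It counts the UTF-8 bytes and the
%-escaped characters needed for a Latin-script string versus two non-Latin
ones:

    # Rough sketch: cost of UTF-8 plus %-escaping for a few sample strings.
    # The strings are arbitrary examples chosen only for illustration.
    from urllib.parse import quote

    samples = {
        "French":   "éléphant",   # Latin script with a few accents
        "Greek":    "αλφάβητο",   # non-Latin alphabetic script
        "Japanese": "東京",        # ideographic script
    }

    for name, text in samples.items():
        utf8_bytes = text.encode("utf-8")
        escaped = quote(text)      # quote() percent-escapes using UTF-8 by default
        print(f"{name:9} {len(text)} chars -> "
              f"{len(utf8_bytes)} UTF-8 bytes -> {len(escaped)} escaped chars")

The penalty is real (each non-ASCII byte costs three escaped characters, and
UTF-8 spends two or three bytes per character outside ASCII), but for strings
as short as URL components it amounts to a few dozen characters at most.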
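
And, to make the earlier point about the missing character-to-byte mapping
concrete, a second sketch (same caveats: modern Python, illustration only).
The same character "é" %-escapes to two different byte sequences depending on
which character set the producer silently assumed, and a lone escaped byte
such as %E9 has no defined reading on the way back:

    # Sketch: %-encoding fixes the bytes, but which bytes a character maps to
    # is left entirely to the producer's (undeclared) character set.
    from urllib.parse import quote, unquote

    c = "é"
    print(quote(c, encoding="latin-1"))   # '%E9'
    print(quote(c, encoding="utf-8"))     # '%C3%A9'

    # Reading the bytes back is just as ambiguous: %E9 is 'é' under Latin-1
    # but is not valid UTF-8 at all, so a display agent can only guess.
    print(unquote("%E9", encoding="latin-1"))                  # 'é'
    print(unquote("%E9", encoding="utf-8", errors="replace"))  # '\ufffd'

A receiver that sees only "%E9" on the wire cannot tell which character was
meant, which is exactly why the spec needs to name a mapping.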
-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547   Fax: +1 (514) 747-2561