- From: Dan Connolly <connolly@w3.org>
- Date: Mon, 14 Apr 1997 12:54:41 -0500
- To: Francois Yergeau <yergeau@alis.com>
- Cc: uri@bunyip.com, bert@w3.org
Francois Yergeau wrote:
> > Any application that transmits a URL in
> > non-ASCII characters is declared non-compliant.
>
> You are confusing characters and bytes. While you may want to restrict the
> transmitted bytes to 7 bits (but again, why?), you cannot restrict the
> range of characters. Hence a full mapping is required, not ASCII-only.
> The current spec omits that mapping.

I have been shooting from the hip on this I18N/URL stuff for a while, but some folks at WWW6 wanted the full weight of W3C behind it, so I've been trying to think more carefully. And this issue of transcribing non-ascii URLs particularly concerns me.

On the one hand, it makes a lot of sense that if a user creates a file and gives it a hebrew or arabic or CJK name, and then exports the file via an HTTP server, the Address: field in a web browser should show the hebrew or arabic or ... characters faithfully.

On the other hand, suppose that address is to be printed and put in an advertisement or a magazine article. Should it print the hebrew/arabic/CJK characters using those glyphs? Or should it print ASCII glyphs corresponding to the characters of the %xx encoding of the original characters?

If the former, then reliability suffers: the odds that a random person on the globe can faithfully key in a hebrew/arabic/CJK name seem considerably lower than the odds that they can key in an ASCII name. (Though the odds of correctly transcribing a long sequence of %xx codes are vanishingly small too...) (I'm not saying that everybody knows English, but rather that a person using a computer connected to the internet has a fairly high probability of being able to match the 'a' character on a piece of paper to the 'a' character on the keyboard.)

If the latter, then the system is very much biased toward the *American* Standard Code for Information Interchange.

It seems to me that the minimally constraining thing to do is to specify both and allow folks to choose: specify how Unicode strings fit into URLs, and then advise folks to use a small subset of Unicode if their audience is international (and at the same time, add a few more notes: perhaps advise folks that mixing upper and lower case increases the risk of transcription errors).

What's the conventional wisdom among the DNS folks? Surely they face the same issue.

Regarding process, it seems clear (based on Larry M and John K's input) that specifying how Unicode strings fit into URLs is not the sort of thing one adds to a proposed standard to make it a draft standard. But I'm not terribly interested in a draft standard that doesn't address this issue -- even if only to say "we thought about encoding Unicode in URLs, but decided against it for the following reasons...."

In either case, a separate internet draft on the subject seems like a perfectly good idea. I don't think the risk of "incompatible standards" is unmanageable.

Larry has asked for implementation experience. Such experience seems to be growing. None of the implementors has reported any problems (as far as I can see).

Regarding Jigsaw and Amaya... Support in Jigsaw should be easy. I'll look into it. Anybody want to do it for me? Should be a quick hack. Support in Amaya would be more work. I don't think we've crossed the hurdle of getting non-western fonts working in Amaya, not to mention internationalized input.

--
Dan Connolly, W3C Architecture Domain Lead <connolly@w3.org>
+1 512 310-2971
http://www.w3.org/People/Connolly/
PGP:EDF8 A8E4 F3BB 0F3C FD1B 7BE0 716C FF21
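[Editorial note: a minimal sketch of the kind of mapping discussed above, assuming Unicode characters are first encoded as UTF-8 bytes and each byte outside a small unreserved set is then %xx-escaped. The choice of UTF-8 and the helper below are illustrative assumptions, not something the spec under discussion mandates.]

    # Sketch: map a Unicode path segment into transcribable ASCII by encoding
    # it as UTF-8 and %xx-escaping every byte outside the unreserved set.
    # (UTF-8 is an assumption here; the spec being debated does not fix a mapping.)
    UNRESERVED = set(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789-_.~"
    )

    def escape_segment(segment):
        out = []
        for byte in segment.encode("utf-8"):
            ch = chr(byte)
            out.append(ch if ch in UNRESERVED else "%%%02X" % byte)
        return "".join(out)

    # A short Hebrew name becomes a long run of %xx codes -- exactly the
    # transcription trade-off raised in the message above.
    print(escape_segment("שלום"))   # -> %D7%A9%D7%9C%D7%95%D7%9D

Both forms would name the same resource; the question raised in the message is which of the two people should be expected to read from paper and key in.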
Received on Monday, 14 April 1997 13:54:54 UTC