- From: Chris Lilley <chris@w3.org>
- Date: Tue, 28 May 2002 15:05:28 +0200
- To: www-tag@w3.org, Keith Moore <moore@cs.utk.edu>
- CC: Martin Duerst <duerst@w3.org>, Tim Bray <tbray@textuality.com>
On Tuesday, May 28, 2002, 6:35:03 AM, Keith wrote:

>> While indeed currently http is defined to use %hh escaping,
>> why would there be a need to restrict over the wire to ASCII,
>> in particular for future protocols? TCP/IP doesn't have any
>> problems with 8-bit data.

KM> TCP doesn't have any problems with arbitrary binary data
KM> either, but for some reason people often prefer to use text.
KM> The popularity of XML illustrates that rather well.

Which brings us full circle since "text" is the point of (this portion
of) my comments and is rather closer to binary data than to ascii data.

KM> For similar reasons, people often prefer to restrict the
KM> set of characters that are used for certain purposes. For instance,
KM> it's useful if resource identifiers are transmitted in a form
KM> that can be displayed on any terminal, transcribed on most
KM> keyboards, and printed on any printer.

The 'ascii is all you need' argument; even then, watered down to "most"
since there are plenty of keyboards that have no alphabetic letters (on
mobile phones and PDAs for example) or can only display upper case, or
can access ascii characters only with extra shifts or control keys;
there are also restricted printers that can only print numerals and a
few other characters and similarly there are terminals that have
restricted display capabilities. So the claim of universality is
partial at best.

I don't see a lot of call to restrict URIs over the wire to merely
ascii numerals, or indeed to the numerals zero and one which are,
technically "all you need" and will get through hostile environments
that ascii will not. Universality of input and output needs to be
balanced against other factors such as user acceptance, legibility and
efficiency.

Another flaw with this argument is that the use case is "transmission"
but the examples are all presentation - visual display for a human
reader. The one does not need to follow the other.
And on the other hand there is plenty of equipment in daily use around
the world that can display a broader and more globally useful range of
characters. At last month's Unicode conference there was an excellent
talk by an implementor about doing Arabic and Thai display and input on
a low-end current generation mobile phone with total system memory in
the hundreds of kilobytes.

The thing is, people's perception of what is "plain ordinary text, just
the characters we use every day" varies widely. People expect to see
www.starwars.com on the side of a bus or in the credits of a movie.
Other people expect to see the same thing in their own language, too,
which is after all just plain ordinary everyday text.

Which, given the current restrictions on URIs, implies that
presentation and transmission should be separated. It's clear what is
needed for each case, and that it's different for each case.

Which leaves the third part, the storage of URI references in content.
Since, as Keith points out, people prefer textual content such as XML
that can be edited and read by humans

KM> In other words, it's not TCP that's the problem - it's the
KM> inability of most human beings and their keyboards to cope
KM> with the tremendous diversity of characters that are in use.

No, not really. The "every character needs a key" argument is a
strawman, and easily torched by mention of IMEs or by the simple
expedient, on Windows, of holding down the alt key and typing four
numerals on the numeric keypad.

Also, statistically, "most human beings" deal with a character
repertoire of hundreds to thousands of characters in daily life.
Remember that Chinese is the most widely spoken language in the world
and fast becoming the most widely spoken language on the net, too.

KM> TCP is data transparent, but human eyes, minds, voices,
KM> and fingers aren't.

Thus, current restrictions on transport of data in general and URIs in
particular (transmission) should not be spuriously tied to terminals,
printers and keyboards, which are devices for human input and output.

Since you mention minds, voices and fingers, please consider a radio
advert which mentions a URL and the likelihood of successful
transmission if the announcer reads out "percent seven seven percent
seven seven percent seven seven percent two ee percent six five percent
seven eight percent six one percent six dee percent seven zero percent
six cee percent six five percent two ee percent six eff percent seven
two percent six seven" on the one hand and "www.example.org" on the
other hand.

Then translate this example to any other language, which the presenter
and the recipient both speak since otherwise the radio would be tuned
to a different station, but which does not use the ascii character
repertoire.

As you so astutely point out, human eyes, minds, voices, and fingers
can only really cope with, remember, and reliably transcribe legible,
understandable text, not some techno gobbledegook.

Thus legible, ordinary text should be used for presentation and for
storage, and converted into lengthy, computer readable but meaningless
strings of hex only for those places such as transmission over networks
where the specifications (URI, etc.) require this.

--
 Chris                          mailto:chris@w3.org
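
To make the radio example concrete, here is a minimal Python sketch of
the presentation form versus the %hh transmission form. It assumes
UTF-8 as the byte encoding underneath the escaping; the helper name
escape_every_byte and the sample path "/résumé" are hypothetical,
chosen only for illustration. Real software would normally escape only
the characters that need it, as the standard-library quote() call does,
rather than every byte.

```python
from urllib.parse import quote, unquote


def escape_every_byte(text: str) -> str:
    """Hypothetical helper: write every UTF-8 byte of text as %hh."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


# Presentation form: what a human reads, says, or types.
presentation = "www.example.org"

# Fully escaped form: the string the announcer reads out in the example.
wire = escape_every_byte(presentation)
print(wire)           # %77%77%77%2e%65%78%61%6d%70%6c%65%2e%6f%72%67
print(unquote(wire))  # www.example.org

# Assumed non-ASCII example: legible text for storage and display,
# escaped only at the point of transmission.
path = "/résumé"
escaped = quote(path)    # escapes the non-ASCII bytes, leaves "/" alone
print(escaped)           # /r%C3%A9sum%C3%A9
print(unquote(escaped))  # /résumé
```

The round trip is lossless in both directions, which is why the
conversion can be left to software at the transmission boundary while
people keep working with the legible form.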
Received on Tuesday, 28 May 2002 09:05:37 UTC