Re: Comments on charmod from Chris from Chris Lilley on 2002-05-28 (www-tag@w3.org from May 2002)

From: Chris Lilley <chris@w3.org>
Date: Tue, 28 May 2002 15:05:28 +0200
To: www-tag@w3.org, Keith Moore <moore@cs.utk.edu>
CC: Martin Duerst <duerst@w3.org>, Tim Bray <tbray@textuality.com>
Message-ID: <63262372359.20020528150528@w3.org>
On Tuesday, May 28, 2002, 6:35:03 AM, Keith wrote:

>> While indeed currently http is defined to use %hh escaping,
>> why would there be a need to restrict over the wire to ASCII,
>> in particular for future protocols? TCP/IP doesn't have any
>> problems with 8-bit data.

KM> TCP doesn't have any problems with arbitrary binary data
KM> either, but for some reason people often prefer to use text.
KM> The popularity of XML illustrates that rather well.

Which brings us full circle since "text" is the point of (this portion
of) my comments and is rather closer to binary data than to ascii
data.

KM> For similar reasons, people often prefer to restrict the
KM> set of characters that are used for certain purposes.  For instance, 
KM> it's useful if resource identifiers are transmitted in a form 
KM> that can be displayed on any terminal, transcribed on most 
KM> keyboards, and printed on any printer.

The 'ascii is all you need' argument; even then, watered down to
"most" since there are plenty of keyboards that have no alphabetic
letters (on mobile phones and PDAs for example) or can only display
upper case, or can access ascii characters only with extra shifts or
control keys; there are also restricted printers that can only
print numerals and a few other characters and similarly there are
terminals that have restricted display capabilities. So the claim of
universality is partial at best.

I don't see a lot of call to restrict URIs over the wire to merely
ascii numerals, or indeed to the numerals zero and one which are,
technically "all you need" and will get through hostile environments
that ascii will not. Universality of input and output needs to be
balanced against other factors such as user acceptance, legibility
and efficiency.

Another flaw with this argument is that the use case is "transmission"
but the examples are all presentation - visual display for a human
reader. The one does not need to follow the other.

And on the other hand there is plenty of equipment in daily use around
the world that can display a broader and more globally useful range of
characters. At last month's Unicode conference there was an excellent
talk by an implementor about doing Arabic and Thai display and input
on a low-end current generation mobile phone with total system memory
in the hundreds of kilobytes.

The thing is, people's perception of what is "plain ordinary text, just
the characters we use every day" varies widely. People expect to see
www.starwars.com on the side of a bus or in the credits of a movie.
Other people expect to see the same thing in their own language, too,
which is after all just plain ordinary everyday text.

Which, given the current restrictions on URIs, implies that
presentation and transmission should be separated. It's clear what is
needed for each case, and that it's different for each case. Which
leaves the third part, the storage of URI references in content.
Since, as Keith points out, people prefer textual content such as XML
that can be edited and read by humans

KM> In other words, it's not TCP that's the problem - it's the 
KM> inability of most human beings and their keyboards to cope 
KM> with the tremendous diversity of characters that are in use.

No, not really. The "every character needs a key" argument is a
strawman, and easily torched by mention of IMEs or by the simple
expedient, on Windows, of holding down the alt key and typing four
numerals on the numeric keypad.

Also, statistically, "most human beings" deal with a character
repertoire of hundreds to thousands of characters in daily life.
Remember that Chinese is the most widely spoken language in the world
and fast becoming the most widely spoken language on the net, too.

KM> TCP is data transparent, but human eyes, minds, voices,
KM> and fingers aren't.

Thus, current restrictions on transport of data in general and URIs in
particular (transmission) should not be spuriously tied to terminals,
printers and keyboards which are devices for human input and output.

Since you mention minds, voices and fingers please consider a radio
advert which mentions a URL and the likelihood of successful
transmission if the announcer reads out "percent seven seven percent
seven seven percent seven seven percent two ee percent six five
percent seven eight percent six one percent six dee percent seven zero
percent six cee percent six five percent two ee percent six eff
percent seven two percent six seven" on the one hand and
"www.example.org" on the other hand.
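To make the comparison concrete, here is a minimal sketch that turns the
sequence read aloud above back into the legible form. It assumes Python's
standard urllib.parse module; the names ESCAPED and decode_for_presentation
are made up for this illustration.

```python
from urllib.parse import unquote

# The string the announcer reads out, written as %hh escapes.
ESCAPED = "%77%77%77%2e%65%78%61%6d%70%6c%65%2e%6f%72%67"


def decode_for_presentation(escaped: str) -> str:
    """Turn %hh escapes back into the characters a human would read."""
    return unquote(escaped)


if __name__ == "__main__":
    legible = decode_for_presentation(ESCAPED)
    # Three characters to read aloud per escaped character, one otherwise.
    print(len(ESCAPED), "characters escaped,", len(legible), "characters legible")
    print(legible)
```

Both strings identify the same thing; only one of them is fit to be spoken,
remembered or typed.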

Then translate this example to any other language, which the presenter
and the recipient both speak since otherwise the radio would be tuned
to a different station, but which does not use the ascii character
repertoire.

As you so astutely point out, human eyes, minds, voices, and fingers
can only really cope with, remember, and reliably transcribe legible,
understandable text, not some techno gobbledegook. Thus legible,
ordinary text should be used for presentation and for storage, and
converted into lengthy, computer readable but meaningless strings of
hex only for those places such as transmission over networks where the
specifications (URI, etc) require this.
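A rough sketch of that split, using Python's standard urllib.parse module.
The choice of UTF-8 as the byte encoding before %hh escaping is an
assumption made for this example, not something the current URI
specification requires, and the function names to_wire and to_presentation
are invented here.

```python
from urllib.parse import quote, unquote


def to_wire(legible: str) -> str:
    """Convert legible text to an ASCII-only form for transmission.

    Assumption: the text is encoded as UTF-8 and every byte outside the
    unreserved ASCII set is written as a %hh escape.
    """
    return quote(legible, safe="", encoding="utf-8")


def to_presentation(wire: str) -> str:
    """Convert the transmitted form back to legible text (same assumption)."""
    return unquote(wire, encoding="utf-8")


if __name__ == "__main__":
    stored = "日本語"          # what a person reads, writes and stores
    sent = to_wire(stored)     # what goes over the network
    print(sent)
    print(to_presentation(sent) == stored)
```

The escaped form exists only between the two conversions; what is stored
in content and shown to people stays as ordinary text.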


-- 
 Chris                            mailto:chris@w3.org
Received on Tuesday, 28 May 2002 09:05:37 UTC