URL syntax: Typeability from Martin J. Duerst on 1996-12-19 (uri@w3.org from December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 19 Dec 1996 21:35:19 +0100 (MET)
To: uri@bunyip.com
Message-Id: <Pine.SUN.3.95.961219205452.245Z-100000@enoshima>

Continuing with my comments on draft-fielding-url-syntax-02.txt:

The draft stresses transcribability very strongly, and with a very
specific understanding. While a justification for why the current
syntax was chosen is definitely a good thing, I think that the
draft clearly overdoes it in this area, for various reasons:

- As shown in an earlier mail, the existing syntax has its own
	problems e.g. with European keyboards.

- The draft itself rightfully mentions the use of meaningful
	components to help people remember URLs. This is a
	great step toward recognizing a very important fact that
	some earlier designs and documents ignored or excluded.

- The draft even goes as far as admitting that there is practice
	beyond what is allowed:

 >  Excluded characters must be escaped in order to be properly
 >  represented within a URL.  However, there do exist some systems that
 >  allow characters from the "unwise" and "national" sets to be used in
 >  URL references; a robust implementation should be prepared to handle
 >  those characters when it is possible to do so.

	which may be taken as an indication that keyboarding and
	other aspects of transcribability are seen as less important
	by some users.

- The draft ignores a couple of arguments that show that taking
	a lowest-common-denominator perspective to allow "everybody"
	to transcribe "every" URL is overrated. I developed these
	arguments mostly in the recent URN discussion (special thanks
	to "devil's advocate" Keith Moore :-). Despite the important
	differences between URNs and URLs, they can be applied to URLs.

	Requiring that all URLs exist in ASCII, and only in ASCII,
	because many people cannot type anything else, sounds
	like requiring that all newspaper texts be printed with
	a minimum typesize of 16pt, because many people cannot
	read smaller print :-). Doing so would make newspapers
	overly clumsy for most (ASCII makes many URLs very clumsy
	for those that actually use them. Transcribability, when
	averaged over the weighted set of potential users, is
	significantly lower). It is not necessary because there
	are glasses (it's easy to construct a Java Web applet/page
	that provides any keyboard whatever and any additional input
	support whatever).

	Requiring all URLs to exist only in an ASCII subset also
	causes many problems when creating them. How do you create
	a (syntactically correct!) URL for Japanese files? You
	are not supposed to just input the filename into your
	HTML page!
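
To make the last point concrete, here is a minimal sketch in Python of what
an author has to do today to get a syntactically correct URL component out of
a Japanese file name. The file name is hypothetical, and the choice of
character encoding is an assumption of this sketch, not something taken from
the draft. The same name escapes to different octets depending on the
encoding chosen, and neither result can be read by a human.

```python
from urllib.parse import quote, unquote

# Hypothetical Japanese file name, used only for illustration.
filename = "日本語.html"

# Assumption: the name is converted to UTF-8 octets before escaping.
escaped_utf8 = quote(filename, safe="", encoding="utf-8")

# The same name, assuming Shift_JIS octets instead.
escaped_sjis = quote(filename, safe="", encoding="shift_jis")

print(escaped_utf8)   # %E6%97%A5%E6%9C%AC%E8%AA%9E.html
print(escaped_sjis)   # a different escape sequence for the same name

# Unescaping only recovers the name if the reader guesses the same encoding.
print(unquote(escaped_utf8, encoding="utf-8") == filename)   # True
```

The escaped form is pure ASCII, but nobody would call it easy to
transcribe or to remember.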


As a consequence of the above considerations, I argue for a
"downsizing" of the general transcribability issues in the
various parts where they are mentioned.


One particular issue in this context, remaining from an earlier mail,
is the following text:

> 2. URL Characters and Character Escaping
> 
>    All URLs consist of a restricted set of characters, chosen to
>    maximize their transcribability and usability across varying computer
>    systems, natural languages, and nationalities.  This restricted set
>    corresponds to a subset of the graphic printable characters of the
>    US-ASCII coded character set [11].

If one wants to maximise transcribability across computer systems
and languages (we dropped nationalities already), the best solution
is to choose a URL representation most practical for the majority
of the users that will use that URL. For a URL describing some
Greek resource, that will most probably be Greek, and so on.

Probably, it is therefore best to shorten that paragraph to:

>    All URLs consist of a restricted set of characters. This restricted set
>    corresponds to a subset of the graphic printable characters of the
>    US-ASCII coded character set [11].

And while we are at it, US-ASCII (as referenced) contains neither
SPACE nor DELETE, nor any control characters. All of its characters
are printable. See ECMA registration No. 6. Therefore, the text will become:

>    All URLs consist of a restricted set of characters. This restricted set
>    corresponds to a subset of the US-ASCII coded character set [11].
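
As a quick check of the point about SPACE and DELETE, the following Python
sketch enumerates the graphic characters. It assumes that the registered set
occupies code positions 0x21 through 0x7E, which is how I read the
registration; the variable names are only illustrative.

```python
# Assumption: the graphic characters sit at code positions 0x21..0x7E.
graphic = [chr(code) for code in range(0x21, 0x7F)]

print(len(graphic))         # 94
print(" " in graphic)       # False: SPACE (0x20) is outside the set
print("\x7f" in graphic)    # False: DELETE (0x7F) is outside the set
print("".join(graphic))     # every member is a printable character
```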


It's late now. It looks like I have to delay the real I18N issues
until tomorrow.

Regards,	Martin.
Received on Thursday, 19 December 1996 15:36:00 UTC