Re: Internationalization and URLs

Just some comments on several of the preceding messages:

In some cases, and URLs seem to be such a case, extending a design
to cover internationalization forces us to revisit the models used.
The (almost) uniform URL tried not to make any distinction between
what appeared on paper and what was typed into the computer.
Looking at it again, we find that this model may not satisfy user
needs and may have to change.

>> From:          Larry Masinter <masinter@parc.xerox.com>
>> Date:          Mon, 1 Jul 1996 21:19:59 PDT
>> On the issue of "non-Uniform URLs", you've pointed out that some URLs
>> have alternate forms that are equally valid, including both with and
>> without the %xx encoding. This may be true, but it still is the case
>> that <<what is printed on paper>> can be typed by everyone who sees it
>> without them actually knowing anything about character set encodings.

There are two points here:
- Not having to know anything about character sets. This is very
	important. Although I generally think that character sets and
	encodings are not that difficult a thing, even most experts in
	IETF groups get them mixed up from time to time, so it is not
	realistic to hope that the general public would understand them.
	To spare users this knowledge, it is
	important to introduce a uniform way of mapping Unicode/ISO 10646
	to the octets of the internal IURL representation, so that on any
	computer, after typing something in, it is clear how to do the
	mapping and send the result over the line.
	Even so, there are some cases that might be dubious if not
	specified clearly, e.g. some Han compatibility ideographs,
	ligatures, equivalence of combining characters, bidi,
	Latin/Greek/Cyrillic "A", etc.

- Not being able to type the character. This is definitely not an issue
	for the query part. For the other parts, the question is mostly:
	if an URL for a document in Japanese appears in a Japanese
	newspaper, is there really much harm if that URL is in
	Japanese, too? Put the other way round, what is the benefit
	in this situation of having the URL in such exotic characters
	as ASCII?
	Input per se is never really a problem: it is very easy to build
	a HTML document that allows you to select characters from
	various scripts and put together the "plain" form of that URL
	(either %-escaped UTF-8 or UTF-7 with the +- notation). This
	definitely answers the question "what would I do if I were
	on a computer without Japanese input capability but had a
	Japanese newspaper with a Japanese URL that I wanted to
	download?" (and maybe even look at, although it is rare
	that I will be able to view Japanese but not input it).
	The problem that remains is that we cannot expect
	someone completely unfamiliar with Han characters to be able
	to input some Japanese, even with lots of nice menus and charts.
	Note that this is probably not true for c-cedilla: it should
	be possible to guide a user through a process of saying
	"well, it looks almost like a c, please show me what you have
	that looks like a c" and then selecting the c-cedilla.
	Nevertheless, I of course think that in an English article
	referencing an English page, it should still be Francois
	without the cedilla.
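
To make the two "plain" forms concrete, here is a minimal sketch in
Python. It assumes the typed text is already available as Unicode on
the local system. The function names are mine, chosen only for
illustration, and which of the two forms an IURL should use is still
open.

```python
from urllib.parse import quote, unquote


def to_percent_utf8(text: str) -> str:
    """Map text to UTF-8 octets, then %-escape the non-ASCII octets."""
    # quote() encodes with UTF-8 by default; "/" is left unescaped here.
    return quote(text, safe="/")


def to_utf7(text: str) -> str:
    """Map text to UTF-7, which wraps non-ASCII runs in the +...- notation."""
    return text.encode("utf-7").decode("ascii")


if __name__ == "__main__":
    name = "Fran\u00e7ois"  # c-cedilla is U+00E7

    escaped = to_percent_utf8(name)
    print(escaped)                   # Fran%C3%A7ois
    print(unquote(escaped) == name)  # the mapping is reversible

    # The UTF-7 form is also pure ASCII and decodes back to the same text.
    print(to_utf7(name))
```

Either form can be typed on an ASCII-only keyboard. The point of a
uniform mapping is that every computer produces the same octets for
the same characters.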

>> I think you're suggesting that newspapers should print "Franc,ois" and
>> those who do not have c-cedilla on their keyboard should know how to
>> translate such a thing into the appropriate %xx code.  Is this really
>> a reasonable suggestion?
>Probably not.  Newspapers and such who want to make sure anyone can 
>*type* in an URL should use ASCII-only one, either the %XX form or an 
>ASCII-only URL made up for the circumstances.  To me, this is not a 
>good enough reason to *forbid* the use of richer URLs in other 
>contexts.  They are especially needed for queries.

As I said above, a French newspaper, or even an English newspaper
referring to a French document, could do this. In the future, we should
expect all computers to have something like a "universal keyboard
mechanism" allowing the selection of such characters.

To move forward towards IURLs, I see the following steps:
- Continue the present discussion to get a common understanding.
- Check on internationalization strategies for various parts of an URL
	(service, domain name, path, newsgroup, mail address,...)
	because this affects what internationalization scheme should be
	used.
- Do some more discussion on UTF-8 vs. UTF-7.
- Start to write some text on how ambiguities should be avoided.
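
As a starting point for the last step, the following Python sketch
shows two of the ambiguities mentioned earlier: a precomposed
c-cedilla versus c followed by a combining cedilla, and the
Latin/Greek/Cyrillic "A". Using Unicode normalization form NFC as the
canonical form is an assumption made for illustration, not something
this discussion has settled, and the helper name is hypothetical.

```python
import unicodedata
from urllib.parse import quote


def escaped(text: str, normalize: bool = False) -> str:
    """Return the %-escaped UTF-8 form, optionally normalizing to NFC first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return quote(text)


if __name__ == "__main__":
    precomposed = "Fran\u00e7ois"  # single character U+00E7
    decomposed = "Franc\u0327ois"  # "c" + U+0327 COMBINING CEDILLA

    # Identical on paper, different octets on the wire.
    print(escaped(precomposed))  # Fran%C3%A7ois
    print(escaped(decomposed))   # Franc%CC%A7ois

    # With normalization, both spellings map to the same octets.
    print(escaped(decomposed, normalize=True) == escaped(precomposed))  # True

    # Lookalike letters stay distinct; normalization does not merge them.
    for letter in ("A", "\u0391", "\u0410"):  # Latin, Greek, Cyrillic
        print(unicodedata.name(letter), escaped(letter, normalize=True))
```

The first case can be handled mechanically once a canonical form is
agreed on. The second cannot, so the text would have to say how
lookalike letters are treated.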

Regards,	Martin.
