Re: Internationalization and URLs
Just some comments on several of the preceding messages:
In some cases, and the problem of URLs seems to be such a case,
extending a design to internationalization forces us to revisit
the underlying model. The (almost) uniform URL tried not to make
any distinction between what appeared on paper and what was typed
into the computer. We find that this model may not satisfy user
needs and may have to change.
>> From: Larry Masinter <firstname.lastname@example.org>
>> Date: Mon, 1 Jul 1996 21:19:59 PDT
>> On the issue of "non-Uniform URLs", you've pointed out that some URLs
>> have alternate forms that are equally valid, including both with and
>> without the %xx encoding. This may be true, but it still is the case
>> that <<what is printed on paper>> can be typed by everyone who sees it
>> without them actually knowing anything about character set encodings.
There are two points here:
- Not to know anything about character sets. This is very important.
Although I generally think that character sets and encodings are
not that difficult, even most experts in IETF groups get
them mixed up from time to time, so we should not expect the
general public to understand them.
For this to be true, it is
important to introduce a uniform way of mapping Unicode/ISO10646
to the octets of the internal IURL representation, so that on any
computer, after typing something in, it is clear how to do the
mapping and send the stuff over the line.
Even so, there are some cases that might be dubious if not
specified clearly, e.g. some Han compatibility ideographs,
ligatures, equivalence of combining characters, bidi,
Latin/Greek/Cyrillic "A", etc.
- Not to be able to type the character. This is definitely not an issue
for the query part. For the other parts, it is mostly the question
of: If a URL for a document in Japanese appears in a Japanese
newspaper, is there really much harm if that URL is in
Japanese, too? Put the other way round, what is the benefit
in this situation of having the URL in such exotic characters
Input per se is never really a problem: it is very easy to build
an HTML document that allows you to select characters from
various scripts and put together the "plain" form of that URL
(either %-escaped UTF-8 or UTF-7 with the +- notation). This
definitely solves the question of "what would I do if I were
on a computer without Japanese input capability but had a
Japanese newspaper with a Japanese URL that I wanted to
download?" (and maybe even look at, although it is rare
that I will be able to view Japanese but not input it).
The problem that remains is that we cannot expect someone
completely ignorant of Han characters to be able to input
some Japanese, even with lots of nice menus and charts.
Note that this is probably not true for c-cedilla: it should
be possible to guide a user through a process of saying
"well, it looks almost like a c, please show me what you have
that looks like a c" and then selecting the c-cedilla.
Nevertheless, of course I think that in an English article
referencing an English page, it should still be Francois
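To make the two "plain" forms mentioned above a bit more concrete,
here is a small sketch in Python. It is only an illustration under
my own assumptions, not a proposal for the IURL mapping itself: the
function names to_plain_utf8 and to_plain_utf7 are made up for this
example, and I rely on the standard library's urllib.parse.quote and
its "utf-7" codec to stand in for whatever a real client would do.

```python
from urllib.parse import quote, unquote


def to_plain_utf8(text: str) -> str:
    """Encode text as UTF-8 and %-escape every non-ASCII octet.

    Hypothetical helper: '/' is left unescaped so that path
    separators survive; everything else follows quote() defaults.
    """
    return quote(text, safe="/", encoding="utf-8")


def to_plain_utf7(text: str) -> str:
    """Encode text as UTF-7, which yields an ASCII-only string
    using the +...- notation for non-ASCII characters.

    Hypothetical helper: uses Python's built-in 'utf-7' codec.
    """
    return text.encode("utf-7").decode("ascii")


name = "Fran\u00e7ois"      # c-cedilla as a single code point
japanese = "\u65e5\u672c"   # two Han characters

utf8_name = to_plain_utf8(name)          # 'Fran%C3%A7ois'
utf8_japanese = to_plain_utf8(japanese)  # '%E6%97%A5%E6%9C%AC'
utf7_name = to_plain_utf7(name)

# Both forms are pure ASCII and can be mapped back to the original.
assert unquote(utf8_name) == name
assert utf7_name.encode("ascii").decode("utf-7") == name
```

Either form can be typed on an ASCII-only keyboard and converted back
without loss; the difference is in how readable and how long the
result is, which is part of the UTF-8 vs. UTF-7 question.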
>> I think you're suggesting that newspapers should print "Franc,ois" and
>> those who do not have c-cedilla on their keyboard should know how to
>> translate such a thing into the appropriate %xx code. Is this really
>> a reasonable suggestion?
>Probably not. Newspapers and such who want to make sure anyone can
>*type* in an URL should use ASCII-only one, either the %XX form or an
>ASCII-only URL made up for the circumstances. To me, this is not a
>good enough reason to *forbid* the use of richer URLs in other
>contexts. They are especially needed for queries.
As I said above, a French newspaper, or even an English newspaper
referring to a French document, could do this. In the future, we should
expect all computers to have something like a "universal keyboard
mechanism" allowing the selection of such characters.
For future progress towards IURLs, I see the following steps:
- Continue the present discussion to get a common understanding.
- Check on internationalization strategies for various parts of a URL
(service, domain name, path, newsgroup, mail address,...)
because this affects what internationalization scheme should be
- Do some more discussion on UTF-8 vs. UTF-7.
- Start to write some text on how ambiguities should be avoided.
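As a starting point for the last item, here is a small Python sketch
of two of the ambiguities mentioned earlier: the Latin/Greek/Cyrillic
"A", and the equivalence of combining characters. It is an
illustration only. In particular, the use of Unicode normalization
(form NFC, via unicodedata.normalize) is my assumption about one
possible way to remove the second ambiguity, not something agreed in
this discussion, and all variable names are made up for the example.

```python
import unicodedata
from urllib.parse import quote

# Three characters that look the same on paper but are distinct
# code points, so they map to different octets on the wire.
latin_a = "\u0041"     # LATIN CAPITAL LETTER A
greek_a = "\u0391"     # GREEK CAPITAL LETTER ALPHA
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A

escaped_lookalikes = [quote(ch) for ch in (latin_a, greek_a, cyrillic_a)]
# -> ['A', '%CE%91', '%D0%90']

# Two ways of typing what prints as the same name:
precomposed = "Fran\u00e7ois"   # c-cedilla as one code point
decomposed = "Franc\u0327ois"   # 'c' followed by COMBINING CEDILLA

# Without further rules the two spellings give different URLs.
escaped_precomposed = quote(precomposed)  # 'Fran%C3%A7ois'
escaped_decomposed = quote(decomposed)    # 'Franc%CC%A7ois'

# Assumption: normalizing to NFC before encoding makes them agree.
normalized = unicodedata.normalize("NFC", decomposed)
assert normalized == precomposed
assert quote(normalized) == escaped_precomposed
```

Note that normalization does nothing for the look-alike "A" case:
those remain three different characters, so that ambiguity has to be
handled by other means (or simply documented).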