UTF-8 in URIs from Xuan Baldauf on 2000-10-08 (uri@w3.org from October 2000)

From: Xuan Baldauf <xuan--uri@baldauf.org>
Date: Mon, 09 Oct 2000 01:27:17 +0200
To: uri@w3c.org
Message-ID: <39E102D5.4A08112D@baldauf.org>
Hello,

I'm new to this list (and therefore maybe this issue has been flamed to
death or is an FAQ), but I could not find anything in the list archives
on the following issue:

The use of "UTF-8 over %HH" as default encoding for non-ASCII characters
as recommended in memo RFC2718 Section 2.2.5 is a BadThing(tm) and
strongly against "common practice".

Many applications (including current Netscape Communicator and Internet
Explorer) tend to assume that every byte above 127 is Latin-1 or even
cp1252. This assumption will become wrong when HTTP clients start to use
the byte space of Latin-1 for UTF-8. That's why I implemented Unicode in
my project, on both the client and the server side, as a UTF-16 encoding
of the form

%uHHHH

where each H is a hex digit and the four digits, most significant first
(big endian), make up one UTF-16 value. Because in the old escape form
%HH, H can only be in the range ['0'..'9','A'..'F','a'..'f'], we have
plenty of namespace which is currently undefined (from %gXXXX to
%zXXXX).
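
A minimal Python sketch of the %uHHHH form described above, for
illustration only. The function names escape_u and unescape_u are
hypothetical, and two details are assumptions not stated in this
message: a literal '%' is written as %25, and an old-style %HH escape is
read as a Latin-1 code point when decoding.

```python
import re

def escape_u(text: str) -> str:
    """Write every non-ASCII character as %uHHHH UTF-16 values (big endian)."""
    out = []
    for ch in text:
        if ch == "%":
            out.append("%25")  # assumption: keep a literal '%' unambiguous
        elif ord(ch) < 0x80:
            out.append(ch)     # ASCII passes through; reserved characters are out of scope
        else:
            data = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
            for i in range(0, len(data), 2):
                out.append("%u{:02X}{:02X}".format(data[i], data[i + 1]))
    return "".join(out)

_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def unescape_u(escaped: str) -> str:
    """Decode %uHHHH escapes; old-style %HH is read as Latin-1 (assumption)."""
    units = bytearray()  # accumulated UTF-16-BE bytes
    pos = 0
    for m in _ESCAPE.finditer(escaped):
        units += escaped[pos:m.start()].encode("utf-16-be")
        if m.group(1):
            units += bytes.fromhex(m.group(1))       # one UTF-16 value
        else:
            units += bytes([0, int(m.group(2), 16)])  # Latin-1 code point
        pos = m.end()
    units += escaped[pos:].encode("utf-16-be")
    return units.decode("utf-16-be")

print(escape_u("Xu\u00e2n"))        # Xu%u00E2n
print(unescape_u("Xu%u00E2n"))
```

Characters outside the 16-bit range come out as two %uHHHH escapes (a
surrogate pair), which the decoder joins again.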

Servers which do not understand this must reply with "400 Bad Request"
(RFC2616), so the client knows it should retry with Latin-1 or signal
the user that the server is not capable of Unicode-URLs.
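
The retry behaviour can be sketched as follows in Python. This is only
an illustration under assumptions: request_with_fallback and escape_u
are hypothetical names, and send stands for whatever function the client
uses to issue a request and obtain the HTTP status code.

```python
from typing import Callable, Optional
from urllib.parse import quote

def escape_u(text: str) -> str:
    """Hypothetical helper: non-ASCII characters become %uHHHH (UTF-16, big endian)."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            data = ch.encode("utf-16-be")
            for i in range(0, len(data), 2):
                out.append("%u{:02X}{:02X}".format(data[i], data[i + 1]))
    return "".join(out)

def request_with_fallback(path: str, send: Callable[[str], int]) -> Optional[int]:
    """Try the %uHHHH form first; on 400, retry with Latin-1 %HH escapes.

    Returns the final status code, or None when the server rejected
    %uHHHH and the path cannot be expressed in Latin-1, so the user
    has to be told that the server is not capable of Unicode URLs.
    """
    status = send(escape_u(path))
    if status != 400:
        return status
    try:
        legacy = quote(path, encoding="latin-1")  # strict: raises if not representable
    except UnicodeEncodeError:
        return None
    return send(legacy)

# Stand-in for an old server that rejects the unknown %u escape.
def old_server(url: str) -> int:
    return 400 if "%u" in url else 200

print(request_with_fallback("/chat/Xu\u00e2n", old_server))
```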

But the current behaviour makes it ambiguous whether %da%bf represents
U+06BF (ARABIC LETTER TCHEH WITH DOT ABOVE) (new understanding of
URL-escape) or whether it represents "Ú¿" (U+00DA, U+00BF) (old
understanding of URL-escape).
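
The ambiguity can be reproduced with Python's standard library. The
snippet below is only an illustration: the same two escaped bytes give
one character (U+06BF) when read as UTF-8 and two characters when read
as Latin-1, and nothing in the URL says which reading was intended.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%da%bf")    # the two bytes 0xDA 0xBF

as_utf8 = raw.decode("utf-8")       # new understanding: one character
as_latin1 = raw.decode("latin-1")   # old understanding: two characters

print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])
```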

This ambiguity is an order of magnitude worse than a "400 Bad Request"
which can be retried. Please think about this comment and imagine a
world where servers do not understand clients and vice versa because
one interprets URLs with the Latin-1 charset, the next one with the
UTF-8 charset. Using the unused namespace after '%', like %uHHHH,
provides a clean transition from old conventions to new ones, not a
collision where conventions compete and the competition case is the
failure case.

This cleanup is not needed so strongly for "common" web sites, because
most web sites have the ability to choose the filenames used and
therefore can choose whether to require Unicode support or not. But it
is needed for internationalized forms, because the most common (and most
efficient) content-type for POST requests is x-www-form-urlencoded, and
GET requests obviously are URL-encoded.
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 defines
content-type x-www-form-urlencoded to be only ASCII-capable; this is not
necessary and this is not current common practice. Content-type
multipart/form-data could be a solution, but it is largely inefficient,
not supported by current browsers and not usable for GET requests: it
requires POST, and POST requests cannot be represented by URLs, while
URLs|URIs often are the only way to efficiently specify a resource.

The concrete problem I have is a web chat server which enables the user
to input his|her chat comments into ordinary <INPUT TYPE=TEXT> input
lines. Browsers send data encoded as Latin-1 and cp1252, so the chat
server is by definition unable to satisfy both the requirement to
interpret URLs as UTF-8 and the practical requirement to interpret URLs
as Latin-1|cp1252, because UTF-8 and Latin-1 are byte-incompatible.

I hope that illustrated the problem and the proposed solution. :-)

Xuân.
Received on Sunday, 8 October 2000 19:26:40 UTC