- From: Xuan Baldauf <xuan--uri@baldauf.org>
- Date: Mon, 09 Oct 2000 01:27:17 +0200
- To: uri@w3c.org
Hello,

I'm new to this list (so this issue may already have been flamed to death, or be an FAQ), but I could not find anything in the list archives on the following issue: the use of UTF-8 over %HH as the default encoding for non-ASCII characters, as recommended in RFC 2718, section 2.2.5, is a BadThing(tm) and strongly against common practice.

Many applications (including the current Netscape Communicator and Internet Explorer) assume that every byte above 127 is Latin-1 or even cp1252. This assumption breaks as soon as HTTP clients start using the byte space of Latin-1 for UTF-8.

That is why, in my project, I implemented Unicode on both the client and the server side as a UTF-16 encoding of the form %uHHHH, where each H is a hex digit and the four digits, big endian, make up one UTF-16 value. Because in the old escape form %HH, H can only be in the range ['0'..'9','A'..'F','a'..'f'], we have plenty of namespace which is currently undefined (from %gXXXX to %zXXXX). A server which does not understand this must reply with "400 Bad Request" (RFC 2616), so the client knows it should retry with Latin-1 or signal to the user that the server is not capable of Unicode URLs.

The current behaviour, by contrast, makes it ambiguous whether %da%bf represents U+06BF (ARABIC LETTER TCHEH WITH DOT ABOVE), under the new understanding of URL escapes, or "Ú¿" (U+00DA, U+00BF), under the old understanding. This ambiguity is an order of magnitude worse than a "400 Bad Request", which can simply be retried.

Please think about this comment and imagine a world where servers do not understand clients and vice versa, because one interprets URLs with the Latin-1 charset and the next one with the UTF-8 charset. Using the unused namespace after '%', as %uHHHH does, provides a clean transition from the old convention to the new one, not a collision where the conventions compete and the competition case is the failure case.

This cleanup is not needed so urgently for "common" web sites, because most web sites can choose the filenames they use and can therefore choose whether to require Unicode support. But it is needed for internationalized forms, because the most common (and most efficient) content type for POST requests is application/x-www-form-urlencoded, and GET requests obviously are URL-encoded. http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 defines the content type application/x-www-form-urlencoded as ASCII-only; this is not necessary, and it is not current common practice. The content type multipart/form-data could be a solution, but it is largely inefficient, not supported by current browsers, and not usable for GET requests, because POST requests cannot be represented by URLs, and URLs|URIs are often the only way to efficiently specify a resource.

The concrete problem I have is a web chat server which lets the user type his|her chat comments into ordinary <INPUT TYPE=TEXT> input lines. Browsers send the data encoded in Latin-1 or cp1252, and the chat server is by definition unable to satisfy both the practical requirement to interpret URLs as UTF-8 and the requirement to interpret them as Latin-1|cp1252, since UTF-8 and Latin-1 are byte-incompatible.

I hope that illustrates the problem and the proposed solution. :-)

Xuân.
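
P.S.: To make the ambiguity concrete, here is a small sketch (Python, purely illustrative; this is not my chat-server code) decoding the two octets behind "%da%bf" under both understandings:

  raw = bytes([0xDA, 0xBF])     # the octets escaped as "%da%bf"

  # Old understanding: each %HH octet is one Latin-1 character.
  print(raw.decode("latin-1"))  # 'Ú¿' (U+00DA, U+00BF)

  # New understanding (RFC 2718, 2.2.5): the octets form one UTF-8 sequence.
  print(raw.decode("utf-8"))    # U+06BF, ARABIC LETTER TCHEH WITH DOT ABOVE

One byte sequence, two valid readings; the receiver has no way to tell which one the sender meant.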
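
P.P.S.: And a sketch of the proposed escape itself, under the same assumptions (illustrative names, Python): one %uHHHH group per UTF-16 code unit, so a character outside the BMP simply becomes two groups (a surrogate pair).

  def escape_u(text):
      out = []
      for ch in text:
          if ord(ch) <= 127:
              # ASCII passes through; real code would of course still
              # escape reserved characters such as '%', '&' and '='.
              out.append(ch)
          else:
              units = ch.encode("utf-16-be")   # big endian, as proposed
              for i in range(0, len(units), 2):
                  out.append("%%u%02X%02X" % (units[i], units[i + 1]))
      return "".join(out)

  print(escape_u("\u06bf"))     # '%u06BF' -- unambiguous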
Received on Sunday, 8 October 2000 19:26:40 UTC