
Re: URLs and double byte characters (unicode)

From: Andrew Clover <and@doxdesk.com>
Date: Tue, 24 Dec 2002 18:40:43 +0000
To: www-talk@w3.org
Message-ID: <20021224184043.GA30514@doxdesk.com>

> (is this accepted in the specs?), e.g.:

> http://localhost/é

No, I don't believe it is. This must be %-encoded in the document.

Browsers may do whatever they like in this situation. Opera uses UTF-8,
Mozilla uses the current page's encoding, and IE/Win may do either
depending on the 'Always send URLs as UTF-8' setting.
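To see why this matters, here is a small sketch (modern Python 3, which
obviously postdates this thread) of how the same character produces two
different URLs depending on which encoding the browser picks:

```python
from urllib.parse import quote

# The same character é %-encodes differently under each choice:
print(quote('é', encoding='utf-8'))    # UTF-8 bytes  -> %C3%A9
print(quote('é', encoding='latin-1'))  # Latin-1 byte -> %E9
```

A server receiving `%E9` versus `%C3%A9` has no way to tell, from the
request alone, which character was intended.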

GET forms used to submit non-US-ASCII characters are also supposed to be
bad, because there is no way to specify what the incoming encoding is.
In practice, though, there is no benefit to using a POST form instead,
because every browser I know of omits the encoding information even in
POSTs: there is no charset parameter on the request's Content-Type
header, and in the case of multipart/form-data POSTs, no charset on the
Content-Type headers of the separate parts.

So you have to know the encoding of the incoming request in all cases.
This should come from the accept-charset attribute of the form element,
using the page's encoding as a default. However, again, no browser I have
met honours accept-charset; all use the page encoding.
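In other words, the server side is reduced to guessing. A minimal Python 3
sketch of that guess (the field name, the raw query string, and the choice
of ISO-8859-1 are all made up for illustration):

```python
from urllib.parse import unquote_to_bytes

# Assumption: the bytes are in the encoding of the page that served
# the form, since the request itself never says.
PAGE_ENCODING = 'iso-8859-1'

raw = b'name=%E9'  # hypothetical GET submission
key, _, value = raw.partition(b'=')
decoded = unquote_to_bytes(value).decode(PAGE_ENCODING)
print(decoded)  # 'é' -- but only if our guess about the encoding was right
```

If the page had in fact been served as UTF-8, the same code would decode
the bytes to the wrong character, silently.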

Even more annoyingly, the Content-Disposition headers browsers pass in
multipart/form-data POSTs are wrong: they specify a field name with quotes
around it (without encoding any out-of-range characters inside the
quotes), instead of using RFC 2047 encoding.
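For comparison, here is what an RFC 2047 encoded word looks like, using
Python 3's stdlib email machinery (the word 'café' is just an example):

```python
from email.header import Header

# Browsers send the raw bytes inside the quotes, e.g.:
#   Content-Disposition: form-data; name="café"
# RFC 2047 would instead wrap the word in an encoded-word token
# that names its charset:
print(Header('café', charset='utf-8').encode())
```

The encoded form carries the charset label ('=?utf-8?...?=') that the raw
quoted form throws away.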

And it's probably not possible to change this sort of behaviour now
without breaking things. :-(

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
Received on Tuesday, 24 December 2002 13:52:05 GMT
