[whatwg] Re: form charset from Peter Karlsson on 2005-04-20 (public-whatwg-archive@w3.org from April 2005)

From: Peter Karlsson <peter@opera.com>
Date: Wed, 20 Apr 2005 08:01:19 +0100 (CET)
Message-ID: <Pine.LNX.4.62.0504200756080.4030@peter.oslo.opera.com>

Olav Junker Kjær on 2005-04-20:

> However, is it really the right thing to allow arbitrary encodings of GET 
> queries in the first place? The official Right Way to encode URLs is to 
> use Utf8, and it seems strange to allow a different encoding after the 
> question mark.

Strange as it may seem, that's the way it is currently done. HTML 4.01 says 
that the character encoding of any form data should be the document character 
encoding, unless there is an accept-charset attribute on the form stating 
otherwise. This means that you do need to handle the part of the URL after 
the first question mark differently from the part before it (but then 
again, you also need to handle the domain name differently from the path 
component, so this segmentation isn't that unexpected).
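
To make the split concrete, here is a minimal Python sketch of the behaviour 
described above. The helper name build_form_url and its parameters are made up 
for illustration, and it assumes accept-charset names a single encoding (the 
attribute may in fact list several). It is not how any particular browser is 
implemented.

```python
from urllib.parse import quote, urlencode


def build_form_url(path, fields, page_charset, accept_charset=None):
    """Hypothetical helper: encode the path and the query separately.

    The path is percent-encoded as UTF-8. The query uses the form's
    accept-charset if one is given, otherwise the document encoding.
    """
    query_charset = accept_charset or page_charset
    encoded_path = quote(path, safe="/", encoding="utf-8")
    encoded_query = urlencode(fields, encoding=query_charset)
    return f"{encoded_path}?{encoded_query}"


# A form on an ISO-8859-1 page without accept-charset:
url = build_form_url("/søk", {"q": "blåbær"}, page_charset="iso-8859-1")
print(url)  # /s%C3%B8k?q=bl%E5b%E6r
```

The same characters end up as different bytes on either side of the question 
mark: "ø" in the path becomes %C3%B8, while "å" in the query becomes %E5.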

This is usually not a problem until you find something like this embedded in 
a search page (where "{chinese}" is the Chinese search text you just entered 
in the search field):

   <a href="/search?q={chinese}">Next &gt;</a>

And yes, this very much does exist in the wild.
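
As a rough illustration of why such a link is troublesome, the following Python 
sketch encodes one sample string under two encodings. The sample text and the 
choice of GB2312 as the page encoding are assumptions for the example only. A 
plain link has no accept-charset to consult, so the receiver sees only the 
bytes and has to guess which encoding produced them.

```python
from urllib.parse import quote, unquote

text = "中文"  # sample search text, assumed for this example

# The same two characters yield different percent-encoded bytes.
as_gb2312 = quote(text, encoding="gb2312")
as_utf8 = quote(text, encoding="utf-8")
print(as_gb2312)  # %D6%D0%CE%C4
print(as_utf8)    # %E4%B8%AD%E6%96%87

# Decoding with the encoding that was actually used recovers the text.
right = unquote(as_gb2312, encoding="gb2312")

# Decoding GB2312 bytes as UTF-8 does not; invalid bytes are replaced.
wrong = unquote(as_gb2312, encoding="utf-8", errors="replace")
print(right == text, wrong == text)  # True False
```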

> Of course we cannot just mandate utf8 always, since there is the issue of 
> backwards compatibility. If I'm not mistaken, browsers usually urlencode 
> forms using the same charset as the page.

Correct.

> However, the only legal value in accept-charset should be utf8 when the 
> method is GET.

UTF-8 and US-ASCII, probably.

-- 
\\//
Peter, software engineer, Opera Software

  The opinions expressed are my own, and not those of my employer.
  Please reply only by follow-ups on the mailing list.

Received on Wednesday, 20 April 2005 00:01:19 UTC