Re: Character encoding information of Http Request

From: Chris Haynes <chris@harvington.org.uk>
Date: Wed, 13 Sep 2006 21:52:59 +0100
Message-ID: <05d101c6d776$98aa7ae0$0600000a@john>
To: <www-international@w3.org>

"Bjoern Hoehrmann" commented:

> * Richard Ishida wrote:
>>If you are aware of current or future standard to retrieve the
>>character encoding from the browser, please let me know.
> "HTTP request" is rather broad, there are many places in a HTTP requests
> where character information is conveyed to web applications, including
> the request resource identifier, individual header fields, and the body
> of the message. Clients would only need to convey encoding information
> if it is possible that different clients use different encodings. This
> is typically avoided (through means such as picking the page encoding
> properly, or the accept-charset attribute for HTML forms). Where this is
> insufficient, http://www.w3.org/TR/web-forms-2/#the-charset can be used
> in some browsers. draft-hoehrmann-urlencoded-00 is an upcoming Internet-
> Draft that helps removing the need to convey the character encoding to
> web applications. I am not sure there are cases where you need _charset_
> if you want UTF-8 submissions, though.
> -- 

There is a very specific problem that needs considering. I raised it at the time 
that IRIs were being developed, but was informed that it was not the time/place 
to consider the issue.

Here is the specific problem:

Consider HTTP requests being made using GET with parameter values being 
URL-encoded in the request line.

e.g. http://www.example.com?name=fr%hh%HH

%hh and %HH represent a pair of octet values.

As far as I know, there is no current RFC-specified method for determining the 
encoding which these two octets represent.

The 'default' behaviour of the majority of early web servers was to _assume_ 
that these two octets represented two characters in a 'local' character set 
having 256 code point values, typically ISO 8859-1.

Many modern systems would prefer the text data to be transmitted in UTF-8, so as 
to be able to represent the full Unicode character set. As Bjoern Hoehrmann 
indicates, it is easy to instruct the browser to encode all text from an HTML 
FORM using UTF-8.
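
To make the ambiguity concrete, here is a minimal Python sketch. It assumes, purely for illustration, that %hh and %HH are %C3 and %A9; the original example does not say which octets they are. The same two octets read as two characters under ISO 8859-1 and as one character (é) under UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical parameter value: %C3 and %A9 stand in for %hh and %HH.
raw_value = "fr%C3%A9"

# The percent-escapes only identify octets, not characters.
octets = unquote_to_bytes(raw_value)

# The same octets, interpreted under the two encodings discussed here.
as_latin1 = unquote(raw_value, encoding="iso-8859-1")
as_utf8 = unquote(raw_value, encoding="utf-8")

# ascii() shows the code points without depending on the terminal encoding.
print(octets)            # b'fr\xc3\xa9'
print(ascii(as_latin1))  # 'fr\xc3\xa9' (4 characters)
print(ascii(as_utf8))    # 'fr\xe9'     (3 characters)
```

Nothing in the request line itself says which of the two readings the sender intended.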

But how does the web server know that this has been done?

One needs to understand a common architecture of modern web servers, such as 
those which support multiple Web Applications. These Web Applications may be 
developed and deployed by wholly independent entities and hosted by a common web 
server, whose operator knows nothing of the design rules adopted by the 
individual Web Application developers.

One web application developer may be happy with the default ISO 8859-1 encoding, 
another may explicitly require all her FORMs to encode text in UTF-8.

The problem is that the web server (the Web Application 'container') typically 
parses and decodes the URL-encoded request parameters in a common way. The 
container has no way of knowing (from any information contained in the request) 
which encoding has been used in any specific inbound request.

It is possible for the Web Application to intervene in the process, and force the 
container to use a specific decoding before parsing the parameters, but this 
requires:
a) Special, non-obvious programming which is state-sensitive,
b) The designers of the HTML pages containing the FORMs and the designers of the 
program analysing the HTTP request parameters to conform absolutely to a 
common design rule - not always culturally reliable across time and space.
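
As a rough sketch of what such an intervention amounts to, the Python below decodes the raw query string with an encoding fixed by each application's own design rule. The APP_ENCODINGS table, the application paths, and decode_query are hypothetical names for illustration, not any real container's API. The point is that the encoding comes from out-of-band agreement, not from the request.

```python
from urllib.parse import parse_qs

# Hypothetical per-application design rules. The request carries none of this;
# it has to be agreed between the FORM authors and the code reading the
# parameters.
APP_ENCODINGS = {
    "/legacy": "iso-8859-1",
    "/modern": "utf-8",
}

def decode_query(app_path, raw_query):
    """Decode URL-encoded parameters using the application's agreed encoding."""
    encoding = APP_ENCODINGS[app_path]
    # errors="strict" makes octets that are invalid in that encoding raise
    # UnicodeDecodeError instead of being silently replaced.
    return parse_qs(raw_query, encoding=encoding, errors="strict")

# The same octets give different text depending on which rule is applied.
print(ascii(decode_query("/modern", "name=fr%C3%A9")))  # {'name': ['fr\xe9']}
print(ascii(decode_query("/legacy", "name=fr%C3%A9")))  # {'name': ['fr\xc3\xa9']}
```

If a FORM author breaks the rule and submits ISO 8859-1 octets (for example name=fr%E9) to the UTF-8 application, the strict decode raises an error; in the opposite direction the mistake goes unnoticed, because every octet sequence is valid ISO 8859-1.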

So a _specific_ problem, within the space identified by Nelson Ng of e-Bay, is 
the need to be able to communicate the character encoding used for the 
URL-encoded parameters of a GET request.
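
One way the encoding could travel with the request is the _charset_ form field mentioned in the quoted text, which supporting browsers fill in with the encoding used for the submission. The sketch below is only an assumption about how a receiver might use it: decode_with_charset_hint and its ISO 8859-1 fallback are illustrative choices, and it only helps when the FORM includes the field and the browser supports it.

```python
from urllib.parse import parse_qs

def decode_with_charset_hint(raw_query, default="iso-8859-1"):
    """Decode parameters using a _charset_ field if the request carries one."""
    # First pass: ISO 8859-1 maps every octet to a character, so this never
    # fails, and the ASCII-only _charset_ value comes through unchanged.
    first_pass = parse_qs(raw_query, encoding="iso-8859-1")
    encoding = first_pass.get("_charset_", [default])[0]
    try:
        # Second pass: decode everything with the encoding the sender declared.
        return parse_qs(raw_query, encoding=encoding, errors="strict")
    except (LookupError, UnicodeDecodeError):
        # Unknown encoding label, or octets that contradict it: fall back.
        return parse_qs(raw_query, encoding=default)

hinted = decode_with_charset_hint("name=fr%C3%A9&_charset_=UTF-8")
unhinted = decode_with_charset_hint("name=fr%E9")
print(ascii(hinted["name"]))    # ['fr\xe9']
print(ascii(unhinted["name"]))  # ['fr\xe9']
```

Without the field, the receiver is back to assuming a default, which is exactly the situation described above.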

I'll be interested to see if draft-hoehrmann-urlencoded-00 addresses this.

Chris Haynes 
Received on Wednesday, 13 September 2006 20:53:17 UTC
