Re: Character encoding of HTTP requests and responses

I'm going to try and answer these questions, but these are not
authoritative answers. You may want to look at the RFC on the
Internationalization of HTML and recent drafts of the HTTP/1.1 spec, and
some other stuff that came out of the html and http working groups of the
IETF. See:

http://www.ics.uci.edu/pub/ietf/html/
http://www.ics.uci.edu/pub/ietf/http/

for a useful archive of the working documents and products of both groups.

The basic things to understand are:

(1) HTTP uses a lot of protocol mechanisms from MIME, but assumes, unlike
MIME, that the data path is 8-bit clean.

(2) HTTP and HTML have historically defaulted to ISO-8859-1, but both specs
have been improved in later revisions to accommodate any charset/character
encoding.

(3) There are extra issues in HTML regarding the SGML "document character
set", as opposed to the character encoding. I'm not going to try and say
much about this except to note that they become fairly trivial for Unicode,
because the strategy taken to internationalize HTML was basically to use
ISO-10646 as the SGML document character set _regardless_ of the character
encoding/charset used to transmit or store the HTML.

>Specifically, some questions I have are:
>
>1.Do HTTP clients (browsers) post request body content in their native
>encoding and specify the content encoding in one of the requests
>headers?

HTTP requests and replies consist of a set of headers and an optional body.
(POST has a body, GET does not).

The headers are in a fairly stereotyped form that is always restricted to
US-ASCII, with the exception of some headers that can contain free text. I
think text in headers is assumed to be in ISO-8859-1. I _think_ that text
in other character encodings can be represented via an encoding defined
in RFC 1522 and RFC 2047, but you have to read the grammar to see where
this is allowed.
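
As a rough illustration of what the RFC 2047 "encoded-word" form looks
like, here is a minimal Python sketch using the standard email.header
module. It only shows the encoding itself; it says nothing about which
HTTP headers actually permit it, and the sample text is my own.

```python
from email.header import Header, decode_header

# Sample text is an assumption for illustration only.
text = "Grüße"

# Encode the non-ASCII text as an RFC 2047 encoded-word in ISO-8859-1.
encoded = Header(text, charset="iso-8859-1").encode()

# decode_header returns a list of (bytes, charset) pairs.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)

print(encoded)
print(decoded)
```

The encoded value is pure US-ASCII, which is why it can travel in a
header field that is otherwise restricted to US-ASCII.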

The Accept-Charset: header in a request indicates what charsets/character
encodings are preferred. A parameter to Content-Type: indicates the
charset/character encoding of the response.

>2.If so, what is the standard header and its format?

Read the HTTP/1.1 spec for a better description.

Some examples:

Accept-Charset: iso-8859-5, unicode-1-1;q=0.8

Content-Type: text/html; charset=ISO-8859-5

However, you can't count on all clients to label charsets in all requests, yet.
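
To make the two header formats concrete, here is a minimal Python sketch
that pulls the charset parameter out of a Content-Type value and orders
an Accept-Charset value by its q values. The function names are made up
for this example, and the parsing is simplified compared to the full
HTTP/1.1 grammar.

```python
from email.message import Message


def charset_of(content_type, default=None):
    """Return the lowercased charset parameter, or default if unlabeled."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def parse_accept_charset(value):
    """Return (charset, q) pairs, highest q first; q defaults to 1.0."""
    prefs = []
    for item in value.split(","):
        name, _, params = item.strip().partition(";")
        if not name:
            continue
        q = 1.0
        for param in params.split(";"):
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        prefs.append((name.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pair: -pair[1])


print(charset_of("text/html; charset=ISO-8859-5"))
print(parse_accept_charset("iso-8859-5, unicode-1-1;q=0.8"))
```

Since not every client labels its charset, charset_of takes a default so
the caller has to decide what an unlabeled body means.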

>3.If a client posts a request body with a character encoding that is
>different than the Web servers native encoding, is the body passed to
>CGI or Servlet processing as-is?  That is, as an unaltered byte stream,
>or does the Web server convert to its own native encoding?

I don't know. The CGI spec was written before internationalization stuff
was standardized for HTTP/HTML. I'd guess it would be passed thru as-is.

>4.If passed as-is, is the standard HTTP header that specifies the
>encoding of the request's body made available to the processing CGI or
>Servlet by the Web server?
>5.If so, what CGI variable will it be in?

In CGI, all HTTP headers are mapped to variables with names starting with
HTTP_. If a Content-Type: header is there, I think you will get
HTTP_CONTENT_TYPE. In addition, CGI defines a CONTENT_TYPE variable, but I'm
not sure if it will include the parameters.
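
A CGI program that wants to cope with either case could look at
CONTENT_TYPE and fall back when no charset parameter is present. The
following Python sketch assumes a form-urlencoded POST body; the
function name and the ISO-8859-1 fallback are my assumptions, not
something the CGI spec promises.

```python
from email.message import Message
from urllib.parse import parse_qs


def decode_form(environ, body, fallback="iso-8859-1"):
    """Decode a form-urlencoded body using the charset in CONTENT_TYPE.

    environ is a mapping of CGI variables (os.environ in a real script)
    and body is the raw request body as bytes.
    """
    msg = Message()
    msg["Content-Type"] = environ.get("CONTENT_TYPE", "application/octet-stream")
    # Assumption: an unlabeled body is treated as the fallback charset.
    charset = msg.get_content_charset(fallback)
    # A well-formed urlencoded body is ASCII with %XX escapes; the escapes
    # are interpreted as bytes in the chosen charset.
    return parse_qs(body.decode("ascii"), encoding=charset)


labeled = {"CONTENT_TYPE": "application/x-www-form-urlencoded; charset=ISO-8859-5"}
unlabeled = {"CONTENT_TYPE": "application/x-www-form-urlencoded"}

print(decode_form(labeled, b"name=%C1"))
print(decode_form(unlabeled, b"name=%E9"))
```

In a real CGI script you would pass os.environ and read CONTENT_LENGTH
bytes from standard input to get the body.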

I tried experimenting with a GET/POST with Netscape and only determined
that it isn't labeling charset by default. Below are the environment
variables for a POST and a GET request with Netscape 4 and Apache 1.2.4.
I asked for Japanese, but I didn't actually change my default charset.

>SERVER_ADMIN=webmaster@www.nwu.edu
>QUERY_STRING=
>PATH=/usr/sbin:/usr/bin:/sbin
>SCRIPT_FILENAME=/nuinfo/httpd/staff-test-cgi/lunde/show-env
>REMOTE_PORT=1040
>REMOTE_HOST=socrates.tss.nwu.edu
>CONTENT_TYPE=application/x-www-form-urlencoded
>HTTP_HOST=www.nwu.edu
>GATEWAY_INTERFACE=CGI/1.1
>HTTP_REFERER=file:///Temp/RFCs/textform.html
>REQUEST_URI=/staff-test-cgi/lunde/show-env
>HTTP_CONNECTION=Keep-Alive
>SERVER_SOFTWARE=Apache/1.2.4
>REQUEST_METHOD=POST
>SERVER_NAME=www.nwu.edu
>HTTP_ACCEPT_CHARSET=iso-8859-1,*,utf-8
>HTTP_USER_AGENT=Mozilla/4.03 (Macintosh; U; PPC)
>CONTENT_LENGTH=17
>HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
>HTTP_ACCEPT_LANGUAGE=ja, en-US, fr-CA
>SCRIPT_NAME=/staff-test-cgi/lunde/show-env
>SERVER_PORT=80
>SERVER_PROTOCOL=HTTP/1.0
>REMOTE_ADDR=129.105.110.129
>TZ=CST6CDT
>DOCUMENT_ROOT=/nuinfo/httpd/htdocs

>SERVER_ADMIN=webmaster@www.nwu.edu
>QUERY_STRING=
>PATH=/usr/sbin:/usr/bin:/sbin
>SCRIPT_FILENAME=/nuinfo/httpd/staff-test-cgi/lunde/show-env
>REMOTE_PORT=1039
>REMOTE_HOST=socrates.tss.nwu.edu
>HTTP_HOST=www.nwu.edu
>GATEWAY_INTERFACE=CGI/1.1
>HTTP_REFERER=file:///Gaia/ac%20downloads/myweb/Default.html
>REQUEST_URI=/staff-test-cgi/lunde/show-env
>HTTP_CONNECTION=Keep-Alive
>SERVER_SOFTWARE=Apache/1.2.4
>REQUEST_METHOD=GET
>SERVER_NAME=www.nwu.edu
>HTTP_ACCEPT_CHARSET=iso-8859-1,*,utf-8
>HTTP_USER_AGENT=Mozilla/4.03 (Macintosh; U; PPC)
>HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
>HTTP_ACCEPT_LANGUAGE=ja, en-US, fr-CA
>SCRIPT_NAME=/staff-test-cgi/lunde/show-env
>SERVER_PORT=80
>SERVER_PROTOCOL=HTTP/1.0
>REMOTE_ADDR=129.105.110.129
>TZ=CST6CDT
>DOCUMENT_ROOT=/nuinfo/httpd/htdocs

The charset of POST requests is a sticky issue, which I think is talked
about more in the RFC on internationalization of HTML and the RFC on file
uploads.

>6.Do HTTP servers send response body content in their native encoding
>and specify the content encoding in one of the response header?  If so,
>what is the standard header and its format?  Or...
>7.Do HTTP servers send response body content in a client (browser)
>specified encoding?  If so, what is the request header that specifies
>the possible response encodings?

Body content is sent in some character encoding, usually consistent with
Accept-Charset. A server may not have a "native encoding" as such; what
matters is what is sent over the wire. Other conversions are an
implementation issue.
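
One way a server or script might do this is to walk the client's
preferred charsets in order and use the first one that can represent
the text. This is only a sketch in Python: the function name, the UTF-8
fallback, and the policy of skipping unknown names such as "*" are all
my assumptions.

```python
import codecs


def encode_body(text, preferred, fallback="utf-8"):
    """Return (charset, body_bytes) for the first usable preferred charset."""
    for name in preferred:
        try:
            codecs.lookup(name)  # raises LookupError for unknown names like "*"
            return name, text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown charset, or it cannot represent the text
    # Assumption: fall back to an encoding that can represent anything.
    return fallback, text.encode(fallback)


charset, body = encode_body("Привет", ["iso-8859-1", "iso-8859-5"])
header = "Content-Type: text/html; charset=" + charset
print(header, len(body))
```

Whatever is chosen, the label in Content-Type has to match the bytes
actually sent over the wire.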

See (1) and (2) above.

Also note that HTTP/1.1 uses Content-Encoding: and Transfer-Encoding:, which
have nothing to do with charset.
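
To show that the two are independent layers, here is a small Python
sketch in which the text is first turned into bytes by a charset and
those bytes are then compressed with gzip as a content coding. The
header values and sample text are illustrative assumptions.

```python
import gzip

text = "Grüße"

# Layer 1: the charset turns characters into bytes.
charset_bytes = text.encode("iso-8859-1")

# Layer 2: the content coding transforms those bytes; it knows nothing
# about characters.
payload = gzip.compress(charset_bytes)

headers = {
    "Content-Type": "text/html; charset=ISO-8859-1",
    "Content-Encoding": "gzip",
}

# The receiver undoes the layers in reverse order.
restored = gzip.decompress(payload).decode("iso-8859-1")
print(restored == text)
```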

>8.If the HTTP server does conversion to send responses in the requested
>encoding, what encoding is it expecting for CGI and Servlet produced
>output?  Does it assume CGI and Servlet output is in the servers native
>encoding and convert it on the fly?  Or..

I don't know. This is a CGI/Servlet issue, not an HTTP protocol issue.


---
    Albert Lunde                      Albert-Lunde@nwu.edu

Received on Wednesday, 3 December 1997 14:03:09 UTC