- From: Albert Lunde <Albert-Lunde@nwu.edu>
- Date: Wed, 3 Dec 1997 13:02:43 -0600
- To: "www-international@w3.org" <www-international@w3.org>
I'm going to try to answer these questions, but these are not authoritative answers.

You may want to look at the RFC on the Internationalization of HTML and recent drafts of the HTTP/1.1 spec, and some other material that came out of the HTML and HTTP working groups of the IETF. See:

http://www.ics.uci.edu/pub/ietf/html/
http://www.ics.uci.edu/pub/ietf/http/

for a useful archive of the working documents and products of both groups.

The basic things to understand are:

(1) HTTP uses a lot of protocol mechanisms from MIME but, unlike MIME, assumes that the data path is 8-bit clean.

(2) HTTP and HTML have historically defaulted to ISO-8859-1, but both specs have been improved in later revisions to accommodate any charset/character encoding.

(3) There are extra issues in HTML regarding the SGML "document character set", as opposed to the character encoding. I'm not going to try to say much about this except to note that they become fairly trivial for Unicode, because the strategy taken to internationalize HTML was basically to use ISO-10646 as the SGML document character set _regardless_ of the character encoding/charset used to transmit or store the HTML.

>Specifically, some questions I have are:
>
>1.Do HTTP clients (browsers) post request body content in their native
>encoding and specify the content encoding in one of the requests
>headers?

HTTP requests and replies consist of a set of headers and an optional body. (POST has a body, GET does not.)

The headers are in a fairly stereotyped form that is always restricted to US-ASCII, with the exception of some headers that can contain free text. I think text in headers is assumed to be in ISO-8859-1. I _think_ that text in other character encodings can be represented via an encoding defined in RFC 1522 and RFC 2047, but you have to read the grammar to see where this is allowed.

The Accept-Charset: header in a request indicates what charsets/character encodings are preferred.
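
To make the preference mechanism concrete, here is a minimal Python sketch of how a server-side script might read an Accept-Charset value. The function name `parse_accept_charset` is hypothetical, and the parsing is deliberately simplified (it assumes a comma-separated list with an optional `q` parameter per entry, defaulting to 1.0); the HTTP/1.1 spec remains the reference for the real grammar.

```python
def parse_accept_charset(value):
    """Return (charset, q) pairs, most preferred first.

    Simplified sketch: entries are comma-separated, each with an
    optional ";q=" weight. A missing or malformed weight counts as 1.0.
    """
    prefs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val.strip())
                except ValueError:
                    pass  # keep the default weight on a malformed value
        prefs.append((name.strip().lower(), q))
    # list.sort is stable, so entries with equal weight keep header order
    prefs.sort(key=lambda pair: pair[1], reverse=True)
    return prefs


if __name__ == "__main__":
    for charset, q in parse_accept_charset("iso-8859-5, unicode-1-1;q=0.8"):
        print(charset, q)
```

In a CGI script the input would typically come from the HTTP_ACCEPT_CHARSET environment variable, when the client sends the header at all.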
A parameter to Content-Type: indicates the charset/character encoding of the response.

>2.If so, what is the standard header and its format?

Read the HTTP/1.1 spec for a better description. Some examples:

Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
Content-Type: text/html; charset=ISO-8859-5

However, you can't yet count on all clients to label charsets in all requests.

>3.If a client posts a request body with a character encoding that is
>different than the Web servers native encoding, is the body passed to
>CGI or Servlet processing as-is? That is, as an unaltered byte stream,
>or does the Web server convert to its own native encoding?

I don't know. The CGI spec was written before the internationalization work was standardized for HTTP/HTML. I'd guess it would be passed thru as-is.

>4.If passed as-is, is the standard HTTP header that specifies the
>encoding of the request's body made available to the processing CGI or
>Servlet by the Web server?
>5.If so, what CGI variable will it be in?

In CGI, HTTP request headers are generally mapped to variables with names starting with HTTP_. If a Content-Type: header is there, I think you may get HTTP_CONTENT_TYPE (though in the POST output below only CONTENT_TYPE shows up); in addition, CGI defines a CONTENT_TYPE variable, but I'm not sure if it will include the parameters.

I tried experimenting with a GET/POST with Netscape and only determined that it isn't labeling charset by default.

Below are the environment variables for a POST and a GET request with Netscape 4 and Apache 1.2.4. I asked for Japanese, but I didn't actually change my default charset.

>SERVER_ADMIN=webmaster@www.nwu.edu
>QUERY_STRING=
>PATH=/usr/sbin:/usr/bin:/sbin
>SCRIPT_FILENAME=/nuinfo/httpd/staff-test-cgi/lunde/show-env
>REMOTE_PORT=1040
>REMOTE_HOST=socrates.tss.nwu.edu
>CONTENT_TYPE=application/x-www-form-urlencoded
>HTTP_HOST=www.nwu.edu
>GATEWAY_INTERFACE=CGI/1.1
>HTTP_REFERER=file:///Temp/RFCs/textform.html
>REQUEST_URI=/staff-test-cgi/lunde/show-env
>HTTP_CONNECTION=Keep-Alive
>SERVER_SOFTWARE=Apache/1.2.4
>REQUEST_METHOD=POST
>SERVER_NAME=www.nwu.edu
>HTTP_ACCEPT_CHARSET=iso-8859-1,*,utf-8
>HTTP_USER_AGENT=Mozilla/4.03 (Macintosh; U; PPC)
>CONTENT_LENGTH=17
>HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
>HTTP_ACCEPT_LANGUAGE=ja, en-US, fr-CA
>SCRIPT_NAME=/staff-test-cgi/lunde/show-env
>SERVER_PORT=80
>SERVER_PROTOCOL=HTTP/1.0
>REMOTE_ADDR=129.105.110.129
>TZ=CST6CDT
>DOCUMENT_ROOT=/nuinfo/httpd/htdocs

>SERVER_ADMIN=webmaster@www.nwu.edu
>QUERY_STRING=
>PATH=/usr/sbin:/usr/bin:/sbin
>SCRIPT_FILENAME=/nuinfo/httpd/staff-test-cgi/lunde/show-env
>REMOTE_PORT=1039
>REMOTE_HOST=socrates.tss.nwu.edu
>HTTP_HOST=www.nwu.edu
>GATEWAY_INTERFACE=CGI/1.1
>HTTP_REFERER=file:///Gaia/ac%20downloads/myweb/Default.html
>REQUEST_URI=/staff-test-cgi/lunde/show-env
>HTTP_CONNECTION=Keep-Alive
>SERVER_SOFTWARE=Apache/1.2.4
>REQUEST_METHOD=GET
>SERVER_NAME=www.nwu.edu
>HTTP_ACCEPT_CHARSET=iso-8859-1,*,utf-8
>HTTP_USER_AGENT=Mozilla/4.03 (Macintosh; U; PPC)
>HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
>HTTP_ACCEPT_LANGUAGE=ja, en-US, fr-CA
>SCRIPT_NAME=/staff-test-cgi/lunde/show-env
>SERVER_PORT=80
>SERVER_PROTOCOL=HTTP/1.0
>REMOTE_ADDR=129.105.110.129
>TZ=CST6CDT
>DOCUMENT_ROOT=/nuinfo/httpd/htdocs

The charset of POST requests is a sticky issue, which I think is talked about more in the RFC on internationalization of HTML and the RFC on file uploads.

>6.Do HTTP servers send response body content in their native encoding
>and specify the content encoding in one of the response header?
>If so, what is the standard header and its format? Or...
>7.Do HTTP servers send response body content in a client (browser)
>specified encoding? If so, what is the request header that specifies
>the possible response encodings?

Body content is sent in some character encoding, usually consistent with Accept-Charset. A server may not have a "native encoding" as such; what matters is what is sent over the wire. Other conversions are an implementation issue. See (1) and (2) above.

Also note that HTTP/1.1 uses Content-Encoding: and Transfer-Encoding:, which have nothing to do with charset.

>8.If the HTTP server does conversion to send responses in the requested
>encoding, what encoding is it expecting for CGI and Servlet produced
>output? Does it assume CGI and Servlet output is in the servers native
>encoding and convert it on the fly? Or..

I don't know; this is a CGI/Servlet issue, not an HTTP protocol issue.

---
Albert Lunde    Albert-Lunde@nwu.edu
Received on Wednesday, 3 December 1997 14:03:09 UTC