- From: Larry Masinter <LMM@acm.org>
- Date: Sat, 16 Jun 2001 09:02:54 -0700
- To: "Peter W" <peterw@usa.net>
- Cc: <http-wg@cuckoo.hpl.hp.com>, "Martin J. Dürst" <duerst@w3.org>
> - whether Unicode characters with values 0x100 and greater are allowed in
> request headers (especially the request line)
> - if so, if UTF-8 encoding is allowed
The "request line" and "request headers" are different contexts.
If you're thinking about using IRIs as the request URI
(draft-masinter-uri-i18n-07.txt)
it is still necessary to convert the IRI to a URI before using it
in HTTP, because HTTP only uses URIs, which have a restricted
character repertoire.
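
As a rough illustration of that conversion step, here is a minimal Python
sketch. It assumes the common approach of encoding non-ASCII characters as
UTF-8 and then hex-escaping each octet; the exact rules in the draft are
not reproduced here. The function name iri_path_to_uri_path is
hypothetical, and only the path component is handled (host names would
need separate treatment).

```python
from urllib.parse import quote

def iri_path_to_uri_path(iri_path: str) -> str:
    """Hex-escape the UTF-8 octets of every non-ASCII character in a path.

    Assumption: UTF-8 is the encoding used before escaping. "/" is left
    unescaped so the path structure survives; any "%" already present
    would be escaped again, so pass an unescaped path.
    """
    return quote(iri_path, safe="/", encoding="utf-8")

if __name__ == "__main__":
    # The result contains only US-ASCII and can go on the request line.
    print(iri_path_to_uri_path("/caf\u00e9/ni\u00f1o"))
```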
> - how a client indicates to the server that it's using UTF-8
That depends on the context.
> - how an HTTP server application decides how to interpret
> hex-encoded information, e.g. is %C3%B1 encoding two characters,
> or the UTF-8 encoding for the single character "ñ"
The hex-encoding is only used for URIs and not for other elements.
What the encoded octets mean is entirely server-dependent: %C3%B1
might represent something entirely different, not characters at all.
There is no requirement that hex-encoded octets in a URI actually
correspond to character data.
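
To make the ambiguity concrete, the following Python sketch decodes
%C3%B1 to octets and then shows two possible readings of them. The
function name interpretations is hypothetical; which reading (if either)
is right is a decision of the server, not something the URI itself says.

```python
from urllib.parse import unquote_to_bytes

def interpretations(escaped: str) -> dict:
    """Return the raw octets of a hex-escaped string plus two readings."""
    raw = unquote_to_bytes(escaped)  # hex-escapes -> octets, no charset implied
    return {
        "octets": raw,
        # Reading 1: the octets are UTF-8 (one character for %C3%B1).
        "as_utf8": raw.decode("utf-8", errors="replace"),
        # Reading 2: the octets are ISO-8859-1 (two characters for %C3%B1).
        "as_latin1": raw.decode("latin-1"),
    }

if __name__ == "__main__":
    for name, value in interpretations("%C3%B1").items():
        print(name, repr(value))
```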
> - how/if a server might use UTF-8 in its response headers
> It looks like any content that is sent with MIME headers (e.g., an object
> sent by the HTTP server) could be announced with a charset value indicating
> UTF-8 encoding, but that headers (request or response) are only expected to
> contain characters 0x00 -> 0xFF. Yet I don't see this clearly stated.
Some response headers allow TEXT, e.g., in comments.
RFC 2616, section 14.46 (description of the Warning header):
If a character set other than ISO-8859-1 is used, it MUST be encoded
in the warn-text using the method described in RFC 2047 [14].
RFC 2616, section 2.2, Basic Rules:
OCTET = <any 8-bit sequence of data>
CHAR = <any US-ASCII character (octets 0 - 127)>
...
The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].
TEXT = <any OCTET except CTLs,
but including LWS>
This means that you can use UTF-8 if you first base64-encode it
and then wrap it using the RFC 2047 method, viz.
=?UTF-8?b?<base64string>?=
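
A minimal Python sketch of that wrapping, for illustration only. The
helper names encode_word and decode_word are hypothetical; the sketch
produces a single encoded-word and ignores the RFC 2047 length limit on
encoded-words, so long text would need to be split. Decoding uses
decode_header from the standard email.header module.

```python
import base64
from email.header import decode_header

def encode_word(text: str, charset: str = "UTF-8") -> str:
    """Wrap text as one RFC 2047 encoded-word using base64 ("b") encoding."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

def decode_word(word: str) -> str:
    """Decode a single encoded-word back to text."""
    raw, charset = decode_header(word)[0]  # (octets, charset name)
    return raw.decode(charset)

if __name__ == "__main__":
    word = encode_word("ni\u00f1o")
    print(word)               # US-ASCII only, usable where TEXT is allowed
    print(decode_word(word))
```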
> It seems fairly clear, though, that double-byte character sets (e.g., 16
> bits for each character regardless of its value) should not be used in
> either request or response headers. Right?
Not without some kind of encoding, but the encoding rules differ
according to the context.
Larry
--
http://larry.masinter.net
Received on Saturday, 16 June 2001 17:07:42 UTC