RE: %NN encoding, request/response headers, UTF-8 ?

>  - whether Unicode characters with values 0x100 and greater are allowed in
>    request headers (especially the request line)
>  - if so, if UTF-8 encoding is allowed

The "request line" and "request headers" are different contexts.
If you're thinking about using IRIs as the request URI
  (draft-masinter-uri-i18n-07.txt),
it is still necessary to convert the IRI to a URI before using it
in the HTTP protocol, because the HTTP protocol only uses URIs,
which have a restricted character repertoire.
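
As a rough illustration of that conversion step, here is a minimal
Python sketch. It assumes the usual mapping (encode the non-ASCII
characters as UTF-8, then %-escape each octet) and only handles the
path component; the function name and the sample path are made up for
the example, and host names would need separate treatment.

```python
from urllib.parse import quote


def iri_path_to_uri_path(iri_path: str) -> str:
    """Hypothetical helper: map an IRI path to a URI path.

    Characters are encoded as UTF-8 and every octet that is not an
    unreserved URI character is written as %HH. "/" is kept as-is so
    the path structure survives.
    """
    return quote(iri_path, safe="/", encoding="utf-8")


# "\u00e9" is e-acute and "\u00f1" is n-tilde.
uri_path = iri_path_to_uri_path("/caf\u00e9/ma\u00f1ana")
print(uri_path)  # /caf%C3%A9/ma%C3%B1ana
```

The result contains only US-ASCII, so it can go in the request line.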

>  - how a client indicates to the server that it's using UTF-8

It depends on the context.

>  - how an HTTP server application decides how to interpret
>    hex-encoded information, e.g. is %C3%B1 encoding two characters,
>    or the UTF-8 encoding for the single character "ñ"

Hex-encoding is only used for URIs and not for other elements,
and what it encodes depends entirely on the server that serves
the URI. %C3%B1 might represent something entirely different,
not characters at all. There is no requirement that the octets
encoded in a URI actually correspond to character data.
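
To make the ambiguity concrete, here is a small Python sketch. The
variable names are just for illustration; the point is that unescaping
yields octets, and turning those octets into characters is a separate
decision that the URI itself does not make for you.

```python
from urllib.parse import unquote_to_bytes

# Unescaping %HH sequences gives octets, not characters.
raw = unquote_to_bytes("%C3%B1")

# Interpretation 1: the server treats the octets as UTF-8.
as_utf8 = raw.decode("utf-8")          # one character, n-tilde

# Interpretation 2: the server treats them as ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1")   # two characters

print(raw, len(as_utf8), len(as_latin1))
```

A server could equally well treat the two octets as opaque binary data
and never decode them at all.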


>  - how/if a server might use UTF-8 in its response headers
> It looks like any content that is sent with MIME headers (e.g., an object
> sent by the HTTP server) could be announced with a charset value indicating
> UTF-8 encoding, but that headers (request or response) are only expected to
> contain characters 0x00 -> 0xFF. Yet I don't see this clearly stated.


Some response headers allow TEXT, e.g., in comments.

RFC 2616 section 14.46 (description of Warning header)

   If a character set other than ISO-8859-1 is used, it MUST be encoded
   in the warn-text using the method described in RFC 2047 [14].

RFC 2616, section 2.2, Basic Rules:
       OCTET          = <any 8-bit sequence of data>
       CHAR           = <any US-ASCII character (octets 0 - 127)>

...

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than ISO-
   8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

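This means that you can use UTF-8 if you first base64 encode it
and then use it within the RFC 2047 method, viz.
                =?UTF-8?b?<base64string>?=
The charset label in the encoded-word has to name the encoding
that was actually used for the octets.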
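
A minimal Python sketch of building and reading back such an
encoded-word, using a UTF-8 charset label. The helper names are
invented for the example, and it ignores the RFC 2047 length limit on
a single encoded-word, so treat it as an illustration only.

```python
import base64
from email.header import decode_header


def encode_word(text: str) -> str:
    """Hypothetical helper: wrap text as one RFC 2047 "B" encoded-word."""
    octets = text.encode("utf-8")
    b64 = base64.b64encode(octets).decode("ascii")
    return "=?UTF-8?B?" + b64 + "?="


def decode_words(value: str) -> str:
    """Hypothetical helper: decode a value containing encoded-words."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded-words come back as octets plus a charset label.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


encoded = encode_word("\u00f1")   # n-tilde
decoded = decode_words(encoded)
print(encoded)  # =?UTF-8?B?w7E=?=
```

The encoded form is pure US-ASCII, which is what lets it pass through
a TEXT field without relying on ISO-8859-1.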


> It seems fairly clear, though, that double-byte character sets (e.g., 16
> bits for each character regardless of its value) should not be used in
> either request or response headers. Right?

Not without some kind of encoding, but the encoding rules differ
according to the context.

Larry
--
http://larry.masinter.net

Received on Saturday, 16 June 2001 17:07:42 UTC