RE: %NN encoding, request/response headers, UTF-8 ?

From: Larry Masinter <LMM@acm.org>
Date: Sat, 16 Jun 2001 09:02:54 -0700
To: "Peter W" <peterw@usa.net>
Cc: <http-wg@cuckoo.hpl.hp.com>, "Martin J. Dürst" <duerst@w3.org>
>  - whether Unicode characters with values 0x100 and greater are allowed in
>    request headers (especially the request line)
>  - if so, if UTF-8 encoding is allowed

The "request line" and "request headers" are different contexts.
If you're thinking about using IRIs as the request URI, it is
still necessary to convert the IRI to a URI before using it in
HTTP, because the protocol only uses URIs, which have a
restricted character repertoire.
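
As a rough illustration of that conversion step, here is a minimal
Python sketch. It assumes the usual IRI-to-URI mapping (encode each
non-ASCII character as UTF-8, then %NN-escape the octets) and only
handles the path portion. The helper name iri_path_to_uri_path and
the sample path are hypothetical.

```python
from urllib.parse import quote


def iri_path_to_uri_path(iri_path: str) -> str:
    """Percent-encode the UTF-8 octets of non-ASCII characters in a path.

    Hypothetical helper: it covers only the path component and assumes
    UTF-8 is the encoding agreed on for the escaped octets.
    """
    # safe="/%" leaves path separators and already-escaped octets alone.
    return quote(iri_path, safe="/%", encoding="utf-8")


# U+00F1 (n with tilde) becomes the two escaped octets %C3%B1.
request_target = iri_path_to_uri_path("/se\u00f1or/index.html")
request_line = f"GET {request_target} HTTP/1.1"
```

The resulting request line contains only US-ASCII characters, which
is what the request line requires.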

>  - how a client indicates to the server that it's using UTF-8

That depends on the context.

>  - how an HTTP server application decides how to interpret
>    hex-encoded information, e.g. is %C3%B1 encoding two characters,
>    or the UTF-8 encoding for the single character "ñ"

The hex-encoding is only used for URIs and not for other elements,
and what it encodes is entirely up to the server that serves the
URI. %C3%B1 might represent something else altogether, not
characters at all: there is no requirement that the octets
encoded in a URI correspond to character data.
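
To make the ambiguity concrete, here is a small Python sketch (an
illustration only, not anything HTTP mandates) that unescapes %C3%B1
to raw octets and then reads those same octets under two different
assumptions. The variable names are made up for the example.

```python
from urllib.parse import unquote_to_bytes

# Unescaping only yields octets; nothing in the URI says what they mean.
raw = unquote_to_bytes("%C3%B1")

# Interpretation 1: the octets are UTF-8 (one character, U+00F1).
as_utf8 = raw.decode("utf-8")

# Interpretation 2: the octets are ISO-8859-1 (two characters, U+00C3 U+00B1).
as_latin1 = raw.decode("iso-8859-1")
```

A server could just as well treat the two octets as an opaque binary
key and never decode them as text at all.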

>  - how/if a server might use UTF-8 in its response headers
> It looks like any content that is sent with MIME headers (e.g., an object
> sent by the HTTP server) could be announced with a charset value indicating
> UTF-8 encoding, but that headers (request or response) are only expected to
> contain characters 0x00 -> 0xFF. Yet I don't see this clearly stated.

Some response headers allow TEXT, e.g., in comments.

RFC 2616, section 14.46 (description of the Warning header):

   If a character set other than ISO-8859-1 is used, it MUST be encoded
   in the warn-text using the method described in RFC 2047 [14].

RFC 2616, section 2.2, Basic Rules:
       OCTET          = <any 8-bit sequence of data>
       CHAR           = <any US-ASCII character (octets 0 - 127)>


   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than
   ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

This means that you can use UTF-16 if you first base64 encode it
and then use it within the RFC 2047 method, viz
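
A minimal Python sketch of the idea, under stated assumptions: the
text is serialized as big-endian UTF-16 and labeled with the charset
name "UTF-16BE", and the helper encode_word_utf16 is hypothetical.
It builds an RFC 2047 encoded-word by hand with "B" (base64)
encoding, then round-trips it through the standard library's
decode_header.

```python
import base64
from email.header import decode_header


def encode_word_utf16(text: str) -> str:
    """Build an RFC 2047 encoded-word: =?charset?B?base64-payload?=

    Hypothetical helper; it does not enforce the 75-character limit
    that RFC 2047 places on a single encoded-word.
    """
    payload = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    return f"=?UTF-16BE?B?{payload}?="


# U+00F1 is the two octets 00 F1 in UTF-16BE, which base64-encodes to "APE=".
encoded = encode_word_utf16("\u00f1")

# decode_header returns (bytes, charset) pairs for encoded-words.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
```

The encoded-word itself is pure US-ASCII, so it fits inside TEXT
without violating the ISO-8859-1 default.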

> It seems fairly clear, though, that double-byte character sets (e.g., 16
> bits for each character regardless of its value) should not be used in
> either request or response headers. Right?

Not without some kind of encoding, but the encoding rules differ
according to the context.

Received on Saturday, 16 June 2001 17:07:42 UTC
