- From: Larry Masinter <LMM@acm.org>
- Date: Sat, 16 Jun 2001 09:02:54 -0700
- To: "Peter W" <peterw@usa.net>
- Cc: <http-wg@cuckoo.hpl.hp.com>, "Martin J. Dürst" <duerst@w3.org>
> - whether Unicode characters with values 0x100 and greater are allowed in > request headers (especially the request line) > - if so, if UTF-8 encoding is allowed The "request line" and "request headers" are different contexts. If you're thinking about using IRIs as the request URI (draft-masinter-uri-i18n-07.txt) it is still necessary to convert the IRI to a URI before using it in the HTTP protocol, because the HTTP protocol only uses URIs which have a restricted character repertoire. > - how a client indicates to the server that it's using UTF-8 depends on the context > - how an HTTP server application decides how to interpret > hex-encoded information, e.g. is %C3%B1 encoding two characters, > or the UTF-8 encoding for the single character "ñ" The hex-encoding is only used for URIs and not for other elements, and what it encodes depends entirely on the server that serves it. This is entirely server-dependent, and %C3%B1 might represent something entirely different, not characters at all. There is no restriction that encoding in URIs actually corresponds to character data. > - how/if a server might use UTF-8 in its response headers > It looks like any content that is sent with MIME headers (e.g., an object > sent by the HTTP server) could be announced with a charset value indicating > UTF-8 encoding, but that headers (request or response) are only expected to > contain characters 0x00 -> 0xFF. Yet I don't see this clearly stated. Some response headers allow TEXT, e.g., on comments. RFC 2616 section 14.46 (description of Warning header) If a character set other than ISO-8859-1 is used, it MUST be encoded in the warn-text using the method described in RFC 2047 [14]. RFC 2616, section 2.2, Basic Rules: OCTET = <any 8-bit sequence of data> CHAR = <any US-ASCII character (octets 0 - 127)> ... The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. TEXT = <any OCTET except CTLs, but including LWS> This means that you can use UTF-16 if you first base64 encode it and then use it within the RFC 2047 method, viz =?UTF-8?b?<base64string>?= > It seems fairly clear, though, that double-byte character sets (e.g., 16 > bits for each character regardless of its value) should not be used in > either request or response headers. Right? Not without some kind of encoding, but the encoding rules differ according to the context. Larry -- http://larry.masinter.net
Received on Saturday, 16 June 2001 17:07:42 UTC