Re: IRIs, IDNAbis, and HTTP from Frank Ellermann on 2008-03-14 (ietf-http-wg@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 14 Mar 2008 14:25:55 +0100
To: ietf-http-wg@w3.org
Message-ID: <frdu80$10v$1@ger.gmane.org>

Brian Smith wrote:

> Consider:
>   Content-Type: text/plain;charset="=?utf-8?q?utf-8?="
>   (how do you compare this against 'text/plain;charset="utf-8"'?)
[...]
> the grammar seems to allow encoded-words to be mixed with unencoded
> words.

Yes.  FWS between encoded words is removed by decoders, that's how
you get around the length limit, just insert FWS between characters
in the encoder.  Not between input octets, you can't "split" UTF-8.
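
A minimal sketch of that splitting, assuming UTF-8 with the "Q"
encoding and a deliberately conservative choice of which characters
to escape.  The names q_encode_char and encode_words are made up for
this illustration; only decode_header is Python's stdlib decoder.

```python
from email.header import decode_header

PREFIX, SUFFIX, LIMIT = "=?utf-8?Q?", "?=", 75

def q_encode_char(ch):
    """Q-encode one character (conservatively) from its UTF-8 octets."""
    if ch == " ":
        return "_"
    if ch.isascii() and ch.isalnum():
        return ch
    return "".join("=%02X" % octet for octet in ch.encode("utf-8"))

def encode_words(text):
    """Split text into encoded-words of at most LIMIT characters each."""
    room = LIMIT - len(PREFIX) - len(SUFFIX)
    words, current = [], ""
    for ch in text:                      # iterate characters, not octets
        piece = q_encode_char(ch)
        if current and len(current) + len(piece) > room:
            words.append(PREFIX + current + SUFFIX)
            current = ""
        current += piece
    if current:
        words.append(PREFIX + current + SUFFIX)
    return words

text = "ä" * 40
words = encode_words(text)
folded = "\r\n ".join(words)             # FWS between encoded-words
decoded = b"".join(chunk for chunk, _ in decode_header(folded)).decode("utf-8")
```

Because a word boundary only ever falls between characters, each
encoded-word is valid UTF-8 on its own, and the decoder glues the
pieces back together without the inserted FWS.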

FWS between encoded and unencoded words is for real (= white space),
like FWS between unencoded words.  (FWS is "folding white space";
the older MIME RFCs still use "linear white space", as does RFC 2616.)
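
To see the two cases side by side, here is a small sketch using
Python's stdlib decoder (email.header.decode_header).  The helper
decode_to_text and the sample strings are only assumptions for the
illustration.

```python
from email.header import decode_header

def decode_to_text(value):
    """Decode a field value with Python's stdlib RFC 2047 decoder."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)          # no encoded-word found at all
    return "".join(parts)

# White space between two encoded-words is dropped by the decoder.
adjacent = "=?utf-8?Q?caf?= =?utf-8?Q?=C3=A9?="
# White space between an encoded-word and plain words is kept.
mixed = "=?utf-8?Q?caf=C3=A9?= au lait"

print(decode_to_text(adjacent))
print(decode_to_text(mixed))
```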

> multiple encodings (e.g. UTF-8 and UTF-7) to be mixed.

Yep, I like pc-multilingual-850+euro better than UTF-7, it arrives
faster at the critical "75", especially if combined with a *long*
RFC 4646 language tag as specified in RFC 2231.

Back to your original point, you don't 2047-encode quoted-strings,
RFC 2047 says: 

| + An 'encoded-word' MUST NOT appear within a 'quoted-string'.
[...]
| + An 'encoded-word' MUST NOT be used in parameter of a MIME
|   Content-Type or Content-Disposition field, or in any structured
|   field body except within a 'comment' or 'phrase'.

Roughly the idea is that you can 2047-encode unstructured header
fields as you like (example: Subject in mail or news).  For any
structured header field you must not touch its structure: a
comment must still be a comment, a mail address must not be
touched at all, ditto quoted-string etc. (see above).
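
As a rough check for the first rule quoted above, a sketch that looks
for encoded-words inside quoted-strings.  The two regular expressions
are simplified approximations of the RFC 2047 and RFC 822 grammars,
and the function name is made up for this example.

```python
import re

# Simplified patterns, not the full grammars.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?=")
QUOTED_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

def encoded_words_in_quoted_strings(field_body):
    """Return every encoded-word that sits inside a quoted-string."""
    hits = []
    for quoted in QUOTED_STRING.finditer(field_body):
        hits.extend(ENCODED_WORD.findall(quoted.group(0)))
    return hits

bad = 'text/plain;charset="=?utf-8?q?utf-8?="'
good = 'text/plain;charset="utf-8"'

print(encoded_words_in_quoted_strings(bad))
print(encoded_words_in_quoted_strings(good))
```

A non-empty result means the field body breaks the MUST NOT, so the
comparison problem in the quoted Content-Type example should never
come up with conforming senders.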

Example: This (is) a test

For a structured header field with name "Example" and field body
"This (is) a test", the four words actually being in some interesting
charset, you can 2047-encode "This", "is", "a", "test", or "a test".

You cannot encode anything together with the parentheses of "(is)",
it would obscure the structure, here the comment "(" and ")".  The
goal is that an MTA or MUA knowing nothing about MIME at all can
simply treat (=?us-ascii*tlh?Q?is?=) or similar as some weird
ASCII-word in a comment; it doesn't need to know that it's a US-ASCII
Klingon "is", but it needs to know that it's a comment.

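A sketch of the "Example" field above, assuming every bare word of
this hypothetical field may be encoded (real structured fields allow
that only in a 'phrase' or 'comment').  The helpers q_word,
encode_structured and comments are invented for the illustration, and
comments() stands in for a MIME-agnostic parser that only knows about
non-nested parentheses.

```python
import re

def q_word(word, charset="utf-8"):
    """Wrap one word as a Q-encoded encoded-word, escaping non-alphanumerics."""
    octets = word.encode(charset)
    body = "".join(
        chr(o) if o < 128 and chr(o).isalnum() else "=%02X" % o
        for o in octets
    )
    return "=?%s?Q?%s?=" % (charset, body)

# A quoted-string, or a run of characters that is not space, paren or quote.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[^\s()"]+')

def encode_structured(body, charset="utf-8"):
    """Encode bare words only; parens, white space and quoted-strings stay."""
    def replace(match):
        token = match.group(0)
        if token.startswith('"'):
            return token                 # never touch a quoted-string
        return q_word(token, charset)
    return TOKEN.sub(replace, body)

def comments(body):
    """What a MIME-agnostic parser would see as (non-nested) comments."""
    return re.findall(r"\(([^()]*)\)", body)

encoded = encode_structured("This (is) a test")
print(encoded)
print(comments(encoded))
```

The comment is still found in the same place after encoding; only its
content turned into a weird-looking ASCII word.
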
Example: "umlauted gibberish" 

You can't encode the gibberish within the quoted-string.  But
for structured fields a quoted-string is *always* (please check
this) used where unquoted words are allowed too.  In other words you
cannot do "=?utf-8?Q?umlauted_gibberish?=" within quotes.  But
you can use =?utf-8?Q?umlauted_gibberish?= without quotes as an
ordinary word ("ordinary" from the POV of a MIME-agnostic MTA).

 Frank

Received on Friday, 14 March 2008 13:23:57 UTC