Re: Factoring out Content-Disposition (i123) from Frank Ellermann on 2008-08-16 (ietf-http-wg@w3.org from July to September 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Sat, 16 Aug 2008 04:43:08 +0200
To: ietf-http-wg@w3.org
Message-ID: <g85eoa$3h0$1@ger.gmane.org>

Brian Smith wrote:

> RFC 2231 + UTF-8 is an especially bad interchange format
> for text since it requires over 9 bytes per letter

The length is no obstacle for HTTP, and as you write later,
we are talking about relatively short strings anyway.  For
some languages legacy charsets will be "better" than UTF-8
wrt "compression", but I think interoperability is more
relevant for our purposes.
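
To put numbers on the size argument, here is a minimal Python
sketch of the RFC 2231 form charset'language'percent-encoded-octets.
The helper names ext_value and encoded_length are made up for this
illustration; only urllib.parse.quote is a real library call.

```python
from urllib.parse import quote


def ext_value(text, charset="UTF-8", language=""):
    """Build charset'language'percent-encoded-octets (hypothetical helper)."""
    # Percent-encode every octet outside A-Z a-z 0-9 and "_.-~".
    encoded = quote(text.encode(charset), safe="")
    return f"{charset}'{language}'{encoded}"


def encoded_length(char, charset):
    """Number of ASCII octets one character needs after percent-encoding."""
    return len(quote(char.encode(charset), safe=""))


print(ext_value("ä.txt", language="de"))
print(encoded_length("ä", "UTF-8"), encoded_length("ä", "ISO-8859-1"))
```

Under these assumptions "ä" costs 6 octets via UTF-8 and 3 via
Latin-1, and a three-octet UTF-8 character costs 9.  That is the
"compression" difference mentioned above, and it is small for
short strings.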

There is no way to jump from "raw Latin-1" to "raw UTF-8"
in HTTP/1.1 headers; any mixture would be a horrible mess.
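
A small Python sketch of why a mixture is a mess: the same raw
octets read differently depending on which charset the recipient
assumes.  The function name interpret is hypothetical, chosen
only for this example.

```python
def interpret(octets):
    """Decode raw header octets under both candidate charsets."""
    # Latin-1 maps every octet to a character, so it never fails.
    as_latin1 = octets.decode("iso-8859-1")
    try:
        as_utf8 = octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, e.g. a lone Latin-1 0xE4 octet.
        as_utf8 = None
    return as_latin1, as_utf8


print(interpret("ä".encode("utf-8")))       # sender used raw UTF-8
print(interpret("ä".encode("iso-8859-1")))  # sender used raw Latin-1
```

A recipient expecting Latin-1 sees two characters where the
sender meant one, and a recipient expecting UTF-8 cannot decode
the Latin-1 octet at all.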

If the WG meeting had a coherent transition strategy, e.g.,
"stick to Latin-1 in HTTP/1.1, do UTF-8 later in HTTP/1.2",
or "deprecate Latin-1 in HTTP/1.1 now, introduce UTF-8 in
HTTP/1.1 later", I'd like to see precise minutes about it.

JFTR, again, I think we need a clear transition strategy.
But so far we don't have it.

> there are no features for language tagging

Of course there are.  Raw UTF-8 doesn't offer this, unless
you try the NOT RECOMMENDED obscure u+E00?? language tags.

> BIDI (needed for middle-eastern languages)

All charsets needing this, not limited to UTF-8, offer it.
The gibbous RFC 2231 percent-encoding doesn't change this.

> it is only suitable for short, language-neutral strings
> like (file and IRI) path fragments.

Do you propose to remove the optional [language] element
in the draft ?  It's a possibility, but some lines above
you said language tagging is essential.
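
For reference, a rough Python sketch of how a recipient could
split such a value into charset, optional language, and text.
The name parse_ext_value is an assumption for illustration, and
the sketch does no validation of the charset or language tag.

```python
from urllib.parse import unquote_to_bytes


def parse_ext_value(value):
    """Split charset'language'percent-encoded-octets (hypothetical helper)."""
    # The first two apostrophes delimit charset and language.
    charset, language, encoded = value.split("'", 2)
    text = unquote_to_bytes(encoded).decode(charset)
    # An empty language field means "no language given".
    return charset, language or None, text


print(parse_ext_value("UTF-8'de'%C3%A4.txt"))
print(parse_ext_value("ISO-8859-1''%E4.txt"))
```

Both inputs yield the same text; only the first carries a
language tag.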

> The draft references Unicode 4.0 indirectly through
> RFC3629.

Strong NAK.  STD 63 is not, repeat NOT, bound to some
specific Unicode version.  In a parallel universe where
the Unicode Consortium tried to redefine UTF-8 they'd
be disappointed when STD 63 sticks to the definition
as it was in Unicode 4.  But that is bad science fiction.

And it doesn't affect the set of assigned code points,
UTF-8 can do anything up to u+10FFFF as specified in
STD 63.  Other non-IETF UTF-8 specifications are less
relevant for our purposes.  (As you see I believe in
bad science fiction, after ISO 29500.  Therefore it's
IMO perfect to reference STD 63).
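
STD 63 (RFC 3629) limits UTF-8 to the range U+0000..U+10FFFF.
A short Python check of the upper bound, relying on Python's
own codec rather than anything from the draft; utf8_octets is
a hypothetical helper name.

```python
def utf8_octets(code_point):
    """Return the UTF-8 octets for one code point."""
    # chr() raises ValueError for anything above 0x10FFFF.
    return chr(code_point).encode("utf-8")


print(utf8_octets(0x10FFFF))  # highest code point, four octets
try:
    utf8_octets(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```

Whether a code point in that range is assigned depends on the
Unicode version; whether it can be encoded does not.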

> I don't see the point of requiring ISO-8859-1.

See above, so far all proposals to ditch Latin-1 didn't
make it.  As long as that doesn't change, Latin-1 is the
only permitted form of any non-ASCII octets in HTTP/1.1
headers.

 Frank

Received on Saturday, 16 August 2008 02:42:09 UTC