RE: Factoring out Content-Disposition (i123), was: Content-Disposition (new issue?) from Brian Smith on 2008-08-15 (ietf-http-wg@w3.org from July to September 2008)

From: Brian Smith <brian@briansmith.org>
Date: Fri, 15 Aug 2008 17:33:16 -0500
To: "'Julian Reschke'" <julian.reschke@gmx.de>
Cc: <ietf-http-wg@w3.org>
Message-ID: <AC01D061970749D89C9D1904D88BBBAC@T60>

Julian Reschke wrote:
> Brian Smith wrote:
> > RFC 2231 + UTF-8 is an especially bad interchange format for text 
> > since it requires up to 9 bytes per letter for the vast majority of
> > people's native
> 
> Making UTF-16 support mandatory could help here, but I'm not 
> sure how widespread support for that is (recall I'm trying to 
> document what several UAs already do and have been doing for 
> a long time). Will keep this in mind when writing test cases.

I don't think UTF-16 support is worthwhile. I think the byte overhead isn't
an issue if the intention is to support only very short, non-prose text like
filenames (see
below). It seems there is some agreement that HTTP headers should not
contain human-oriented text. My concern is that having a separate standard
for RFC2231 in HTTP will promote the idea of human-oriented text in headers
instead of discouraging it.
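
To make the per-letter overhead concrete, here is a rough sketch that builds
an RFC 2231 extended value with UTF-8 and counts the encoded bytes per
character. The helper names are made up for illustration, and it escapes more
characters than RFC 2231 strictly requires (that is an assumption of the
sketch, not something the draft mandates).

```python
from urllib.parse import quote


def rfc2231_ext_value(text, lang=""):
    # Hypothetical helper: charset'language'percent-encoded-octets.
    # safe="" percent-encodes everything except letters, digits and "_.-~".
    return "UTF-8'{}'{}".format(lang, quote(text, safe="", encoding="utf-8"))


def encoded_bytes_per_char(text):
    # Hypothetical helper: counts only the percent-encoded payload,
    # not the "UTF-8''" prefix.
    return len(quote(text, safe="", encoding="utf-8")) / len(text)


print(rfc2231_ext_value("日本語"))
print(encoded_bytes_per_char("abc"))      # ASCII letters pass through as-is
print(encoded_bytes_per_char("файл"))     # 2-byte UTF-8 sequences
print(encoded_bytes_per_char("日本語"))   # 3-byte UTF-8 sequences
```

For a filename of a dozen characters the overhead is tolerable; for prose it
adds up quickly.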

> > language-sensitive text out of HTTP as much as possible by 
> > recommending that applications transfer language-sensitive text in 
> > entity bodies as much as possible. Really, it is only suitable for 
> > short, language-neutral strings like (file and IRI) path fragments.
> 
> That's something I agree with. For instance, WebDAV doesn't 
> suffer from these kinds of problems because anything that is 
> text actually travels in entity bodies as XML.
> 
> That being said, you can't always avoid it, such as in 
> Content-Disposition or Slug.
> 

Since the primary (only?) use case for RFC2231 in HTTP is the
Content-Disposition header, why not just fold this into the spec. that you
are writing for Content-Disposition? URI references are already
ASCII-encoded IRIs, and Atom's Slug header field already has its own
mechanism for handling non-ASCII text.

> > languages. Plus, there are no features for language tagging (needed 
> > for CJK languages), BIDI (needed for middle-eastern languages), or 
> > accessibility (for users of screen readers). IMO, the best 
> > thing to do is to keep
> 
> RFC 2231 *does* include language tagging. WRT BIDI I'm no 
> expert, but I thought Unicode has something to say here?
> And could you clarify the accessibility concern please?

Language tagging, BIDI, and accessibility features are not really necessary
for the specific case of filenames. Those issues come into play when you try
to define a general-purpose mechanism for supporting human-oriented text.
For example, RFC 2231 only allows a language tag for the entire parameter
value, but doesn't provide a means of handling mixed-language text.
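
A minimal sketch of what a receiver gets out of an RFC 2231 extended value,
assuming the value has already been isolated from the header (the function
name is hypothetical). Note that there is exactly one language slot for the
whole string.

```python
from urllib.parse import unquote


def parse_ext_value(value):
    # Hypothetical helper. Split on the first two apostrophes:
    # charset ' language ' percent-encoded text.
    # Raises ValueError if the value does not contain both apostrophes.
    charset, language, encoded = value.split("'", 2)
    text = unquote(encoded, encoding=charset, errors="strict")
    return charset, language, text


# Example value taken from RFC 2231 itself.
print(parse_ext_value("us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A"))
# The language may be left empty.
print(parse_ext_value("UTF-8''%E6%97%A5%E6%9C%AC%E8%AA%9E"))
```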

> 
> > Nitpicks:
> > 
> > The draft references Unicode 4.0 indirectly through 
> > RFC3629. It would be better to allow implementations to use
> > any later versions, or at least the current version, 5.1.
> 
> Yes, that's a nit, isn't it :-).

Yes, but this issue seems to always come up when specifications reference
Unicode documents.

> > I don't see the point of requiring ISO-8859-1. ISO-8859-1 can only 
> > encode a very small number of languages that are used by a small 
> > minority of people (who just happen to be over-represented in 
> > standards committees). Advocating
> > ISO-8859-1 also seems to be the opposite of what was 
> > discussed at the IETF meeting (AFAICT from the logs).
> 
> I originally wanted to mandate UTF-8 only, but people pointed 
> out (rightfully), that any HTTP software already needs to 
> understand ISO-8859-1, so it really doesn't make a difference.

Judging from Roy's response, it looks like software won't have to understand
more than ASCII, though it will have to tolerate non-ASCII bytes
(presumably, regardless of whether those bytes can be decoded into valid
characters in any encoding). Historically, ISO-8859-1 seems to be very
difficult for implementers to get right since Windows-1252 and other similar
encodings are often mislabeled as ISO-8859-1.
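
As a small illustration of the mislabeling problem: bytes in the 0x80-0x9F
range are C1 control characters in ISO-8859-1 but printable characters in
Windows-1252. The sample bytes and the helper below are invented for the
example, and the check is only a heuristic.

```python
# Curly quotes as a Windows-1252 sender would emit them.
raw = b"\x93report\x94.txt"

as_latin1 = raw.decode("iso-8859-1")   # what the label claims
as_cp1252 = raw.decode("cp1252")       # what the sender probably meant


def has_c1_controls(text):
    # Hypothetical heuristic: C1 controls (U+0080..U+009F) almost never
    # appear in real filenames, so their presence hints at mislabeling.
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


print(repr(as_latin1), has_c1_controls(as_latin1))
print(repr(as_cp1252), has_c1_controls(as_cp1252))
```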

- Brian
Received on Friday, 15 August 2008 22:33:58 UTC