W3C home > Mailing lists > Public > ietf-http-wg-old@w3.org > September to December 1994

Re: Closure on canonicalization, I hope

From: Marc VanHeyningen <mvanheyn@cs.indiana.edu>
Date: Fri, 23 Dec 1994 23:24:38 -0500
To: Larry Masinter <masinter@parc.xerox.com>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <10224.788243078@moose.cs.indiana.edu>
Thus wrote: Larry Masinter
>As long as we're going to redefine the Internet media types to be "as
>registered for MIME with some exceptions", I think we might as well
>handle the character set issue as well as the EOL convention one.

They do seem rather intertwined.  I, of course, have an absurdly naive
hope that, because the MIME folks have not yet set this issue into
even RFC-level stone let alone STD-level stone, it might be possible
for MIME and HTTP to do it the same way.

>That is: object types in HTTP denote the corresponding Internet Media
>Type as defined in RFC 1521 and are registered via the procedure
>outlined in RFC 1590. However, HTTP makes two modifications to the 
>interpretation of "text/*" media types:
>
>a) the default character set (if no charset parameter is supplied) is
>   ISO-8859-1, not US-ASCII.
>
>b) no requirement is placed that documents which *could* be
>   interpreted as US-ASCII must be labelled such. (This was a MIME
>   requirement, but shouldn't be for HTTP).  
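The two modifications above amount to a change in how a client picks an
effective charset.  A minimal sketch, assuming a hypothetical helper
`effective_charset` (not anything in the proposal itself), using the
stdlib Content-Type parser:

```python
# Hypothetical sketch of points (a) and (b): for text/* media types,
# HTTP would default to ISO-8859-1 when no charset parameter is
# present, where MIME defaults to US-ASCII.
from email.message import Message  # stdlib Content-Type parsing

def effective_charset(content_type: str) -> str:
    """Return the charset a client should assume for a text/* body."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return str(charset).lower()
    # Proposed HTTP default (point a); MIME would say "us-ascii" here.
    return "iso-8859-1"

print(effective_charset("text/html"))                     # iso-8859-1
print(effective_charset("text/plain; charset=US-ASCII"))  # us-ascii
```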

As I read the MIME spec, it says that US-ASCII text "should" be
labeled as US-ASCII and not some superset like ISO-8859-1 (and, in
general, meta-data should specify the lowest common denominator
possible for understanding.)  I guess I don't really see the advantage
of changing the default in this fashion, though I don't see it as a
big problem either.  I just worry that a nontrivial subset of people
think that allowing 8859-1 addresses cross-linguistic use, which of
course it doesn't except for a few languages in a small part of the
world.

Environments where US-ASCII can be displayed but 8859-1 cannot are
still plentiful; I'm using one right now, and run it under Lynx all
the time (and have 8859-1 characters quietly converted to graphical
gibberish, which is not ideal behavior.)

>c) the end of line convention depends on the character set. In
>   particular, while US-ASCII requires 'CR LF' as the end of line,
>   the end of line in ISO-8859-1 may consist of
>      a single CR
>      a single LF
>      a combined CR LF
>   and receivers must be prepared to determine the end of line
>   convention used in the text/ type.
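Point (c) above asks receivers to tolerate all three conventions.  A
minimal sketch of what such a receiver might do, assuming a
hypothetical `split_lines` helper (not part of the proposal):

```python
# Hypothetical sketch of point (c): a receiver that accepts any of the
# three end-of-line conventions (CR, LF, or CR LF) in a text/* body
# and normalizes them before local processing.
def split_lines(text: str) -> list[str]:
    """Split a text/* body on CR, LF, or CR LF, treating each as one EOL."""
    # Collapse CR LF first so it is not counted as two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

body = "one\r\ntwo\rthree\nfour"
print(split_lines(body))  # ['one', 'two', 'three', 'four']
```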

Hmmm... I guess I don't see the connection.  Is broadening the
acceptable representations for EOL in 8859-1 significantly less
radical, more in line with current practice, or in some other
meaningful way preferable to just doing it for US-ASCII?

[ some stuff I mostly agree with deleted ]

>I'm happier with a definition that defines how clients are supposed to
>*interpret* the data than one that speaks of 'canonicalization' or
>'conversion', or where it is somehow the transport (HTTP) that is
>messing up. 

I think it's the same thing under different terms, as long as the
specification is clear and not guesswork-oriented.
--
Marc VanHeyningen  <URL:http://www.cs.indiana.edu/hyplan/mvanheyn.html>
Received on Friday, 23 December 1994 20:25:43 EST
