Re: Re: Re: Unknown text/* subtypes from Eric Prud'hommeaux on 2008-01-17 (ietf-http-wg@w3.org from January to March 2008)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 17 Jan 2008 18:48:58 -0500
To: Ned Freed <ned.freed@mrochek.com>
Cc: Ian Hickson <ian@hixie.ch>, Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>, ietf-http-wg@w3.org, ietf-types@alvestrand.no
Message-ID: <20080117234858.GI4974@w3.org>
* Ned Freed <ned.freed@mrochek.com> [2008-01-13 16:26-0800]
> > * Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> > > On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > > >
> > > > Years later (after 2616bis) it might be possible to upgrade "default
> > > > ASCII" to UTF-8, Latin-1 was a dead end.  As soon as we're back to
> > > > "default ASCII" just let RFC 2277 finish it off.
> > >
> > > FWIW, a number of specs are already overriding both MIME and HTTP when it
> > > comes to character encodings. For example HTML4 says to not default to any
> > > encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> > > currently proposed defaults to an even more complicated heuristic [3], and
> > > so on.
> > >
> > > In the "real world" the implementations are following the heuristics
> > > described in CSS2.1 and HTML5 (or something close to them), and those
> > > differ for text/css and text/html, so it would seem pointless for HTTP to
> > > try to define something here: it would just get ignored.
> > >
> > > IMHO the best option is for HTTP to stay out of the discussion altogether
> > > and let the lower level specs (MIME) and the higher level specs (XML,
> > > HTML, CSS, etc, defining the formats) figure it out amongst themselves.
> 
> > I think this is consistent with Martin's proposal that HTTP1.1bis not
> > set a default encoding
> >   http://www.w3.org/2008/01/rdf-media-types#noDefault
> > (noting that Frank Ellermann believed the default should be us-ascii for
> >  the same effect)
> >   http://www.w3.org/2008/01/rdf-media-types#defAscii
> 
> > What we still need, however, is an update to 2046 that reflects
> > current practice (and eases the discovery process for folks
> > registering non-ascii text/ media types). Let's geek out the
> > changes we'd like to see.
> 
> You might, and I emphasize might, be able to get this changed to a protocol-
> specific restriction. (The MIME specifications specify both an email-specific
> extension as well as some more generally useful facilities.) There is no chance
> of this rule being lifted in general.
> 
> > • CRLF rules:
> > [[
> >   The canonical form of any MIME "text" subtype MUST always represent
> >   a line break as a CRLF sequence.  Similarly, any occurrence of CRLF
> >   in MIME "text" MUST represent a line break.  Use of CR and LF
> >   outside of line break sequences is also forbidden.
> > ]] — RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
> > is not respected by HTTP1.1, nor is it respected in general when
> > shipping text/xml.
> 
> > Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
> > web servers, proxies or clients)?
> 
> Not only does email depend on this, conformance to this has been dramatically
> strengthened, not weakened, in subsequent revisions of the email protocol
> specification. Specifically, RFC 821 was essentially silent on what bare CR and
> LF mean, but 2821 and 2821bis (now in last call) both say that bare CR and LF
> MUST NOT be sent and if received MUST NOT be treated as CRLF.

I guess we can measure compliance by what causes the clients to wrap
lines. Attached is a set of documents in different encodings, each
served with text/x-unknown and no charset. I expect this is a violation
of 2046, but I wanted to see what clients actually do.

These documents are also available on the web:

http://www.w3.org/2008/01/text/CRLF.txu      CRs and LFs
http://www.w3.org/2008/01/text/latin-1-diac.txu     díâçrìtïcàls in latin-1
http://www.w3.org/2008/01/text/utf-8-diac.txu     díâçrìtïcàls in utf-8
http://www.w3.org/2008/01/text/utf-8-cyrillic.txu   сириллик in utf-8
http://www.w3.org/2008/01/text/utf-8-kanji.txu     漢字 in utf-8
http://www.w3.org/2008/01/text/shift-jis-kanji.txu  漢字 in shift_jis

and their behavior in various web clients is documented at
  http://esw.w3.org/topic/ConformanceNotes
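
For comparison with what clients do, the RFC 2046 §4.1.1 rule quoted
above can be checked mechanically. This is a minimal sketch, not taken
from any of the specs: the function names are hypothetical, and it
works on raw octets, so it assumes an encoding in which 0x0D and 0x0A
only ever mean CR and LF (this does not hold for UTF-16, for example).

```python
import re

# A CR that is not followed by LF, or an LF that is not preceded by CR.
BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def bare_line_break_offsets(body: bytes) -> list[int]:
    """Return the offsets of CR or LF octets that are not part of a CRLF pair."""
    return [match.start() for match in BARE_CR_OR_LF.finditer(body)]


def is_canonical_text(body: bytes) -> bool:
    """True if every CR and LF in the body belongs to a CRLF sequence."""
    return not bare_line_break_offsets(body)
```

Running is_canonical_text() over the bytes of CRLF.txu would be one way
to confirm that the test document really does contain bare CRs and LFs.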

> This, incidentally, is not the way I personally think things should have been
> done. I like the "ignore bare CR treat LF like CRLF" approach. But my personal
> opinion isn't especially relevant - I mention it only to avoid "shoot the
> messenger" sorts of responses.
> 
> > I would like to think that MIME
> > shouldn't care about recognizing new lines in the text block.
> 
> I'm sorry, but that's fanciful in the extreme.

Let me try to provide a concrete argument against my proposal:

"Mail, News and HTTP clients all need to render text/ stuff on the
screen, and a shared understanding of CRLF is essential to the user
experience."

> > If it can't go away, can it be relaxed in accordance with HTTP 1.1
> > [[
> >   The line terminator for message-header fields is the sequence CRLF.
> >   However, we recommend that applications, when parsing such headers,
> >   recognize a single LF as a line terminator and ignore the leading
> >   CR.
> > ]] — RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3
> 
> Again, I personally think this is the way to go. But that's not what
> has happened.

It's not what happened spec-wise, but I'm using this to see what happened
in the implementations.
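
The tolerant behaviour recommended in the RFC 2616 §19.3 text quoted
above can be sketched as follows. This is only an illustration under
stated assumptions: the function name is hypothetical, the input is a
block of octets, and "ignore the leading CR" is read as "drop one CR
immediately before the LF". A CR elsewhere in a line is left alone.

```python
def split_lines_tolerant(data: bytes) -> list[bytes]:
    """Split on LF, accepting both CRLF and a single LF as the terminator."""
    lines = data.split(b"\n")
    # A trailing LF leaves one empty element at the end; it is not a line.
    if lines and lines[-1] == b"":
        lines.pop()
    # Drop the CR that directly precedes the LF, if there is one.
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]
```

A strict RFC 2046 reading would instead refuse to treat the bare LF as
a line break at all, which is the difference the test documents are
meant to expose.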

> > or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
> > [[
> >    1. the two-character sequence #xD #xA
> 
> >    2. the two-character sequence #xD #x85
> 
> >    3. the single character #x85
> 
> >    4. the single character #x2028
> 
> >    5. any #xD character that is not immediately followed by #xA or
> >       #x85.
> > ]] — XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends
> 
> > The XML 1.1 rule interacts with character encoding because, while most
> > character encodings line up with ascii on CR and LF, clearly none do
> > on #x85 and #x2028
> 
> > • character encoding:
> > [[
> > Unlike some other parameter values, the values of the charset
> > parameter are NOT case sensitive.  The default character set, which
> > must be assumed in the absence of a charset parameter, is US-ASCII.
> 
> > The specification for any future subtypes of "text" must specify
> > whether or not they will also utilize a "charset" parameter, and may
> > possibly restrict its values as well.  For other subtypes of "text"
> > than "text/plain", the semantics of the "charset" parameter should be
> > defined to be identical to those specified here for "text/plain",
> > i.e., the body consists entirely of characters in the given charset.
> > In particular, definers of future "text" subtypes should pay close
> > attention to the implications of multioctet character sets for their
> > subtype definitions.
> 
> > The charset parameter for subtypes of "text" gives a name of a
> > character set, as "character set" is defined in RFC 2045.  The rules
> > regarding line breaks detailed in the previous section must also be
> > observed -- a character set whose definition does not conform to these
> > rules cannot be used in a MIME "text" subtype.
> > ]] — RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.
> 
> > When should the "default" character set apply?
> >   • no charset parameter
> >   • no charset parameter, no fixed encoding for the media type
> >   • no charset, no fixed encoding, no internal encoding declaration
> 
> > The current text specifies the first, while HTML and CSS count on the
> > third. From the use case of "best effort rendering", we are already in
> > a state where users who are better-informed than their web or mail
> > clients manually set the encoding so they can see the right
> > characters. The following heuristics may meet or exceed the user
> > experience with today's data while advancing the state of the art to
> > enable better rendering with future data:
> > [[
> > Unlike some other parameter values, the values of the charset
> > parameter are NOT case sensitive. The first of the following
> > determinants that apply will identify the character set:
> 
> >   1. charset parameter
> 
> >   2. fixed encoding registered with the media type, if known
> 
> >   3. encoding algorithm registered with the media type, if known
> 
> >   4. UTF-8 if the document conforms to the UTF-8 encoding pattern
> 
> >   5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]
> 
> >   6. application preference
> > ]]
> 
> Again, there is absolutely no chance this will fly for email so it cannot be 
> written with this degree of generality. And if this is made protocol specific
> the specifics of any protocol other than email don't belong in a RFC 2046
> revision.

fair point. HTTP1.1bis folks may want to consider this.
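
To make the proposed ordering concrete, here is a minimal sketch of the
six determinants quoted above. Everything named in it is hypothetical:
the two registry tables stand in for whatever a media type registration
would provide, and the caller supplies the application preference. One
assumption needs flagging: pure ASCII is also valid UTF-8, so the
sketch takes step 4 to require at least one non-ASCII octet, since
otherwise step 5 could never apply. The proposal does not say this.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-ins for data a media type registration might supply.
FIXED_ENCODINGS: dict[str, str] = {}
ENCODING_ALGORITHMS: dict[str, Callable[[bytes], Optional[str]]] = {}

# Step 5 octet range, exactly as written in the proposal.
STEP5_OCTETS = re.compile(rb"[\r\n\x20-\x7e]*")


def conforms_to_utf8(body: bytes) -> bool:
    """Valid UTF-8 with at least one non-ASCII octet (assumed reading of step 4)."""
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(octet >= 0x80 for octet in body)


def pick_charset(media_type: str, charset_param: Optional[str],
                 body: bytes, app_preference: str) -> str:
    """Return the first determinant that applies, in the proposed order."""
    if charset_param:                                  # 1. charset parameter
        return charset_param.lower()                   # values are not case sensitive
    if media_type in FIXED_ENCODINGS:                  # 2. fixed encoding
        return FIXED_ENCODINGS[media_type]
    algorithm = ENCODING_ALGORITHMS.get(media_type)    # 3. encoding algorithm
    if algorithm is not None:
        found = algorithm(body)
        if found:
            return found
    if conforms_to_utf8(body):                         # 4. UTF-8 pattern
        return "utf-8"
    if STEP5_OCTETS.fullmatch(body):                   # 5. [\r\n\x20-\x7e] only
        return "iso-8859-1"
    return app_preference                              # 6. application preference
```

For a text/x-unknown document with no charset, such as the test files
above, only steps 4 to 6 can apply, so the latin-1 and shift_jis
documents both fall through to the application preference.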

> > @@charset constraints — can it have faux line feeds?
> 
> > @@bidi? Martin, what do you think?
> 
> > @@lowest common denominator:
> >   RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
> > Is it better to encourage the world to write "UTF-8" or "US-ASCII"
> > for ascii subset? tension between lcd and one common encoding.
> 
> Marking something as utf-8 when it is in fact restricted to the us-ascii subset
> has been known to cause problems. I think change in this area is unlikely.

tx for your thoughts and attention on this.

>     Ned


@@is a "no-change revision" changing things if it
documents current practice contrary to the old spec?
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 17 January 2008 23:49:52 UTC