- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Thu, 17 Jan 2008 18:48:58 -0500
- To: Ned Freed <ned.freed@mrochek.com>
- Cc: Ian Hickson <ian@hixie.ch>, Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>, ietf-http-wg@w3.org, ietf-types@alvestrand.no
- Message-ID: <20080117234858.GI4974@w3.org>
* Ned Freed <ned.freed@mrochek.com> [2008-01-13 16:26-0800]
> > * Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> > > On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > > >
> > > > Years later (after 2616bis) it might be possible to upgrade "default
> > > > ASCII" to UTF-8, Latin-1 was a dead end. As soon as we're back to
> > > > "default ASCII" just let RFC 2277 finish it off.
> > >
> > > FWIW, a number of specs are already overriding both MIME and HTTP when it
> > > comes to character encodings. For example HTML4 says to not default to any
> > > encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> > > currently proposed defaults to an even more complicated heuristic [3], and
> > > so on.
> > >
> > > In the "real world" the implementations are following the heuristics
> > > described in CSS2.1 and HTML5 (or something close to them), and those
> > > differ for text/css and text/html, so it would seem pointless for HTTP to
> > > try to define something here: it would just get ignored.
> > >
> > > IMHO the best option is for HTTP to stay out of the discussion altogether
> > > and let the lower level specs (MIME) and the higher level specs (XML,
> > > HTML, CSS, etc, defining the formats) figure it out amongst themselves.
>
> > I think this is consistent with Martin's proposal that HTTP1.1bis not
> > set a default encoding
> >   http://www.w3.org/2008/01/rdf-media-types#noDefault
> > (noting that Frank Ellermann believed the default should be us-ascii for
> > the same effect)
> >   http://www.w3.org/2008/01/rdf-media-types#defAscii
>
> > What we still need, however, is an update to 2046 that reflects
> > current practice (and eases the discovery process for folks
> > registering non-ascii text/ media types). Let's geek out the
> > changes we'd like to see.
>
> You might, and I emphasize might, be able to get this changed to a protocol
> specific restriction. (The MIME specifications specify both an email-specific
> extension as well as some more generally useful facilities.) There is no chance
> of this rule being lifted in general.
>
> > • CRLF rules:
> > [[
> > The canonical form of any MIME "text" subtype MUST always represent
> > a line break as a CRLF sequence. Similarly, any occurrence of CRLF
> > in MIME "text" MUST represent a line break. Use of CR and LF
> > outside of line break sequences is also forbidden.
> > ]] -- RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
> > is not respected by HTTP1.1, nor is it respected in general when
> > shipping text/xml.
>
> > Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
> > web servers, proxies or clients)?
>
> Not only does email depend on this, conformance to this has been dramatically
> strengthened, not weakened, in subsequent revisions of the email protocol
> specification. Specifically, RFC 821 was essentially silent on what bare CR and
> LF mean, but 2821 and 2821bis (now in last call) both say that bare CR and LF
> MUST NOT be sent and if received MUST NOT be treated as CRLF.

I guess we can measure compliance by what causes the clients to wrap
lines. Attached is a set of documents in different encodings, each
served with text/x-unknown and no charset. I expect this is a violation
of 2046, but I wanted to see what clients actually do. These documents
are also available on the web:
  http://www.w3.org/2008/01/text/CRLF.txu             CRs and LFs
  http://www.w3.org/2008/01/text/latin-1-diac.txu     díâçrìtïcàls in latin-1
  http://www.w3.org/2008/01/text/utf-8-diac.txu       díâçrìtïcàls in utf-8
  http://www.w3.org/2008/01/text/utf-8-cyrillic.txu   сириллик in utf-8
  http://www.w3.org/2008/01/text/utf-8-kanji.txu      漢字 in utf-8
  http://www.w3.org/2008/01/text/shift-jis-kanji.txu  漢字 in shift_jis
and their behavior in various web clients is documented at
  http://esw.w3.org/topic/ConformanceNotes

> This, incidentally, is not the way I personally think things should have been
> done. I like the "ignore bare CR, treat LF like CRLF" approach. But my personal
> opinion isn't especially relevant - I mention it only to avoid "shoot the
> messenger" sorts of responses.
>
> > I would like to think that MIME
> > shouldn't care about recognizing new lines in the text block.
>
> I'm sorry, but that's fanciful in the extreme.

Let me try to provide a concrete argument against my proposal:
"Mail, News and HTTP clients all need to render text/ stuff on the
screen, and a shared understanding of CRLF is essential to the user
experience."

> > If it can't go away, can it be relaxed in accordance with HTTP 1.1
> > [[
> > The line terminator for message-header fields is the sequence CRLF.
> > However, we recommend that applications, when parsing such headers,
> > recognize a single LF as a line terminator and ignore the leading
> > CR.
> > ]] -- RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3
>
> Again, I personally think this is the way to go. But that's not what
> has happened.

It's not what happened spec-wise, but I'm using this to see what
happened in the implementations.

> > or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
> > [[
> > 1. the two-character sequence #xD #xA
>
> > 2. the two-character sequence #xD #x85
>
> > 3. the single character #x85
>
> > 4. the single character #x2028
>
> > 5. any #xD character that is not immediately followed by #xA or
> > #x85.
> > ]] -- XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends
>
> > The XML 1.1 rule interacts with character encoding because, while most
> > character encodings line up with ascii on CR and LF, clearly none do
> > on #x85 and #x2028
>
> > • character encoding:
> > [[
> > Unlike some other parameter values, the values of the charset
> > parameter are NOT case sensitive. The default character set, which
> > must be assumed in the absence of a charset parameter, is US-ASCII.
>
> > The specification for any future subtypes of "text" must specify
> > whether or not they will also utilize a "charset" parameter, and may
> > possibly restrict its values as well. For other subtypes of "text"
> > than "text/plain", the semantics of the "charset" parameter should be
> > defined to be identical to those specified here for "text/plain",
> > i.e., the body consists entirely of characters in the given charset.
> > In particular, definers of future "text" subtypes should pay close
> > attention to the implications of multioctet character sets for their
> > subtype definitions.
>
> > The charset parameter for subtypes of "text" gives a name of a
> > character set, as "character set" is defined in RFC 2045. The rules
> > regarding line breaks detailed in the previous section must also be
> > observed -- a character set whose definition does not conform to these
> > rules cannot be used in a MIME "text" subtype.
> > ]] -- RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.
>
> > When should the "default" character set apply?
> > • no charset parameter
> > • no charset parameter, no fixed encoding for the media type
> > • no charset, no fixed encoding, no internal encoding declaration
>
> > The current text specifies the first, while HTML and CSS count on the
> > third. From the use case of "best effort rendering", we are already in
> > a state where users who are better-informed than their web or mail
> > clients manually set the encoding so they can see the right
> > characters. The following heuristics may meet or exceed the user
> > experience with today's data while advancing the state of the art to
> > enable better rendering with future data:
> > [[
> > Unlike some other parameter values, the values of the charset
> > parameter are NOT case sensitive. The first of the following
> > determinants that applies will identify the character set:
>
> > 1. charset parameter
>
> > 2. fixed encoding registered with the media type, if known
>
> > 3. encoding algorithm registered with the media type, if known
>
> > 4. UTF-8 if the document conforms to the UTF-8 encoding pattern
>
> > 5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]
>
> > 6. application preference
> > ]]
>
> Again, there is absolutely no chance this will fly for email so it cannot be
> written with this degree of generality. And if this is made protocol specific
> the specifics of any protocol other than email don't belong in a RFC 2046
> revision.

Fair point. HTTP1.1bis folks may want to consider this.

> > @@charset constraints -- can it have faux line feeds?
>
> > @@bidi?

Martin, what do you think?

> > @@lowest common denominator:
> > RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
> > Is it better to encourage the world to write "UTF-8" or "US-ASCII"
> > for ascii subset? tension between lcd and one common encoding.
>
> Marking something as utf-8 when it is in fact restricted to the us-ascii subset
> has been known to cause problems. I think change in this area is unlikely.

Thanks for your thoughts and attention on this.

> Ned

@@ Is a "no-change revision" changing things if it documents current
practice contrary to the old spec?
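
For anyone who wants to try the six-step determinant list quoted above against the test documents, here is a minimal Python sketch of that ordering. It is an illustration under stated assumptions, not text from any spec: the function name pick_charset, the two registries, the media type "text/x-fixed-example" and the "windows-1252" fallback are all hypothetical, and step 4 is approximated by a strict UTF-8 decode.

```python
import re
from typing import Callable, Mapping, Optional

# Hypothetical registries, for illustration only; no spec defines these.
FIXED_ENCODINGS: Mapping[str, str] = {"text/x-fixed-example": "utf-16"}
SNIFFERS: Mapping[str, Callable[[bytes], Optional[str]]] = {}

# Octet range from step 5 of the quoted list: [\r\n\x20-\x7e]
PRINTABLE_ASCII = re.compile(rb"[\r\n\x20-\x7e]*")


def pick_charset(
    media_type: str,
    charset_param: Optional[str],
    body: bytes,
    app_preference: str = "windows-1252",  # assumed default, step 6
    fixed: Mapping[str, str] = FIXED_ENCODINGS,
    sniffers: Mapping[str, Callable[[bytes], Optional[str]]] = SNIFFERS,
) -> str:
    """Return the first determinant that applies, in the quoted order."""
    # 1. charset parameter (values are not case sensitive, so normalize)
    if charset_param:
        return charset_param.lower()
    # 2. fixed encoding registered with the media type, if known
    if media_type in fixed:
        return fixed[media_type]
    # 3. encoding algorithm registered with the media type, if known
    sniff = sniffers.get(media_type)
    if sniff is not None:
        found = sniff(body)
        if found:
            return found
    # 4. UTF-8 if the octets decode as strict UTF-8
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e].
    #    Such octets are also valid UTF-8, so with the order as listed
    #    this branch is only reached if step 4 is removed or reordered.
    if PRINTABLE_ASCII.fullmatch(body):
        return "iso-8859-1"
    # 6. application preference
    return app_preference
```

With no charset parameter and an unregistered media type (as with the text/x-unknown test documents), a UTF-8 body resolves at step 4 and a Latin-1 body with diacritics falls through to the application preference.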
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 17 January 2008 23:49:52 UTC