- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 25 Sep 2015 14:52:43 +0900
- To: Incnis Mrsi <browser@superstructure.info>, <www-international@w3.org>
Hello Incnis,

I'm just replying personally, not for the W3C. I think you should
address yourself to the WHATWG, since that's where the relevant spec is
maintained.

On 2015/09/24 05:53, Incnis Mrsi wrote:
> Hello.
>
> I hereby inform the WWW Consortium about something I deem a bug
> present in several modern mainstream browsers. This bug causes
> browsers, under certain conditions, to interpret a perfectly
> well-formed HTTP/1.1 text/plain document incorrectly, defying the
> protocol’s specification.
>
> 1. Background
>
> In 1997, HTTP/1.1 was published, mandating an explicit “charset=” for
> text documents in *any* charset that is not a subset of ISO/IEC
> 8859-1. This was reiterated in 2000 (by RFC 2616), but low awareness
> among httpd admins caused the WWW Consortium to publish a statement,
> http://www.w3.org/TR/html4/charset.html#h-5.2.2, that virtually
> invalidates the provisions of section 3.7.1 of RFC 2616 with respect
> to HTML documents (text/html).

Yes, the iso-8859-1 'default' was invalidated because there were
millions and millions of documents for which it would have been wrong,
especially in Eastern Europe and Asia.

> When out-of-band data are absent (as is usual for files), determining
> the character encoding of a byte stream becomes a problem. Specifying
> the name of the encoding inside HTML files is an established
> practice, but text/plain documents lack any such workaround. For
> _Unicode_ encodings, sniffing (heuristic determination) between
> UTF-8/UTF-16LE/UTF-16BE is fairly reliable when a BOM is placed at
> the beginning. This is not the case without a BOM. Neither can one
> reliably guess whether a document in an unspecified encoding is
> Unicode at all, even if a BOM is expected in the Unicode case.

Even without the BOM, it is still fairly reliable. For some details,
please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf.

> In 2014, the HTML5 specification was approved. In section 8.2.2,
> http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream, it
> specifies the so-called “encoding sniffing algorithm”, which
> evidently reflects the increasing share of Unicode-encoded HTML
> documents. Note that an HTML document is a very special case of a
> text document, if only because an HTML document's first character is
> normally “<”. Practices beneficial for HTML would not necessarily be
> successful for other “text/” subtypes, but wait… where does the
> specification claim that _other_ text types should follow it? AFAIK
> there is no such claim.

I also haven't found it, but I haven't had time to look very
thoroughly.

> 2. Facts
>
> I prepared a simple test case at
> http://course.irccity.ru/ya-yu-9-amp.txt. Let’s see what lies inside
> (with netcat or another low-level tool):
>
> GET /ya-yu-9-amp.txt HTTP/1.1
> Host: course.irccity.ru
>
> HTTP/1.1 200 OK
> Server: nginx/1.6.2
> Date: Wed, 23 Sep 2015 19:21:41 GMT
> Content-Type: text/plain; charset=Windows-1251
> Content-Length: 4
> Connection: keep-alive
> Last-Modified: Wed, 23 Sep 2015 08:50:48 GMT
> ETag: "62432b5-4-5206635fbca00"
> Accept-Ranges: bytes
> Content-Language: ru
>
> яю9&
>
> In words, we see an HTTP/1.1-compliant text/plain document encoded in
> Windows-1251 and containing exactly four characters: the Cyrillic
> letters “ya” and “yu” (lowercase), the ASCII digit nine, and the
> ampersand character.
>
> MS Internet Explorer shows a Unicode frowning face instead.
> Unsurprisingly, since Bush hid the facts. But what do Google Chrome
> and Firefox show? The same thing.

This is a cute, but nonsensical, example.
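To spell out why the example is cute: the Windows-1251 bytes for “яю”
are 0xFF 0xFE, which is exactly the UTF-16LE byte order mark. A small
Python sketch of my own (the codec names are Python's, not from either
message):

    data = "яю9&".encode("windows-1251")
    print(data.hex())             # fffe3926
    # 0xFF 0xFE looks like a UTF-16LE BOM, so a sniffer that checks for
    # a BOM before honouring the charset parameter decodes the
    # remaining bytes 0x39 0x26 as UTF-16LE:
    print(data.decode("utf-16"))  # ☹ (U+2639 WHITE FROWNING FACE)

So all three browsers end up showing the same single character: the
four bytes are chosen so that a BOM-first check flips the
interpretation of the whole document.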
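For reference, the BOM table under point 3 of the algorithm, which the
analysis below cites, amounts to roughly the following check (a
paraphrase in Python, not the spec's own text; the function name is
mine):

    def bom_encoding(data: bytes):
        if data[:3] == b"\xef\xbb\xbf":
            return "utf-8"
        if data[:2] == b"\xfe\xff":
            return "utf-16be"
        if data[:2] == b"\xff\xfe":
            return "utf-16le"
        return None  # no BOM; later steps of the algorithm apply

The point at issue is that a match here wins over the transport-layer
charset, which is what the spec itself flags as a willful violation of
the HTTP specification.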
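The 1/25032 (40 ppm) figure quoted in section 3 below also checks out
under the stated assumption of octets drawn uniformly from \040–\377
(224 possible values per octet):

    n = 0o377 - 0o040 + 1    # 224 non-C0 octet values
    p = 2 / n**2 + 1 / n**3  # UTF-16LE/BE BOMs, plus the UTF-8 BOM
    print(n, 1 / p)          # 224 25032.12...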
> 3. Analysis
>
> Although ya-yu-9-amp.txt isn’t HTML, and doesn’t pretend to be HTML
> in any reasonable way, the browsers in question evidently apply the
> so-called “encoding sniffing”. Look:
> • The document is explicitly labelled as “text/plain”.
> • The document doesn’t contain a single “<” (in any interpretation).
> • The document is too short to be an HTML document.
>
> Although “text/html” is the most important of the “text/” media types
> on the WWW, webmasters using subtypes other than “html” can be
> expected to be qualified enough to supply correct HTTP headers with
> them; history has shown that this was not the case for “text/html”.
> Overriding HTTP/1.1-compliant behaviour opens the possibility of
> inexplicable data loss with “text/” types other than “html”, with a
> probability of about 1/25032 (40 ppm) for a uniformly distributed
> non-C0 (i.e. \040–\377) octet stream; see the table under point 3 of
> the “encoding sniffing algorithm”.
>
> We can now guess that, when applying point 3 of the algorithm (which,
> as stated by the spec, “is a willful violation of the HTTP
> specification”), browsers forget to check whether the document type
> is HTML. Or possibly they don’t care about Content-Type at all, which
> may open an even broader door to glitches and exploits.

I don't think that interpreting a text/plain document with a wrong
character encoding can lead to exploits, because text/plain content
isn't executed in any way.

> 4. Conclusion
>
> A small modification of the browsers’ logic (namely, executing points
> 2 and 3 conditionally, for HTML media type(s) only) could fix the
> bug. But it would be helpful to have an official W3C statement,
> something like “don’t ignore Content-Type, and use Unicode sniffing
> for HTML only”, before submitting bug reports to the developers.

I would just go ahead and submit this as a bug. However, I think it
would be good to have a better example text. The current example text
is just nonsense, as far as I understand.

Regards,   Martin.

> Regards, Incnis Mrsi
Received on Friday, 25 September 2015 05:53:33 UTC