- From: Incnis Mrsi <browser@superstructure.info>
- Date: Wed, 23 Sep 2015 23:53:33 +0300
- To: www-international@w3.org
Hello.

I hereby inform the WWW Consortium about what I deem a bug present in several modern mainstream browsers. Under certain conditions, this bug causes browsers to interpret a perfectly well-formed HTTP/1.1 text/plain document incorrectly, defying the protocol’s specification.

1. Background

In 1997, HTTP/1.1 was published; it mandated an explicit “charset=” for text documents in *any* charset that is not a subset of ISO/IEC 8859-1. This was reiterated in 1999 (by RFC 2616), but low awareness among httpd admins caused the WWW Consortium to publish a statement http://www.w3.org/TR/html4/charset.html#h-5.2.2 that virtually invalidates the provisions of section 3.7.1 of RFC 2616 with respect to HTML documents (text/html).

When out-of-band data are absent (as is usually the case for files), determining the character encoding of a byte stream becomes a problem. Specifying the name of the encoding inside HTML files is an established practice, but text/plain documents lack any such workaround. For _Unicode_ encodings, sniffing (heuristic determination) between UTF-8, UTF-16LE and UTF-16BE is fairly reliable when a BOM is placed at the beginning. This is not the case without a BOM. Nor can one reliably guess whether a document in an unspecified encoding is Unicode at all, even if a BOM is expected in the Unicode cases.

In 2014, the HTML5 specification was approved. In section 8.2.2 http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream it specifies the so-called “encoding sniffing algorithm” which, evidently, reflects the increasing share of Unicode-encoded HTML documents. Note that an HTML document is a very special case of a text document, if only because the first character of an HTML document is normally “<”. Practices beneficial for HTML would not necessarily be successful for other “text/” subtypes, but wait… where does the specification claim that _other_ text types should follow it? AFAIK there is no such claim.

2. Facts

I prepared a simple test case at http://course.irccity.ru/ya-yu-9-amp.txt .
Let’s see what lies inside (with netcat or another low-level tool):

    GET /ya-yu-9-amp.txt HTTP/1.1
    Host: course.irccity.ru

    HTTP/1.1 200 OK
    Server: nginx/1.6.2
    Date: Wed, 23 Sep 2015 19:21:41 GMT
    Content-Type: text/plain; charset=Windows-1251
    Content-Length: 4
    Connection: keep-alive
    Last-Modified: Wed, 23 Sep 2015 08:50:48 GMT
    ETag: "62432b5-4-5206635fbca00"
    Accept-Ranges: bytes
    Content-Language: ru

    ÿþ9&

In words, we see an HTTP/1.1-compliant text/plain document encoded in Windows-1251 and containing exactly four characters: the Cyrillic letters “ya” and “yu” (lowercase), the ASCII digit nine, and an ampersand.

MS Internet Explorer shows a Unicode frowning face instead. Unsurprisingly, since Bush hid the facts. But what do Google Chrome and Firefox show? The same thing.

3. Analysis

Although ya-yu-9-amp.txt isn’t HTML, and doesn’t pretend to be HTML in any reasonable way, the browsers in question evidently apply the so-called “encoding sniffing”. Look:

• The document is explicitly labelled as “text/plain”.
• The document doesn’t contain a single “<” (in any interpretation).
• The document is too short to be HTML.

Although “text/html” is the most important of the “text/” media types on the WWW, webmasters using subtypes other than “html” are expected to be qualified enough to supply correct HTTP headers with them; history has shown that this is not the case for “text/html”. Overriding HTTP/1.1-compliant behaviour opens the possibility of inexplicable data loss with “text/” subtypes other than “html”, with a probability of about 1/25032 (40 ppm) for a uniformly distributed non-C0 (i.e. \040–\377) octet stream; see the table under point 3 of the “encoding sniffing algorithm”.

We can now guess that, when applying point 3 of the algorithm (which, as stated by the spec, “is a willful violation of the HTTP specification”), browsers forget to check whether the document type is HTML. Or, possibly, they don’t care about Content-Type at all, which may open an even broader entrance for glitches and exploits.

4. Conclusion

A small modification of the browsers’ logic (namely, executing points 2 and 3 conditionally, for HTML media type(s) only) could fix the bug. But it would be helpful to have an official W3 statement, something like “don’t ignore Content-Type, and use Unicode sniffing for HTML only”, before submitting bug reports to developers.

Regards,
Incnis Mrsi
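
The behaviour analysed above, and the fix suggested in the Conclusion, can be illustrated with a rough Python sketch. It is not the code of any browser. The function names (`parse_content_type`, `sniff_bom`, `decode_body`, `bom_collision_probability`), the `html_only` switch, and the choice of "text/html" as the only HTML media type are assumptions made for illustration. The BOM table follows point 3 of the “encoding sniffing algorithm”, and the ISO-8859-1 fallback follows section 3.7.1 of RFC 2616.

```python
from fractions import Fraction

# BOM table from point 3 of the HTML5 "encoding sniffing algorithm".
BOMS = [
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

# Assumption: only this media type counts as HTML in the sketch.
HTML_TYPES = {"text/html"}


def parse_content_type(header):
    """Split a Content-Type value into (media_type, charset or None)."""
    media_type, _, params = header.partition(";")
    charset = None
    for param in params.split(";"):
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip('"')
    return media_type.strip().lower(), charset


def sniff_bom(body):
    """Return (encoding, BOM length) if the body starts with a BOM."""
    for bom, encoding in BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_body(body, content_type, html_only=True):
    """Decode a response body.

    html_only=True  : the proposed fix, BOM sniffing for HTML only.
    html_only=False : BOM sniffing regardless of Content-Type,
                      which is what the browsers appear to do.
    """
    media_type, charset = parse_content_type(content_type)
    if not html_only or media_type in HTML_TYPES:
        encoding, skip = sniff_bom(body)
        if encoding:
            return body[skip:].decode(encoding)
    # Otherwise trust the declared charset; HTTP/1.1 default is ISO-8859-1.
    return body.decode(charset or "iso-8859-1")


def bom_collision_probability(alphabet_size=224):
    """Chance that a uniform stream over \\040-\\377 starts with a BOM."""
    n = alphabet_size
    return Fraction(2, n**2) + Fraction(1, n**3)


# The four octets of ya-yu-9-amp.txt and its Content-Type header.
body = b"\xff\xfe\x39\x26"
content_type = "text/plain; charset=Windows-1251"

# Header respected: the four intended characters.
assert decode_body(body, content_type) == "яю9&"
# Header ignored: FF FE is taken as a UTF-16LE BOM, 39 26 becomes U+2639.
assert decode_body(body, content_type, html_only=False) == "\u2639"
# 2/224^2 + 1/224^3 = 449/11239424, i.e. about 1/25032.
assert round(1 / bom_collision_probability()) == 25032
```

With `html_only=False` the first two octets are consumed as a UTF-16LE BOM and the remaining two decode to the single character U+2639, the frowning face shown by the browsers. With `html_only=True` the declared Windows-1251 charset is honoured for text/plain. The last function reproduces the 1/25032 estimate by counting the two 2-octet BOMs and the one 3-octet BOM over an alphabet of 224 octets.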
Received on Thursday, 24 September 2015 16:22:26 UTC