May HTML5 mandate interpretation of text/plain? from Incnis Mrsi on 2015-09-23 (www-international@w3.org from July to September 2015)

From: Incnis Mrsi <browser@superstructure.info>
Date: Wed, 23 Sep 2015 23:53:33 +0300
To: www-international@w3.org
Message-ID: <HTML5vsTextPlain.browser@superstructure.info>
Hello.

I hereby inform the WWW Consortium about what I deem a bug 
present in several modern mainstream browsers. 
Under certain conditions, this bug causes browsers to 
interpret a perfectly well-formed HTTP/1.1 text/plain document 
incorrectly, defying the protocol’s specification.


	1. Background

In 1997, HTTP/1.1 was published; it 
mandated an explicit “charset=” for text documents 
in *any* charset that is not a subset of ISO/IEC 8859-1.
This was reiterated in 1999 (by RFC 2616), 
but low awareness among httpd admins caused the WWW Consortium 
to publish a statement http://www.w3.org/TR/html4/charset.html#h-5.2.2 
that virtually invalidates the provisions of section 3.7.1 of RFC 2616 
with respect to HTML documents (text/html).

When out-of-band data are absent (as is usual for files), 
determining the character encoding of a byte stream becomes a problem.
Specifying the name of the encoding inside HTML files is an established practice, 
but text/plain documents lack any such workaround. 
For _Unicode_ encodings, sniffing (heuristic determination) 
between UTF-8/UTF-16LE/UTF-16BE is fairly reliable 
when a BOM is placed at the beginning. That is not the case without a BOM. 
Nor can one reliably guess whether 
a document in an unspecified encoding is Unicode at all, 
even if a BOM is expected in Unicode cases.
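
To make the BOM case concrete, here is a minimal Python sketch of such sniffing. 
The function name sniff_bom and the order of the checks are my own illustrative choices, 
not taken from any specification; only the three byte signatures are standard.

```python
import codecs

# The three Unicode signatures discussed above, as (BOM bytes, encoding name).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfabc"))    # UTF-8 signature present
print(sniff_bom(b"\xfe\xff\x00a"))      # UTF-16BE signature present
print(sniff_bom(b"plain ASCII text"))   # no signature, nothing can be concluded
```

Without a signature the function returns None, which is exactly the unreliable case: 
the bytes alone do not say what the encoding is.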

In 2014, the HTML5 specification was approved. 
In section 8.2.2 http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream 
it specifies the so-called “encoding sniffing algorithm” 
that, evidently, reflects an increasing share of Unicode-encoded HTML documents. 
Note that an HTML document is a very special case of a text document, 
if only because the first character of an HTML document is normally “<”. 
Practices beneficial for HTML would not necessarily be successful 
for other “text/” subtypes, but wait… 
where does the specification claim that _other_ text types should follow it? 
AFAIK there is no such claim.


	2. Facts

I prepared a simple test case at http://course.irccity.ru/ya-yu-9-amp.txt .
Let’s see what lies inside (with netcat or another low-level tool):

GET /ya-yu-9-amp.txt HTTP/1.1
Host: course.irccity.ru

HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Wed, 23 Sep 2015 19:21:41 GMT
Content-Type: text/plain; charset=Windows-1251
Content-Length: 4
Connection: keep-alive
Last-Modified: Wed, 23 Sep 2015 08:50:48 GMT
ETag: "62432b5-4-5206635fbca00"
Accept-Ranges: bytes
Content-Language: ru

яю9&

In words, we see an HTTP/1.1-compliant text/plain document 
encoded in Windows-1251 and containing exactly four characters: 
the Cyrillic letters “ya” and “yu” (lowercase), 
the ASCII digit nine, and the ampersand character.

MS Internet Explorer shows a Unicode frowning face instead. 
Unsurprisingly, since Bush hid the facts. 
But what do Google Chrome and Firefox show? The same thing.


	3. Analysis

Although ya-yu-9-amp.txt isn’t HTML, 
and doesn’t pretend to be HTML in any reasonable way, 
the browsers in question evidently apply the so-called “encoding sniffing”. 
Look:
 • The document is explicitly labelled as “text/plain”.
 • The document doesn’t contain a single “<” (in any interpretation).
 • The document is too short to be an HTML document.
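
The effect can be reproduced offline with a few lines of Python. 
This is only a sketch of what the sniffing apparently does to the four octets 
of the test case; the variable names are mine.

```python
# The four characters of the test document, encoded as the server labels them.
body = "яю9&".encode("cp1251")
print(body.hex(" "))                      # the octets as sent on the wire

# Read as labelled (Windows-1251): the original four characters come back.
as_labelled = body.decode("cp1251")
print(as_labelled)

# Read the way a BOM sniffer would: the first two octets are FF FE,
# i.e. the UTF-16LE signature, and the remaining two form one code unit.
has_le_bom = body.startswith(b"\xff\xfe")
as_sniffed = body[2:].decode("utf-16-le")
print(has_le_bom, as_sniffed, f"U+{ord(as_sniffed):04X}")
```

The octets are FF FE 39 26. Taken as UTF-16LE with a signature, they leave 
the single code unit 0x2639, which is the frowning face.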

Although “text/html” is the most important of the “text/” media types in the WWW, 
webmasters using subtypes other than “html” are expected to be 
qualified enough to supply correct HTTP headers with them; 
history has shown that this is not the case for “text/html”.
Overriding HTTP/1.1-compliant behaviour opens the possibility 
of unexplainable data losses with “text/” types other than “html”, 
with a probability of about 1/25032 (40 ppm) 
for a uniformly distributed non-C0 (i.e. \040–\377) octet stream; 
see the table under point 3 of the “encoding sniffing algorithm”.
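
For those who want to check the 1/25032 figure, here is the arithmetic as a short 
Python sketch. It assumes, as stated above, independent octets uniform over \040–\377 
and the three signatures from the table (FE FF, FF FE, EF BB BF); the variable names are mine.

```python
from fractions import Fraction

# Number of possible octet values in the range \040..\377.
n = 0o377 - 0o040 + 1

# Two signatures are two octets long (FE FF, FF FE), one is three (EF BB BF).
# They start with different octets, so the three events are mutually exclusive.
p = Fraction(2, n**2) + Fraction(1, n**3)

print(n)                    # size of the octet alphabet
print(p)                    # exact probability of an accidental signature
print(float(1 / p))         # "one in ..." form
print(float(p) * 1e6)       # parts per million
```

The exact value is 449/11239424, that is roughly one document in 25032.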

We can now guess that, when applying point 3 of the algorithm 
(which, as stated by the spec, “is a willful violation of the HTTP specification”), 
browsers forget to check whether the document type is HTML. 
Or, possibly, they don’t care about Content-Type at all, 
which may open an even broader entrance for glitches and exploits.


	4. Conclusion

A small modification of browsers’ logic (namely, executing steps 2 and 3 
conditionally, for HTML media type(s) only) could fix the bug. 
But it would be helpful to have an official W3C statement, something like 
“don’t ignore Content-Type, and use Unicode sniffing for HTML only”, 
before submitting bug reports to developers.
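
The modification I have in mind could look roughly like the following Python sketch. 
Everything here is hypothetical: pick_encoding, HTML_TYPES and the ISO-8859-1 fallback 
(the RFC 2616 default for unlabelled text) are illustrative assumptions, 
not the code of any actual browser.

```python
import codecs

# Assumption: the only media type for which BOM sniffing may override HTTP.
HTML_TYPES = {"text/html"}

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def pick_encoding(media_type, charset, body):
    """Hypothetical sketch: choose the encoding used to decode a response body."""
    # Sniff for a BOM only when the document is declared to be HTML.
    if media_type.lower() in HTML_TYPES:
        for bom, name in BOMS:
            if body.startswith(bom):
                return name
    # Otherwise trust Content-Type; fall back to the RFC 2616 default.
    return charset or "iso-8859-1"

test_body = b"\xff\xfe9&"   # the four octets of ya-yu-9-amp.txt
print(pick_encoding("text/plain", "windows-1251", test_body))
print(pick_encoding("text/html", "windows-1251", test_body))
```

With this condition the test case keeps its declared Windows-1251 encoding, 
while text/html documents are still sniffed as the HTML5 spec prescribes.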

Regards, Incnis Mrsi
Received on Thursday, 24 September 2015 16:22:26 UTC