Re: May HTML5 mandate interpretation of text/plain? from Martin J. Dürst on 2015-09-25 (www-international@w3.org from July to September 2015)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 25 Sep 2015 14:52:43 +0900
To: Incnis Mrsi <browser@superstructure.info>, <www-international@w3.org>
Message-ID: <5604E12B.6050600@it.aoyama.ac.jp>
Hello Incnis,

I'm just replying personally, not for the W3C. I think you should 
address yourself to WHATWG, since that's where @@@@

On 2015/09/24 05:53, Incnis Mrsi wrote:
> Hello.
>
> I hereby inform the WWW Consortium about the thing I deem a bug present
> in several modern mainstream browsers. This bug causes browsers, under
> certain conditions, to interpret a perfectly well-formed HTTP/1.1
> text/plain document incorrectly, defying the protocol’s specification.
>
>
>      1. Background
>
> In 1997, HTTP/1.1 was published that mandated explicit “charset=” for
> text documents having *any* charset not a subset of ISO/IEC 8859-1.
> This was reiterated in 1999 (by RFC 2616), but low awareness among httpd
> admins caused the WWW Consortium to publish a statement
> http://www.w3.org/TR/html4/charset.html#h-5.2.2 that virtually
> invalidates provisions of the section 3.7.1. of RFC 2616 with respect to
> HTML documents (text/html).

Yes, the iso-8859-1 'default' was invalidated because there were 
millions and millions of documents for which it would have been wrong, 
especially in Eastern Europe and Asia.

> When out-of-band data are absent (such as in files, usually),
> determining character encoding of a byte stream becomes a problem.
> Specifying the name of encoding in HTML files is an established
> practice. But text/plain documents lack any such workaround. For
> _Unicode_ encoding, sniffing (heuristic determination) between
> UTF-8/UTF-16LE/UTF-16BE is fairly reliable when BOM is placed at the
> beginning. It’s not the case without BOM. Neither can one reliably guess
> whether the document in an unspecified encoding is Unicode, even if BOM
> is expected in Unicode cases.

Even without the BOM, it is still fairly reliable. For some details, 
please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf.
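
As a rough illustration of why BOM-less detection can work: one common heuristic (a sketch only, not necessarily the method described in the paper above) is to test whether the bytes are well-formed UTF-8, since text in legacy 8-bit encodings rarely passes that test by accident. The helper name looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same four characters in two encodings.
cp1251_bytes = "яю9&".encode("cp1251")  # legacy 8-bit encoding
utf8_bytes = "яю9&".encode("utf-8")

print(looks_like_utf8(cp1251_bytes))  # False: 0xFF never occurs in UTF-8
print(looks_like_utf8(utf8_bytes))    # True
```

Note that pure ASCII text also passes this check, which is harmless because ASCII is a subset of UTF-8.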


> In 2014, the HTML5 specification was approved. In section 8.2.2.
> http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream it
> specified a so-called “encoding sniffing algorithm” that, evidently,
> reflects an increasing share of Unicode-encoded HTML documents. Note
> that an HTML document is a very special case of a text document, if
> only because the first character of HTML is normally “<”. Practices
> beneficial for HTML would not necessarily be successful for other
> “text/” subtypes, but wait… where does the specification claim that
> _other_ text types should follow it? AFAIK there is no such claim.

I also haven't found it, but I haven't had time to look very thoroughly.


>      2. Facts
>
> I prepared a simple test case at http://course.irccity.ru/ya-yu-9-amp.txt .
> Let’s see what lies inside (with netcat or other low-level tool):
>
> GET /ya-yu-9-amp.txt HTTP/1.1
> Host: course.irccity.ru
>
> HTTP/1.1 200 OK
> Server: nginx/1.6.2
> Date: Wed, 23 Sep 2015 19:21:41 GMT
> Content-Type: text/plain; charset=Windows-1251
> Content-Length: 4
> Connection: keep-alive
> Last-Modified: Wed, 23 Sep 2015 08:50:48 GMT
> ETag: "62432b5-4-5206635fbca00"
> Accept-Ranges: bytes
> Content-Language: ru
>
> яю9&
>
> In words, we see an HTTP/1.1-compliant text/plain document encoded in
> Windows-1251 and containing exactly four characters: Cyrillic letters
> “ya” and “yu” (lowercase), ASCII digit nine, and ampersand character.
>
> MS Internet Explorer shows a Unicode frowning face instead. Unsurprisingly,
> since Bush hid the facts. But what do Google Chrome and Firefox show?
> The same thing.

This is a cute, but nonsensical, example.
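
For readers who want to see why a frowning face appears, the following Python sketch reproduces the effect from the four characters in the test case. It assumes the browsers treat the first two bytes as a UTF-16LE byte order mark, which is what the observed output suggests; the variable names are illustrative only.

```python
# The four characters of the test document, encoded as the server declares.
body = "яю9&".encode("cp1251")
print(body.hex(" "))  # ff fe 39 26

# Interpretation according to the Content-Type header.
as_declared = body.decode("cp1251")
print(as_declared)  # яю9&

# Interpretation after BOM sniffing: FF FE is taken as a UTF-16LE BOM and
# dropped, leaving the bytes 39 26, i.e. the single code unit 0x2639.
as_sniffed = body[2:].decode("utf-16-le")
print(f"U+{ord(as_sniffed):04X}")  # U+2639
```

U+2639 is WHITE FROWNING FACE, so the two Cyrillic letters are consumed as a BOM and the digit and ampersand merge into one character.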


>      3. Analysis
>
> Although ya-yu-9-amp.txt isn’t HTML, and doesn’t pretend to be HTML in
> any reasonable way, the browsers in question evidently apply the
> so-called “encoding sniffing”. Look:
> • The document is explicitly labelled as “text/plain”.
> • The document doesn’t contain a single “<” (in any interpretation).
> • The document is too short to be an HTML document.
>
> Although “text/html” is the most important of “text/” media types in the
> WWW, webmasters using subtypes other than “html” are expected to be
> qualified enough to supply correct HTTP headers with them; history
> has shown that this is not the case for “text/html”.
> Overriding HTTP/1.1-compliant behaviour opens a possibility of
> unexplainable data losses with “text/” other than “html”, with
> probability about 1/25032 (40 ppm) for a uniformly distributed non-C0
> (i.e. \040–\377) octet stream; see the table under the point 3. of the
> “encoding sniffing algorithm”.
>
> We can now guess that, when applying the point 3. of the algorithm
> (that, as stated by the spec, “is a willful violation of the HTTP
> specification”) browsers forget to check whether the document type is
> HTML. Or, possibly, don’t care about Content-Type at all, which may open
> an even broader entrance for glitches and exploits.
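
The figure of about 1/25032 quoted above can be reproduced with a short calculation. This sketch assumes the sniffing table contains exactly three byte order marks (FE FF, FF FE, and EF BB BF) and that the leading octets are independent and uniformly distributed over \040–\377.

```python
from fractions import Fraction

# Number of possible octet values in the range \040 to \377 inclusive.
n = 0o377 - 0o040 + 1
print(n)  # 224

# Two 2-byte BOMs (UTF-16BE, UTF-16LE) and one 3-byte BOM (UTF-8).
# The three events are mutually exclusive because the first bytes differ.
p_utf16 = 2 * Fraction(1, n**2)
p_utf8 = Fraction(1, n**3)
p = p_utf16 + p_utf8

print(p)                           # 449/11239424
print(round(1 / p))                # 25032
print(round(float(p) * 1e6, 1))    # parts per million, roughly 40
```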

I don't think that interpreting a text/plain document with a wrong 
character encoding can lead to exploits, because text/plain content 
isn't executed in any way.

>      4. Conclusion
>
> A small modification of browsers’ logic (namely, executing pp. 2., 3.
> conditionally for HTML media type(s) only) could fix the bug. But it
> would be helpful to have an official W3 statement, something like “don’t
> ignore Content-Type, and use Unicode sniffing for HTML only”, before
> submitting bug reports to developers.

I would just go ahead and submit this as a bug. However, I think it 
would be good to have a better example text. The current example text is 
just nonsense, as far as I understand.
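
To make the proposed change concrete, here is a minimal sketch of encoding selection that applies BOM sniffing only to HTML media types and otherwise trusts the charset parameter. It is not taken from any browser's code: the function name pick_encoding, the HTML_TYPES set, and the windows-1252 fallback are all assumptions made for illustration.

```python
# Byte order marks from the sniffing table, with Python codec names.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)

# Assumption: only text/html is subject to BOM sniffing in this sketch.
HTML_TYPES = {"text/html"}


def pick_encoding(content_type: str, body: bytes,
                  fallback: str = "windows-1252") -> str:
    """Choose an encoding from a Content-Type value and the body bytes."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # Extract the charset parameter, if any.
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            charset = value.strip().strip('"')

    # BOM sniffing overrides the header, but only for HTML media types.
    if media_type in HTML_TYPES:
        for bom, encoding in BOMS:
            if body.startswith(bom):
                return encoding

    return charset or fallback


body = b"\xff\xfe9&"
plain = pick_encoding("text/plain; charset=Windows-1251", body)
html = pick_encoding("text/html; charset=Windows-1251", body)
print(plain)  # Windows-1251
print(html)   # utf-16-le
```

With this gating, the text/plain test case is decoded as declared, while the same bytes served as text/html would still be sniffed as UTF-16LE.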

Regards,   Martin.

> Regards, Incnis Mrsi
>
>
> .
>
Received on Friday, 25 September 2015 05:53:33 UTC