Re: May HTML5 mandate interpretation of text/plain? from Incnis Mrsi on 2015-09-25 (www-international@w3.org from July to September 2015)

From: Incnis Mrsi <browser@superstructure.info>
Date: Sat, 26 Sep 2015 01:33:33 +0300
To: www-international@w3.org
Message-ID: <HTML5vsTextPlain.cmt1.browser@superstructure.info>

On 2015/09/25, Martin J. Dürst wrote:
> Even without the BOM, it is still fairly reliable. For some details, 
> please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf.
Ironically, I (Incnis Mrsi) was _already_ aware of the UTF-related probabilistic considerations;
see https://en.wikipedia.org/w/index.php?title=UTF-8&diff=475553555&oldid=475547847
and https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_3#Three-_and_four-byte_sequences
But these theoretical estimates have nothing to do with the modus operandi
of modern browsers such as Firefox 40, which try to read as UTF-8
_any_ octet stream starting with «\357\273\277» (octets EF BB BF, the UTF-8 BOM),
even if it’s *obviously not* UTF-8.
See below for links to test cases.
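
The gap between honouring the declared charset and letting a leading «\357\273\277» override it can be sketched in Python. This is only an illustration under stated assumptions, not Firefox's actual code: the function name sniff_then_decode and the sample text are invented for the example, and the BOM-first behaviour is modelled simply as described above.

```python
# Hypothetical payload: Windows-1251 text that happens to begin with "п»ї".
declared_charset = "cp1251"
body = "п»їПривет".encode(declared_charset)

# In Windows-1251 these three characters are the octets EF BB BF,
# i.e. exactly the UTF-8 BOM «\357\273\277».
assert body[:3] == b"\xef\xbb\xbf"


def sniff_then_decode(data: bytes, declared: str) -> str:
    """Assumed BOM-first decoder: a leading EF BB BF overrides `declared`."""
    if data.startswith(b"\xef\xbb\xbf"):
        # The rest is forced through UTF-8; invalid sequences become U+FFFD.
        return data[3:].decode("utf-8", errors="replace")
    return data.decode(declared)


honoured = body.decode(declared_charset)              # what the server intended
sniffed = sniff_then_decode(body, declared_charset)   # what the reader gets

print(repr(honoured))
print(repr(sniffed))
```

The first line printed is the intended text; the second contains only replacement characters, because the Cyrillic octets that follow are not valid UTF-8.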

> > browsers forget to check whether document type is HTML. 
> > Or, possibly, don’t care about Content-Type at all, 
> > that may open even broader entrance for glitches and exploits.
> I don't think that interpreting a text/plain document with a wrong 
> character encoding can lead to exploits, because text/plain content 
> isn't executed in any way.

First of all, my main concern is the blatant disregard for Content-Type *altogether.*
If we can’t trust browsers anymore in this respect,
then virtually any document served by an httpd may become harmful.
Also, there may be “text/” subtypes more dangerous than “plain”.

Second, text/plain isn’t executed (right), but it can lead to exploits anyway.
Imagine a content management system that allows
multiple users to submit their content as text/plain, composes it in some way,
and serves the result as text/plain too.
Such a UI need not be backward, since it can use frames (see below, again).
Suppose the webmaster fixed the code page to something like Windows-1251,
which inherently filters out such things as RTL marks, invisible non-joiners, and other
potentially glitchy content.

Then any user who can inject his/her line of text at the beginning of the document
can just submit something starting with “яю” (octets FF FE in Windows-1251, i.e. the UTF-16LE BOM), and
all the text becomes mojibake for users of popular browsers.
Attention: not only the chunk submitted by this user,
but the entire text/plain document composed by this hypothetical CMS.
If exploits are only things that grant shell access on the server,
then it isn’t an exploit, sure ☺
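
A minimal sketch of this scenario in Python, under stated assumptions: compose stands in for the hypothetical CMS, bom_sniffing_decode models a reader that lets a leading FF FE override the declared charset, and the sample entries are invented. None of these names come from a real system.

```python
CHARSET = "cp1251"  # code page fixed by the hypothetical webmaster


def compose(chunks):
    """Hypothetical CMS step: join user submissions into one text/plain body."""
    return "\n".join(chunks).encode(CHARSET)


def bom_sniffing_decode(data: bytes, declared: str) -> str:
    """Assumed reader behaviour: a leading FF FE is taken as a UTF-16LE BOM."""
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le", errors="replace")
    return data.decode(declared)


benign = compose(["Первая запись", "Вторая запись"])
# The first contributor starts his/her line with "яю", i.e. octets FF FE.
poisoned = compose(["яю", "Первая запись", "Вторая запись"])

print(bom_sniffing_decode(benign, CHARSET))
print(repr(bom_sniffing_decode(poisoned, CHARSET)))
```

The two added characters are perfectly legal Windows-1251, so a filter on the CMS side that only checks the code page would not reject them; yet every other contributor's entry is garbled as well.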

> However, I think it would be good to have a better example text.

Not only an example text: there is now a small but functioning Web application
serving a reasonable purpose, at http://www.superstructure.info/browser/compromised/toxic-sniffing.html#better
If some guy for some reason wanted to list a codepage’s characters 
from the last one (and without spaces), i.e. «\377\376\375\374…», 
then he would immediately run into this bug and see 
an iframe full of PUA rectangles and other mojibake.
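
Why such a listing ends up in the Private Use Area can be checked with a few lines of Python. The sketch assumes the listing covers octets 0xFF down to 0x80 and that the reader treats a leading FF FE as a UTF-16LE BOM; the function name is hypothetical, and what a particular browser actually renders may differ.

```python
# Octets 0xFF down to 0x80: the upper half of an 8-bit code page, last character first.
listing = bytes(range(0xFF, 0x7F, -1))


def bom_sniffing_decode(data: bytes) -> str:
    """Assumed reader behaviour: a leading FF FE makes the rest UTF-16LE."""
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le", errors="replace")
    return data.decode("cp1251", errors="replace")


shown = bom_sniffing_decode(listing)

# Pairs such as F9 F8 become U+F8F9, which lies in the PUA (U+E000..U+F8FF);
# pairs that form lone surrogates are replaced with U+FFFD.
pua = [c for c in shown if 0xE000 <= ord(c) <= 0xF8FF]
broken = shown.count("\ufffd")
print(f"{len(shown)} characters shown, {len(pua)} in the PUA, {broken} replaced")
```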

Regards, Incnis Mrsi

Received on Friday, 25 September 2015 22:33:32 UTC