- From: Incnis Mrsi <browser@superstructure.info>
- Date: Sat, 26 Sep 2015 01:33:33 +0300
- To: www-international@w3.org
On 2015/09/25, Martin J. Dürst wrote:

> Even without the BOM, its is still fairly reliable. For some details,
> please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf.

Ironically, I (Incnis Mrsi) was _already_ aware of UTF-related probabilistic considerations; see
https://en.wikipedia.org/w/index.php?title=UTF-8&diff=475553555&oldid=475547847
and
https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_3#Three-_and_four-byte_sequences

But these theoretical estimates have nothing to do with the modus operandi of such modern browsers as Firefox 40, which try to read as UTF-8 _any_ octet stream starting with «\357\273\277», even if it’s *obviously not* UTF-8. See below for links to test cases.

> > browsers forget to check whether document type is HTML.
> > Or, possibly, don’t care about Content-Type at all,
> > that may open even broader entrance for glitches and exploits.

> I don't think that interpreting a text/plain document with a wrong
> character encoding can lead to exploits, because text/plain content
> isn't executed in any way.

First of all, my main concern is the blatant disrespect for Content-Type *at all.* If we can’t trust browsers anymore in this respect, then virtually any document served by an httpd may become harmful. Also, there are possibly “text/” subtypes more dangerous than “plain”.

Second, text/plain isn’t executed (right), but it can lead to exploits anyway. Imagine a content management system that allows multiple users to submit their content in text/plain, composes it in some way, and serves it also in text/plain. This is not necessarily a backward UI, since it can use frames (see below, again). Suppose the webmaster fixed the code page to something like Windows-1251, which inherently filters out such things as RTL marks, invisible non-joiners, and other potentially glitchy content. Then any user who can inject his/her line of text at the beginning of the text can just submit something starting with “яю” (octets «\377\376» in Windows-1251, i.e. the UTF-16LE BOM), and all the text becomes mojibake for users of popular browsers. Attention: not only the chunk submitted by this user, but the entire text/plain document composed by this hypothetical CMS. If exploits are only things that grant shell access on the server, then it isn’t an exploit, sure ☺

> However, I think it would be good to have a better example text.

Not only an example text, but now a small but functioning Web application, serving a reasonable purpose, at
http://www.superstructure.info/browser/compromised/toxic-sniffing.html#better

If some guy for some reason wanted to list a code page’s characters starting from the last one (and without spaces), i.e. «\377\376\375\374…», then he would immediately run into this bug and see an iframe full of PUA rectangles and other mojibake.

Regards,
Incnis Mrsi
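
The byte-level claims above can be checked with a short Python sketch. It is only an illustration of "a leading BOM overrides the declared charset", not the actual logic of Firefox or any other browser; the function names `sniff_bom` and `browser_like_decode` are hypothetical, and the sample text "Привет" is an arbitrary choice.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF, i.e. \357\273\277
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE, i.e. \377\376
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


def browser_like_decode(data: bytes, declared: str) -> str:
    """Assumed behaviour: a sniffed BOM wins over the declared charset."""
    encoding = sniff_bom(data) or declared
    return data.decode(encoding, errors="replace")


# A line starting with "яю", as a Windows-1251 CMS would store it.
payload = "яю".encode("cp1251") + "Привет".encode("cp1251")

# A code page listed from its last character downwards: FF FE FD FC ...
descending = bytes(range(0xFF, 0x7F, -1))

print(sniff_bom(payload))                       # utf-16-le
print(sniff_bom(descending))                    # utf-16-le
print(payload.decode("cp1251"))                 # what the server intended
print(repr(browser_like_decode(payload, "cp1251")))  # what a BOM-sniffing reader sees
```

Under these assumptions the declared Windows-1251 charset is never consulted once the first two octets match a BOM, which is why the whole composed document is affected and not just the injected line.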
Received on Friday, 25 September 2015 22:33:32 UTC