- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Thu, 15 Jul 2010 11:08:25 -0700
- To: Maciej Stachowiak <mjs@apple.com>
- Cc: HTMLWG WG <public-html@w3.org>, Sam Ruby <rubys@intertwingly.net>
I have updated my counterproposal for Issue 100. As before, it can be found in a convenient HTML format at http://www.xanthir.com/:4k or viewed in plaintext below:

Issue 100 Counter-Proposal
==========================

Summary
-------

There is no problem, and no change should be made to the spec.

Rationale
---------

There are multiple uses for inserting user-provided content into a page, with notable examples being blog comments, social network updates, and wiki pages. A naive implementation of this feature exposes users to the risk of being attacked by malicious users inserting, for example, `<script>` tags linking to information-stealing scripts. Because of this, a multitude of "HTML Sanitizers" have been created to "clean" user-provided content and make it safe to display to other users. However, these sanitizers are often incomplete or buggy, as there are many unexpected dark corners of HTML and script parsing that the authors of the sanitizers are often not aware of.

HTML5 provides a particular defense against some of these types of attacks via the @sandbox attribute on the `<iframe>` element. With @sandbox, authors can selectively disable scripts, force the document to run in a unique origin, and other things that are useful for securing the content. Using this ability for the aforementioned use-cases is very attractive, but using an `<iframe>` is not - incurring an additional network request for every comment on a blog, for example, would produce an unacceptable delay in a page.

On the list, several suggestions were made for ways to securely embed user-provided content directly into a page and then benefit from the sandbox security model:

1. A `<sandbox>` tag. This fails because an attacker could easily embed `</sandbox>` in their content to break out of the sandbox, and it is non-trivial to escape all syntactic variations of `</sandbox>` within content. Further, if they do this incorrectly, they won't know until someone attacks them.

2. 
A `<sandbox>` tag with a @length attribute, giving the expected length of the content. This fails because differences in encodings (which most authors are not aware of, and many programming languages don't treat sanely either) can easily result in under/over-estimating the length, which can then be exploited by an attacker to push code out of the sandbox. Further, if they do this incorrectly, they won't know until someone attacks them (probably - depending on the exact details it may fail more often, but given that UTF-8 and ASCII coincide, it is likely that most English pages won't fail even with gross handling mistakes until they get attacked).

3. A `<sandbox>` tag with a @guard attribute, which contains a reasonably long random string, which is then repeated on `</sandbox>`. This fails because adding attributes to end tags is unprecedented and won't ever work in XML (though there are ways around this). More importantly, though, most authors are *horrible* at randomness. Long experience shows that a large percentage of authors will copy-paste the random string used in examples in a tutorial, completely defeating the feature. Another large percentage will likely use a random generator that is too weak, or do something that looks "random enough" like hashing the current timestamp. Further, if they do this incorrectly, they won't know until someone attacks them.

(This and the previous two also have bad fallback - losing all security in legacy browsers - unless you take additional measures which present authors with more places to get things wrong.)

4. A `<sandbox>` tag that contains the user-provided content base64-encoded, or otherwise encoded in a way that is completely safe for embedding in an attribute and which can't possibly be interpreted as valid code by a legacy browser. This is an unsatisfactory solution as base64 encoding increases the size of the content considerably. 
Further, this renders the content *completely* opaque to a casual inspection, which is an antipattern for the web.

5. An `<iframe>` tag with a data: url in the @src attribute containing the user-provided content. This proposal is unsatisfactory as the escaping requirements of data: urls are non-trivial. Most languages intended for web use will provide an appropriate escaping function, but it is easy to use a lesser escaping function that appears to work in simple cases. For example, PHP provides the urlencode() function, which is *not* fully correct for data: url escaping, but will often work - one must realize that there is a second url escaping function, rawurlencode(), and that it is better for this.

The @srcdoc suggestion was offered as an improvement over all of these proposals. It is an additional attribute on `<iframe>`. It is roughly similar to the data: url suggestion, with several improvements. Namely, the escaping requirements become trivial - for security purposes, the author only has to escape either " or ', whichever they are using as their attribute quote character. Escaping & is also necessary, but not for security purposes; leaving it off will just occasionally slightly malform the content as bits of content get interpreted as named character escapes. Further, as @srcdoc's entire reason for existence is to be used with @sandbox, it is nearly certain to be implemented only when @sandbox is already implemented, whereas data: urls are usable in legacy browsers that do not implement the sandbox security model, possibly exposing users of the legacy browsers to attack. As well, when @srcdoc is used, @src is still available to be used to deliver a message to legacy user agents.

Several rationales are given in the Issue 100 Change Proposal for removing @srcdoc:

1. @srcdoc doesn't provide adequate protection
2. @srcdoc escaping requirements are difficult
3. @srcdoc has bad fallback
4. There are existing alternatives to @srcdoc
5. 
@srcdoc is unneeded by the blogging community

### @srcdoc doesn't provide adequate protection ###

This objection is irrelevant for multiple reasons.

First, this is not an objection against @srcdoc. @srcdoc is a convenient way to get content to interact with the sandbox security model, nothing more. If the sandbox security model doesn't provide adequate protection, failures should be raised as bugs against it specifically. Changing or removing @srcdoc will have no effect on the reliability of the sandbox security model.

Second, the types of things that were listed as not being protected against, such as SQL injection, are entirely outside the scope of HTML. **No technology within HTML can possibly address them.** Preventing an injection attack against your database, for example, is the responsibility of the database itself, or of the language interfacing with that database.

### @srcdoc escaping requirements are difficult ###

When used in an HTML page, the escaping requirements for @srcdoc are trivial. You have to replace " with `&quot;` and & with `&amp;`. The latter has no security implications if it's forgotten; it's merely to prevent words following an & from accidentally being interpreted as entity references. The former is important for security, but it should also fail very quickly and very obviously if it is left out - the very first post containing an unescaped " (and thus truncating itself and dumping the rest of the contents into the element's tag directly) will make it painfully obvious both that there is a problem and how to solve it.

When used in an XHTML page, the escaping requirements may be slightly more involved. If so, then it is a weakness of XML, not of @srcdoc. In any case, Issue 103 apparently resolves the issue adequately, by specifying exactly what additional characters need to be escaped for @srcdoc to be safely used in XML. 
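As an illustration of the HTML-syntax escaping described above, here is a minimal sketch in Python. The function names `escape_srcdoc` and `sandboxed_iframe` are hypothetical, and the sketch assumes the attribute value is always wrapped in double quotes; it does not attempt to cover the additional characters needed for XML.

```python
def escape_srcdoc(content: str) -> str:
    """Escape untrusted markup for use inside a double-quoted srcdoc attribute.

    Assumption: HTML syntax only, and the attribute is quoted with ".
    """
    # Escape & first, so the & introduced by &quot; is not escaped a second time.
    return content.replace("&", "&amp;").replace('"', "&quot;")


def sandboxed_iframe(content: str) -> str:
    """Wrap untrusted content in a sandboxed <iframe> using srcdoc (illustrative helper)."""
    return '<iframe sandbox srcdoc="' + escape_srcdoc(content) + '"></iframe>'


if __name__ == "__main__":
    comment = '<p>He said "hi" & left</p>'
    print(sandboxed_iframe(comment))
```

The order of the two replacements matters: escaping " before & would turn `&quot;` into `&amp;quot;`, which displays incorrectly but does not reopen the attribute.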
(Note: I'm not sure how many, if any, of these additional characters are necessary to escape for security purposes, and how many just need to be escaped to ensure adequate display of the content.)

### @srcdoc has bad fallback ###

This objection has multiple levels.

First, in a browser which doesn't understand @srcdoc at all, the `<iframe>`'s @src attribute is instead used to obtain the contents for the frame. This is, in general, good fallback behavior - @srcdoc is intended to be identical to using @src, just without the additional network request.

The second level, again, has to do with the sandbox security model itself, and thus has nothing to do with @srcdoc itself. In browsers which also don't understand @sandbox, the @src fallback will execute in an un-sandboxed environment. As well, the entire sandbox security model can be bypassed if the attacker can have the user visit the content's URL directly. This is valid. There are two possible ways around it:

1. Don't fall back at all - have the document pointed to by @src be an author-generated message that the browser the user is using doesn't support secure content.

2. Use the text/html-sandboxed MIME type to serve the document pointed to by @src. This will fail in the proper way (the page will not be displayed at all) in legacy browsers that don't understand @sandbox. In newer browsers that understand @sandbox but not @srcdoc, or when the user visits the url of the content directly in a browser that understands @sandbox, the page will be displayed with the sandbox security model in place.

### There are existing alternatives to @srcdoc ###

There were many alternatives proposed to @srcdoc in the discussion threads surrounding and preceding it. The one that is most promising is to simply use a data: url in @src. This has a few problems that make it inferior to @srcdoc:

1. data: urls have more complex escaping requirements than @srcdoc. 
All major web languages do provide an escaping function appropriate for urls, but it is easy to accidentally choose the wrong function. For example, in PHP the correct function to use is rawurlencode(), but the function urlencode() may be accidentally used instead. In addition, despite both of these functions existing in PHP, multiple homebrew url-escaping functions can be found across the web, which may not escape everything that is necessary to escape. Some of these lapses may result in non-obvious security holes that can be exploited by attackers, allowing arbitrary code injection into a web page.

2. In legacy browsers, data: urls will "fail open"; that is, they will display their contents even if the browser does not understand the sandbox security model, potentially exposing users to attack. This can be mitigated by specifying a text/html-sandboxed mime type in the data: url, however.

3. As the data: url would be used in @src, there is no capability to fall back to another message if the browser does not understand the sandbox security model.

4. data: urls are usually interpreted to be a unique origin by default, for security. It is possible that the `allow-same-origin` flag in @sandbox could be used to indicate that the data: url should be given the same origin as the outer page, but this would further complicate the already-confusing rules about when a data: url is same-origin and when it is unique-origin.

### @srcdoc is unneeded by the blogging community ###

The creator of Wordpress, Matt Mullenweg, was asked about the need for @srcdoc in the Wordpress software. He responded that Wordpress maintains a sanitation library that appears to work adequately. This is, again, not an argument against @srcdoc, it is an argument against the sandbox security model.

### Summary ###

Most of the objections listed in the Change Proposal were completely irrelevant to the actual issue. They are concerns with the sandbox security model itself. 
@srcdoc is merely a convenient way to opt-in to the sandbox security model without incurring a network request each time. The objection concerning escaping requirements appears to be answered adequately by the Issue 103 change proposal. The objection concerning fallback is invalid, given the addition of the text/html-sandboxed MIME type. The objection concerning alternate solutions has been shown to be incorrect, as the best alternative solution, data: urls in @src, is still inferior to @srcdoc on several points.

Details
-------

No change is made to the spec.

Impact
------

### Positive

* Authors are able to utilize the sandbox security model provided by `<iframe>`s without incurring the cost of multiple network requests.
* @srcdoc offers the simplest, hardest-to-misuse model for embedding untrusted content into a webpage.

### Negative

* As with all new elements and attributes, implementing this requires effort from implementors.

~TJ
Received on Thursday, 15 July 2010 18:09:21 UTC