Re: Issue 100 Zero-Edits Counter Proposal from Tab Atkins Jr. on 2010-07-15 (public-html@w3.org from July 2010)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 15 Jul 2010 11:08:25 -0700
To: Maciej Stachowiak <mjs@apple.com>
Cc: HTMLWG WG <public-html@w3.org>, Sam Ruby <rubys@intertwingly.net>
Message-ID: <AANLkTimkbFW8yhcrVf3F8hTXr38kVGnS_QRqyu-LfuEs@mail.gmail.com>

I have updated my counterproposal for Issue 100.  As before, it can be
found in a convenient HTML format at http://www.xanthir.com/:4k or
viewed in plaintext below:

Issue 100 Counter-Proposal
==========================

Summary
-------

There is no problem, and no change should be made to the spec.

Rationale
---------

There are multiple uses for inserting user-provided content into a
page, with notable examples being blog comments, social network
updates, and wiki pages.  A naive implementation of this feature
exposes users to the risk of being attacked by malicious users
inserting, for example, `<script>` tags linking to
information-stealing scripts.  Because of this, a multitude of "HTML
Sanitizers" have been created to "clean" user-provided content and
make it safe to display to other users.  However, these sanitizers are
often incomplete or buggy, as there are many unexpected dark corners
of HTML and script parsing that the authors of the sanitizers are
often not aware of.

HTML5 provides a particular defense against some of these types of
attacks via the @sandbox attribute on the `<iframe>` element.  With
@sandbox, authors can selectively disable scripts, force the document
to run in a unique origin, and do other things that are useful for
securing the content.  Using this ability for the aforementioned
use-cases is very attractive, but using an `<iframe>` is not -
incurring an additional network request for every comment on a blog,
for example, would produce an unacceptable delay in a page.

On the list, several suggestions were made for ways to securely embed
user-provided content directly into a page and then benefit from the
sandbox security model:

1. A `<sandbox>` tag.  This fails because an attacker could easily
embed `</sandbox>` in their content to break out of the sandbox, and
it is non-trivial to escape all syntactic variations of `</sandbox>`
within content.  Further, if authors do this incorrectly, they won't
know until someone attacks them.

2. A `<sandbox>` tag with a @length attribute, giving the expected
length of the content.  This fails because differences in encodings
(which most authors are not aware of, and many programming languages
don't treat sanely either) can easily result in under/over-estimating
the length, which can then be exploited by an attacker to push code
out of the sandbox.  Further, if they do this incorrectly, they won't
know until someone attacks them (probably - depending on the exact
details it may fail more often, but given that UTF-8 and ASCII
coincide, it is likely that most English pages won't fail even
with gross handling mistakes until they get attacked).

3. A `<sandbox>` tag with a @guard attribute, which contains a
reasonably long random string, which is then repeated on `</sandbox>`.
 This fails because adding attributes to end tags is unprecedented and
won't ever work in XML (though there are ways around this).  More
importantly, though, most authors are *horrible* at randomness.  Long
experience shows that a large percentage of authors will copy-paste
the random string used in examples in a tutorial, completely defeating
the feature.  Another large percentage will likely use a random
generator that is too weak, or do something that looks "random enough"
like hashing the current timestamp.  Further, if they do this
incorrectly, they won't know until someone attacks them.  (This and
the previous two also have bad fallback - losing all security in
legacy browsers - unless you take additional measures which present
authors with more places to get things wrong.)

4. A `<sandbox>` tag that contains the user-provided content
base64-encoded, or otherwise encoded in a way that is completely safe
for embedding in an attribute and which can't possibly be interpreted
as valid code by a legacy browser.  This is an unsatisfactory solution
as base64 encoding increases the size of the content considerably.
Further, this renders the content *completely* opaque to a casual
inspection, which is an antipattern for the web.

5. An `<iframe>` tag with a data: url in the @src attribute containing
the user-provided content.  This proposal is unsatisfactory as the
escaping requirements of data: urls are non-trivial.  Most languages
intended for web use will provide an appropriate escaping function,
but it is easy to use a lesser escaping function that appears to work
in simple cases.  For example, PHP provides the urlencode() function,
which is *not* fully correct for data: url escaping, but will often
work - one must realize that there is a second url escaping function,
rawurlencode(), and that it is better for this.
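
The length problem in suggestion 2 can be illustrated with a small
Python sketch.  The sample strings are made up for illustration; the
point is only that "length" differs depending on whether characters or
encoded bytes are counted, and that the difference is invisible for
pure-ASCII text.

```python
# Hypothetical comment text containing non-ASCII characters
# (written with escapes so the exact code points are unambiguous).
comment = "na\u00efve caf\u00e9 \u2603"

char_count = len(comment)                   # number of Unicode code points
utf8_bytes = len(comment.encode("utf-8"))   # number of bytes once UTF-8 encoded

# For pure-ASCII text the two counts agree, so a mistake goes unnoticed.
ascii_comment = "plain english text"
ascii_chars = len(ascii_comment)
ascii_bytes = len(ascii_comment.encode("utf-8"))

print(char_count, utf8_bytes)    # the two counts differ
print(ascii_chars, ascii_bytes)  # the two counts are equal
```

An author who counts one way while the browser counts the other would
be off by the difference, and only for content that an attacker is free
to choose.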

The @srcdoc suggestion was offered as an improvement over all of these
proposals.  It is an additional attribute on `<iframe>`.  It is
roughly similar to the data: url suggestion, with several
improvements.  Namely, the escaping requirements become trivial - for
security purposes, the author only has to escape either " or ',
whichever they are using as their attribute quote character.  Escaping
& is also necessary, but not for security purposes; leaving it off
will just occasionally slightly malform the content as bits of content
get interpreted as named character escapes.  Further, as @srcdoc's
entire reason for existence is to be used with @sandbox, it is nearly
certain to be implemented only when @sandbox is already implemented,
whereas data: urls are usable in legacy browsers that do not implement
the sandbox security model, possibly exposing users of the legacy
browsers to attack.  As well, when @srcdoc is used @src is still
available to be used to deliver a message to legacy user agents.

Several rationales are given in the Issue 100 Change Proposal for
removing @srcdoc:

1. @srcdoc doesn't provide adequate protection

2. @srcdoc escaping requirements are difficult

3. @srcdoc has bad fallback

4. There are existing alternatives to @srcdoc

5. @srcdoc is unneeded by the blogging community

### @srcdoc doesn't provide adequate protection ###

This objection is irrelevant for multiple reasons.

First, this is not an objection against @srcdoc.  @srcdoc is a
convenient way to get content to interact with the sandbox security
model, nothing more.  If the sandbox security model doesn't provide
adequate protection, failures should be raised as bugs against it
specifically.  Changing or removing @srcdoc will have no effect on the
reliability of the sandbox security model.

Second, the types of things that were listed as not being protected
against, such as SQL injection, are entirely outside the scope of
HTML.  **No technology within HTML can possibly address them.**
Preventing an injection attack against your database, for example, is
the responsibility of the database itself, or of the language
interfacing with that database.

### @srcdoc escaping requirements are difficult ###

When used in an HTML page, the escaping requirements for @srcdoc are
trivial.  You have to replace " with `&quot;` and & with `&amp;`.  The latter
has no security implications if it's forgotten; it's merely to prevent
words following an & from accidentally being interpreted as entity
references.  The former is important for security, but it should also
fail very quickly and very obviously if it is left out - the very
first post containing an unescaped " (and thus truncating itself and
dumping the rest of the contents into the element's tag directly) will
make it painfully obvious both that there is a problem and how to
solve it.
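
As a minimal sketch of the escaping described above, assuming a
double-quoted attribute value: the helper names `escape_srcdoc`,
`sandboxed_iframe`, and `read_back` are hypothetical, and Python's
standard `html.parser` stands in for a browser's attribute parsing only
to check that the value round-trips.

```python
from html.parser import HTMLParser


def escape_srcdoc(content: str) -> str:
    """Escape content for use inside a double-quoted srcdoc attribute."""
    # Escape & first so the &quot; produced next is not escaped again.
    return content.replace("&", "&amp;").replace('"', "&quot;")


def sandboxed_iframe(content: str) -> str:
    """Wrap untrusted content in a sandboxed iframe via srcdoc."""
    return '<iframe sandbox srcdoc="' + escape_srcdoc(content) + '"></iframe>'


class _SrcdocReader(HTMLParser):
    """Records the decoded srcdoc value of the first iframe it sees."""

    def __init__(self):
        super().__init__()
        self.srcdoc = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.srcdoc is None:
            self.srcdoc = dict(attrs).get("srcdoc")


def read_back(markup: str):
    """Parse markup and return the srcdoc value a parser would recover."""
    reader = _SrcdocReader()
    reader.feed(markup)
    reader.close()
    return reader.srcdoc


comment = 'He said "hi" & left <script>alert(1)</script>'
markup = sandboxed_iframe(comment)
print(markup)
```

Note that `<` and `>` are left alone: inside a quoted attribute value
they cannot end the attribute, which is why only the quote character
matters for security here.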

When used in an XHTML page, the escaping requirements may be slightly
more involved.  If so, then it is a weakness of XML, not of @srcdoc.
In any case, Issue 103 apparently resolves the issue adequately, by
specifying exactly what additional characters need to be escaped for
@srcdoc to be safely used in XML.  (Note: I'm not sure how many, if
any, of these additional characters are necessary to escape for
security purposes, and how many just need to be escaped to ensure
adequate display of the content.)

### @srcdoc has bad fallback ###

This objection has multiple levels.

First, in a browser which doesn't understand @srcdoc at all, the
`<iframe>`'s @src attribute is instead used to obtain the contents for
the frame.  This is, in general, good fallback behavior - @srcdoc is
intended to be identical to using @src, just without the additional
network request.

The second level, again, has to do with the sandbox security model
itself, and thus has nothing to do with @srcdoc itself.  In browsers
which also don't understand @sandbox, the @src fallback will execute
in an un-sandboxed environment.  As well, the entire sandbox security
model can be bypassed if the attacker can have the user visit the
content's URL directly.  This concern is valid.  There are two possible ways
around it:

1. Don't fall back at all - have the document pointed to by @src be an
author-generated message that the browser the user is using doesn't
support secure content.

2. Use the text/html-sandboxed mime type to serve the document pointed
to by @src.  This will fail in the proper way (the page will not be
displayed at all) in legacy browsers that don't understand @sandbox.
In newer browsers that understand @sandbox but not @srcdoc, or when
the user visits the url of the content directly in a browser that
understands @sandbox, the page will be displayed with the sandbox
security model in place.
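
The second option might look roughly like the following minimal WSGI
application.  This is a sketch only: the function name and the sample
document are hypothetical, and the MIME type is the one named in this
proposal.

```python
def sandboxed_comment_app(environ, start_response):
    """Serve a user-provided fragment using the sandboxed MIME type."""
    # Placeholder body; a real application would look the content up
    # based on the request described by `environ`.
    body = "<p>user-provided comment goes here</p>".encode("utf-8")
    headers = [
        # Per the proposal, browsers that do not understand @sandbox
        # are expected not to render this type as HTML.
        ("Content-Type", "text/html-sandboxed"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The URL served this way would then be the one placed in the
`<iframe>`'s @src as the fallback for @srcdoc.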

### There are existing alternatives to @srcdoc ###

There were many alternatives proposed to @srcdoc in the discussion
threads surrounding and preceding it.  The one that is most promising
is to simply use a data: url in @src.  This has a few problems that
make it inferior to @srcdoc:

1. data: urls have more complex escaping requirements than @srcdoc.
All major web languages do provide an escaping function appropriate
for urls, but it is easy to accidentally choose the wrong function.
For example, in PHP the correct function to use is rawurlencode(), but
the function urlencode() may be accidentally used instead.  In
addition, despite both of these functions existing in PHP, multiple
homebrew url-escaping functions can be found across the web, which may
not escape everything that is necessary to escape.  Some of these
lapses may result in non-obvious security holes that can be exploited
by attackers, allowing arbitrary code injection into a web page.

2. In legacy browsers, data: urls will "fail open"; that is, they will
display their contents even if the browser does not understand the
sandbox security model, potentially exposing users to attack.  This
can be mitigated by specifying a text/html-sandboxed mime type in the
data: url, however.

3. As the data: url would be used in @src, there is no capability to
fall back to another message if the browser does not understand the
sandbox security model.

4. data: urls are usually interpreted to be a unique origin by
default, for security.  It is possible that the `allow-same-origin`
flag in @sandbox could be used to indicate that the data: url should
be given the same origin as the outer page, but this would further
complicate the already-confusing rules about when a data: url is
same-origin and when it is unique-origin.
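
The first point can be illustrated in Python, whose standard library
has a similar pair of functions: `urllib.parse.quote` encodes a space
as `%20` (as PHP's rawurlencode() does), while
`urllib.parse.quote_plus` encodes it as `+` (as PHP's urlencode()
does).  The sample content and the `payload` helper are made up for
this sketch.

```python
from urllib.parse import quote, quote_plus, unquote

content = '<p>1 + 1 = 2 & "done"</p>'

# Percent-encode everything outside the unreserved set; spaces become %20.
good = "data:text/html," + quote(content, safe="")

# Form-style encoding: spaces become "+", which plain percent-decoding
# leaves as a literal plus sign.
bad = "data:text/html," + quote_plus(content)


def payload(url: str) -> str:
    """Percent-decode the part of a data: URL after the first comma."""
    return unquote(url.split(",", 1)[1])


print(payload(good))  # identical to the original content
print(payload(bad))   # spaces have turned into plus signs
```

The wrong function here only garbles the display, which is exactly why
it "appears to work" and is easy to ship by accident.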

### @srcdoc is unneeded by the blogging community ###

The creator of Wordpress, Matt Mullenweg, was asked about the need for
@srcdoc in the Wordpress software.  He responded that Wordpress
maintains a sanitization library that appears to work adequately.

This is, again, not an argument against @srcdoc; it is an argument
against the sandbox security model.

### Summary ###

Most of the objections listed in the Change Proposal were completely
irrelevant to the actual issue.  They are concerns with the sandbox
security model itself.  @srcdoc is merely a convenient way to opt in
to the sandbox security model without incurring a network request each
time.

The objection concerning escaping requirements appears to be answered
adequately by the Issue 103 change proposal.

The objection concerning fallback is invalid, given the addition of
the text/html-sandboxed MIME type.

The objection concerning alternate solutions has been shown to be
incorrect, as the best alternative solution, data: urls in @src, is
still inferior to @srcdoc on several points.

Details
-------

No change is made to the spec.

Impact
------

### Positive

* Authors are able to utilize the sandbox security model provided by
`<iframe>`s without incurring the cost of multiple network requests.
* @srcdoc offers the simplest, hardest-to-misuse model for embedding
untrusted content into a webpage.

### Negative

* As with all new elements and attributes, implementing this requires
effort from implementors.

~TJ
Received on Thursday, 15 July 2010 18:09:21 UTC