[whatwg] updateWithSanitizedHTML (was Re: innerStaticHTML) from Adam Barth on 2009-12-01 (public-whatwg-archive@w3.org from December 2009)

From: Adam Barth <whatwg@adambarth.com>
Date: Tue, 1 Dec 2009 00:14:09 -0800
Message-ID: <7789133a0912010014n5c9c599eseee250ec4c532243@mail.gmail.com>
Your main point is well taken.

There are some technical reasons why tag whitelisting makes more sense
for inline content.  For example, consider the case you mentioned on
webkit-dev: @id.  Inline, @id is problematic because the ids exist in
a per-frame namespace, whereas they're harmless when the untrusted
content has an entire iframe to itself.  Of course, @id is not unique
in this respect.  For example, <input type=password> will likely get
autofilled by the password manager inline and @style can be used to
draw all over the page without an iframe's layout constraints.
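
To make the inline case concrete, here is a rough Python sketch of
tag-level whitelisting, using the Slashdot-style feature string quoted
below.  It is not the proposed updateWithSanitizedHTML API and not
browser code: the names sanitize and AllowlistSanitizer and the
http/https scheme restriction are made up for illustration, and
Python's html.parser does not tokenize markup the way a browser does,
which is the usual weakness of filtering outside the browser.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical helper sets for this sketch only.
VOID_TAGS = {"img", "br", "hr"}   # elements that take no end tag
URL_ATTRS = {"href", "src"}       # attributes whose value is a URL


class AllowlistSanitizer(HTMLParser):
    """Re-serialize only whitelisted tags and attributes; escape all text."""

    def __init__(self, tags, attrs, schemes=("http", "https")):
        super().__init__(convert_charrefs=True)
        self.tags = set(tags.split())
        self.attrs = set(attrs.split())
        self.schemes = set(schemes)   # assumption: absolute http(s) URLs only
        self.out = []

    def _safe_attr(self, name, value):
        if name not in self.attrs or value is None:
            return False
        if name in URL_ATTRS:
            # Require an explicitly whitelisted scheme, so javascript:,
            # data: and relative URLs are all dropped.
            try:
                scheme = urlsplit(value.strip()).scheme.lower()
            except ValueError:
                return False
            return scheme in self.schemes
        return True

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return  # drop the tag itself; its text content is still escaped
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if self._safe_attr(name, value)
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.tags and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html, tags, attrs):
    """Return html with everything outside the two whitelists removed."""
    parser = AllowlistSanitizer(tags, attrs)
    parser.feed(html)
    parser.close()   # flush any buffered trailing text
    return "".join(parser.out)
```

With the whitelist ("a b strong i em", "href"), the input
<b id="x" style="color:red">hi</b> comes back as <b>hi</b>, so @id and
@style never reach the page, and <input type=password> is dropped
because it is not on the tag list.  The sketch does not balance
unclosed tags, so treat it as an illustration of the policy rather
than a usable filter.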

That said, I'm not married to a design with a tag-level whitelist.  Do
you have a specific alternative in mind?

Adam


On Mon, Nov 30, 2009 at 7:43 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>
> On Nov 30, 2009, at 6:32 PM, Adam Barth wrote:
>
>> On Mon, Nov 30, 2009 at 5:43 PM, Maciej Stachowiak <mjs at apple.com> wrote:
>>>
>>> 1) It seems like this API is harder to use than a sandboxed iframe. To
>>> use it correctly, you need to determine a whitelist of safe elements
>>> and attributes; providing an explicit whitelist at least of tags is
>>> mandatory. With a sandboxed iframe, as a Web developer you can just ask
>>> the browser to turn off unsafe things and not worry about designing a
>>> security policy. Besides ease of use, there is also the concern that a
>>> server-side filtering whitelist may be buggy, and if you apply the same
>>> whitelist on the client side as backup instead of doing something high
>>> level like "disable scripting" then you are less likely to benefit from
>>> defense in depth, since you may just replicate the bug.
>>
>> I should follow up with folks in the ruby-on-rails community to see
>> how they view their sanitize API.  The one person I asked had a
>> positive opinion, but we should get a bigger sample size.
>
> For server-side sanitization, this kind of explicit API is pretty much the
> only thing you can do.
>
>>
>> I think updateWithSanitizedHTML has different use cases than @sandbox.
>> I think the killer applications for @sandbox are advertisements and
>> gadgets.  In those cases, the developer wants most of the browser's
>> functionality, but wants to turn off some dangerous stuff (like
>> plug-ins).  For updateWithSanitizedHTML, the killer application is
>> something like blog comments, where you basically want text with some
>> formatting tags (bold, italics, and maybe images depending on the
>> forum).
>
> I can imagine use cases where allowing very open-ended but script-free
> content is desirable. For example, consider a hosted blog service that wants
> to let blog authors write nearly arbitrary HTML, but without allowing
> script. @sandbox would not be a good solution for that use case. In general
> it does not seem sensible to me that the choice of tag whitelisting vs
> high-level feature whitelisting is tied to the choice of embedding content
> directly vs. creating a frame. Is there a technical reason these two choices
> have to be tied?
>
>>
>>> 2) It seems like this API loses one of the big benefits of sanitizing
>>> HTML in the browser implementation. Specifically, in theory it's safe
>>> to say "allow everything except any construct that would result in
>>> script/code running". You can't do that on the server side -
>>> blacklisting is not sound because you can't predict the capabilities
>>> of all browsers. But the browser can predict its own capabilities.
>>> Sandboxed iframes do allow for this.
>>
>> The benefit is that you know you're getting the right parsing.  You're
>> not going to be tripped up by <img/src=javascript: and friends.
>
> It's true, this is a benefit. However, it seems like even if you whitelist
> tags, being able to say "no script" at a high level
>
>> Also, this API is useful in cases where you don't have a server to help
>> you sanitize your input.  One example I saw recently was a GreaseMonkey
>> script that wanted to add EXIF metadata to Flickr.  Basically, the
>> script grabbed the EXIF data from api.flickr.com and added it to the
>> current page.  Unfortunately, that meant I could use this GreaseMonkey
>> script to XSS Flickr by adding HTML to my EXIF metadata.  Sure, there
>> are other ways of solving the problem (I asked the developer to build
>> the DOM in memory and use innerText), but you want something simple
>> for these cases.
>
> If the EXIF metadata is supposed to be text-only, it seems like
> updateWithSanitizedHTML would not be easier to use than innerText, or in any
> way superior. For cases where it is actually desirable to allow some markup,
> it's not clear to me that giving explicit whitelists of what is allowed is
> the simple choice.
>
>>
>>> I think the benefits of filtering by tag/attribute/scheme for advanced
>>> experts are outweighed by these two disadvantages for basic use,
>>> compared to something simple like the original staticInnerHTML idea.
>>> Another possible alternative is to express how to sanitize at a higher
>>> level, using something similar to sandboxed iframe feature strings.
>>
>> If you think of @sandbox as being optimized for rich untrusted content
>> and updateWithSanitizedHTML as being optimized for poor untrusted
>> content, then you'll see that's what the API does already.  The
>> feature string Slashdot wants for its comments is ("a b strong i em",
>> "href"), but another message board might want something different.
>> For example, 4chan might want ("img", "src alt").  I don't think these
>> require particularly advanced experts to understand.
>
> updateWithSanitizedHTML and @sandbox both provide features that the other
> does not for reasons that do not seem technically necessary. For example,
> updateWithSanitizedHTML could easily have an "allow everything except
> script" mode, and @sandbox could easily allow per-tag whitelisting. Then the
> choice would be between the resource cost of a frame, and the sandboxing
> features that it's impractical to provide without a frame (limiting content
> to a bounding box while still allowing styling, allowing script without
> affecting the containing content, etc).
>
>>
>>> Here's a problem that exists with both this API and also innerStaticHTML:
>>>
>>> 3) There is no secure and efficient way to append sanitized contents to
>>> an element that already has children. This may result in authors
>>> appending with innerHTML += (inefficient and insecure!) or
>>> insertAdjacentHTML() (efficient but still insecure!). I'm willing to
>>> concede that use cases other than "replace existing contents" and
>>> "append to existing contents" are fairly exotic.
>>
>> Maybe we need insertAdjacentSanitizedHTML instead or in addition.  ;)
>
> Perhaps. The verb "update" is generic enough that it could handle different
> kinds of mutations with flags, but perhaps that means it is too vague for a
> security-sensitive API.
>
> Regards,
> Maciej
>
>
Received on Tuesday, 1 December 2009 00:14:09 UTC