[whatwg] Content Restrictions from Alexey Feldgendler on 2006-03-09 (public-whatwg-archive@w3.org from March 2006)

From: Alexey Feldgendler <alexey@feldgendler.ru>
Date: Thu, 09 Mar 2006 22:57:31 +0600
Message-ID: <op.s55n15dp1h6og4@pancake.feldgendler.ru>
On Mon, 06 Mar 2006 16:48:08 +0600, Gervase Markham <gerv at mozilla.org>  
wrote:

>> I never said that the website won't have to do HTML cleaning for
>> user-supplied content. But with HTML 5 reference parsing algorithm, such
>> cleaning is going to be much easier and straightforward: parse the text
>> into DOM (as if it was inside BODY, for example), remove or modify
>> forbidden elements, then serialize it. That way, </SANDBOX> will be
>> ignored as an easy parse error because it doesn't match an opening tag
>> within the user-supplied text. An unclosed comment will be ignored, too.

> Er, what defines "the user-supplied content"? Surely it's the <SANDBOX>
> tags? So how can you say "A </SANDBOX> inside the user-supplied content
> will be ignored", as you don't know whether a </SANDBOX> you encounter
> is the end of the sandbox or not?
>
> Or are you suggesting that only one sandbox per page is allowed, and the
> user agent should use the outermost </SANDBOX> tag?

My fault: I didn't make it clear enough. Here is the scenario I had in
mind.

Let's imagine a blogging website that allows anybody to create a blog
which is available as http://www.example.com/blogs/username/. Many such
sites allow various kinds of user customization, so imagine this site lets
the blog owner supply custom HTML to display at the top of the blog page.
This is primarily used by blog authors to design stylish navigation. To
make such navigation menus more attractive, the authors wish to use
JavaScript and Flash, but unrestricted JavaScript would make it possible
for the blog owner to steal visitors' session cookies.

The blog author logs in and opens some kind of customization screen:

HTML to display on top of your blog: [TEXTAREA]
[SUBMIT]

So, imagine the blog author enters into the textarea:

Welcome to my blog!</sandbox><a href="#"  
onclick="alert(document.cookie)">Click here</a>

After submission, this code is fed to the HTML cleaner. At present, HTML
cleaners are usually complicated scripts which try to catch known quirks
of the user agents, and security holes are still found in them one after
another. See for example
http://cvs.livejournal.org/browse.cgi/livejournal/cgi-bin/cleanhtml.pl.
With the HTML 5 parsing spec, there will be one single algorithm for
parsing HTML code, with well-defined error recovery. So, the HTML cleaner
on the server side runs the HTML 5 parser on the user-supplied text, which
produces the following DOM:

* Welcome to my blog!
* A
     href="#"
     onclick="alert(document.cookie)"
   * Click here

The </sandbox> tag is ignored as an easy parse error because there is no  
matching <sandbox> tag in the user-supplied text. After parsing, the HTML  
cleaner iterates through the tree, renaming potentially unsafe elements  
and attributes, producing the following:

* Welcome to my blog!
* A
     href="#"
     safe-onclick="alert(document.cookie)"
   * Click here

At the final stage, the HTML cleaner re-serializes the DOM into the  
following code, which is saved into the database:

Welcome to my blog!<a href="#" safe-onclick="alert(document.cookie)">Click here</a>
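
The parse / rename / re-serialize steps above can be sketched roughly as
follows. This is only an illustration under loud assumptions: it uses
Python's standard html.parser module, which is NOT an implementation of the
HTML 5 parsing algorithm, and the names EventAttributeCleaner, clean_html
and VOID_ELEMENTS are made up for this example. It renames only on*
attributes; a real cleaner would also have to deal with script elements,
javascript: URLs and so on.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (illustrative, not exhaustive).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class EventAttributeCleaner(HTMLParser):
    """Hypothetical cleaner: parse, rename on* attributes, re-serialize."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # serialized output fragments
        self.open_tags = []  # stack of elements opened by the user's text

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # Rename event handler attributes: onclick -> safe-onclick.
            if name.startswith("on"):
                name = "safe-" + name
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # An end tag with no matching start tag inside the user-supplied
        # text (such as a stray </sandbox>) is simply dropped.
        if tag not in self.open_tags:
            return
        # Otherwise close everything up to and including the match.
        while self.open_tags:
            closed = self.open_tags.pop()
            self.out.append("</%s>" % closed)
            if closed == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        pass  # comments are not carried over into the cleaned output


def clean_html(text):
    cleaner = EventAttributeCleaner()
    cleaner.feed(text)
    cleaner.close()
    # Close whatever the user left open so nothing leaks past the fragment.
    while cleaner.open_tags:
        cleaner.out.append("</%s>" % cleaner.open_tags.pop())
    return "".join(cleaner.out)
```

Feeding the textarea contents from the example to clean_html() should give
back the same markup without the stray </sandbox> and with onclick renamed
to safe-onclick. The stack of open tags is what makes the stray end tag
harmless: the cleaner only ever closes elements that the user-supplied text
itself opened.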

When the site renders the blog page, it puts the "HTML for page top"  
inside a sandbox:

<body>
<sandbox>
Welcome to my blog!<a href="#" safe-onclick="alert(document.cookie)">Click here</a>
</sandbox>
...
</body>

Each blog entry is probably also contained in its own sandbox. This is  
even more important on the so-called friends pages, where entries by  
different authors are displayed on the same page.
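
As a rough sketch of the rendering side, assuming the entries were already
cleaned before being stored, a friends page could wrap each entry in its
own sandbox like this. The function name render_friends_page is
hypothetical, and <sandbox> is of course only the element being proposed in
this thread, not something any user agent implements.

```python
def render_friends_page(cleaned_entries):
    """Wrap each already-cleaned entry in its own <sandbox> element.

    cleaned_entries: HTML fragments that went through the cleaner when they
    were submitted, so none of them can contain a stray </sandbox>.
    """
    sandboxes = "\n".join(
        "<sandbox>\n%s\n</sandbox>" % entry for entry in cleaned_entries
    )
    return "<body>\n%s\n</body>" % sandboxes
```

Because every fragment is balanced after cleaning, each </sandbox> emitted
here is guaranteed to close the sandbox opened just before it.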

When the page is rendered in a modern user agent which supports  
sandboxing, the safe-onclick attribute is interpreted exactly the same as  
onclick. When the user clicks the link, the event handler is executed.  
Because the code is inside the sandbox, it operates on a fake document  
object, so it doesn't retrieve the cookies (I think document.cookie should  
just return an empty string). The visitor's session cookies are safe.

When the page is rendered in an older user agent which doesn't support  
sandboxing, the safe-onclick attribute is ignored because it is unknown.  
When the user clicks the link, no event handler is executed, and the  
cookies are safe again.


-- 
Opera M2 8.5 on Debian Linux 2.6.12-1-k7
* Origin: X-Man's Station [ICQ: 115226275] <alexey at feldgendler.ru>
Received on Thursday, 9 March 2006 08:57:31 UTC