Re: E4H and constructing DOMs

From: Maciej Stachowiak <mjs@apple.com>
Date: Thu, 07 Mar 2013 20:46:29 -0800
Cc: Jonas Sicking <jonas@sicking.cc>, Mike Samuel <mikesamuel@gmail.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>
Message-id: <5A6ED7F3-6E7A-4F7C-8EC8-EF6914E85984@apple.com>
To: "Mark S. Miller" <erights@google.com>


I strongly suspect there are more bugs than the one I found, as the regexp looks way too simple to capture the full behavior of the relevant HTML tokenizer states. Regrettably I do not have the time or expertise to hunt for more.


On Mar 7, 2013, at 8:36 PM, "Mark S. Miller" <erights@google.com> wrote:

> Hi Maciej,
> Please report it at https://code.google.com/p/google-caja/issues/list and select the Template: Private Issue.
> Thanks!
> On Thu, Mar 7, 2013 at 8:22 PM, Maciej Stachowiak <mjs@apple.com> wrote:
> On Mar 7, 2013, at 7:57 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> On Thu, Mar 7, 2013 at 5:55 PM, Mike Samuel <mikesamuel@gmail.com> wrote:
>>> 2013/3/7 Adam Barth <w3c@adambarth.com>:
>>>> On Thu, Mar 7, 2013 at 5:18 PM, Adam Barth <w3c@adambarth.com> wrote:
>>>>> I don't think I fully understood your message because it was quite
>>>>> long and contained many complex external references.  What I've
>>>>> understood you to say is that you've managed to work around the
>>>>> limitations of the current string-based template design by building a
>>>>> complex mechanism for automatically escaping untrusted data.
>>>> As an example, in browsing the source code of the autoescaping code
>>>> you referenced, I found the following line:
>>>> var HTML_TAG_REGEX_ = /<(?:!|\/?[a-z])(?:[^>'"]|"[^"]*"|'[^']*')*>/gi;
>>>> As famously written on Stack Overflow [1], "Regex is not a tool that
>>>> can be used to correctly parse HTML."
>>> That doesn't apply since this is not parsing, it is lexing, and
>>> regular expressions can be used to lex HTML.
>> Actually, no, you can't. For example, the lexing of the contents of
>> <script> elements is quite complex.
> For further reference, tokenizing HTML looks like this: <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization>.
> It superficially looks like an FSM, so it seems tempting to process it with a regexp, but interaction with tree construction makes it non-regular.
> Even if you ignore the non-regular bits, translating it to a regexp is hard. For example, with a few minutes' study I found a string that the HTML spec and all browsers treat as an HTML open tag but that is not matched by the regexp Adam quoted. I assume this is likely a security flaw in the library it comes from. I am not sure whether it's OK to post bug reports here or whether there is some private channel for disclosing the security bug; I'll gladly report it if someone tells me how.
> Regards,
> Maciej
> -- 
>     Cheers,
>     --MarkM
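
The <script> point made in the quoted thread can be illustrated with a minimal sketch. It applies the regexp exactly as quoted above to a short script element. The helper name findTagLikeSpans and the sample string are hypothetical, chosen only for this illustration, and this is not the undisclosed string mentioned in the thread. An HTML tokenizer is in the script data state between <script> and </script>, so it treats "a<b && c>d" as plain text, whereas the regexp has no notion of that state:

```javascript
// The regexp as quoted in the thread (the trailing underscore is dropped
// from the name; the pattern itself is unchanged).
const HTML_TAG_REGEX = /<(?:!|\/?[a-z])(?:[^>'"]|"[^"]*"|'[^']*')*>/gi;

// Hypothetical helper: returns every substring the regexp considers a tag.
function findTagLikeSpans(html) {
  return html.match(HTML_TAG_REGEX) || [];
}

// Hypothetical sample: the script body contains "<b && c>", which is
// JavaScript source text, not markup.
const scriptSample = '<script>if (a<b && c>d) { run(); }</script>';
const scriptMatches = findTagLikeSpans(scriptSample);

// The regexp reports three "tags"; a conforming tokenizer emits only the
// <script> start tag and the </script> end tag, with text in between.
console.log(scriptMatches);
// prints [ '<script>', '<b && c>', '</script>' ]
```

The middle match is a false positive: deciding that it is text requires knowing that a <script> start tag was seen earlier, which is the kind of context a single stateless pattern does not carry.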

Received on Friday, 8 March 2013 04:47:55 UTC