Re: Re: E4H and constructing DOMs

From: Adam Barth <w3c@adambarth.com>
Date: Thu, 7 Mar 2013 17:37:07 -0800
Message-ID: <CAJE5ia-=FfJ_yfQXA_gh511MHhT8OL7sj8Tom62g57Rosg8q4g@mail.gmail.com>
To: mikesamuel <mikesamuel@gmail.com>
Cc: Brendan Eich <brendan@secure.meer.net>, Ian Hickson <ian@hixie.ch>, Rick Waldron <waldron.rick@gmail.com>, Ojan Vafai <ojan@chromium.org>, "rafaelw@chromium.org" <rafaelw@chromium.org>, Adam Klein <adamk@chromium.org>, Anne van Kesteren <annevk@annevk.nl>, Alex Russell <slightlyoff@chromium.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>
On Thu, Mar 7, 2013 at 5:18 PM, Adam Barth <w3c@adambarth.com> wrote:
> I don't think I fully understood your message because it was quite
> long and contained many complex external references.  What I've
> understood you to say is that you've managed to work around the
> limitations of the current string-based template design by building a
> complex mechanism for automatically escaping untrusted data.

As an example, in browsing the source code of the autoescaping code
you referenced, I found the following line:

var HTML_TAG_REGEX_ = /<(?:!|\/?[a-z])(?:[^>'"]|"[^"]*"|'[^']*')*>/gi;

As famously written on Stack Overflow [1], "Regex is not a tool that
can be used to correctly parse HTML."
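
To make the concern concrete, here is a small sketch of one input on which a tag-matching regex of this shape and an HTML parser disagree. The regex is the one quoted above; the `stripTags` helper and the sample input are made up for illustration and are not taken from the referenced library, which may well use the regex differently.

```javascript
// The regex quoted above, copied verbatim.
const HTML_TAG_REGEX = /<(?:!|\/?[a-z])(?:[^>'"]|"[^"]*"|'[^']*')*>/gi;

// Hypothetical helper (not from the referenced code): delete everything the
// regex considers a tag.
function stripTags(html) {
  return html.replace(HTML_TAG_REGEX, '');
}

// An HTML parser treats "<!-- a > b -->" as one comment, which ends at "-->".
// The regex instead stops at the first ">", so part of the comment survives.
const input = '<!-- a > b --><b>';
console.log(JSON.stringify(stripTags(input))); // " b -->"
```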

In any case, we shouldn't require folks to write a thousand lines of
JavaScript to use ECMAScript templates to safely produce HTML.  That's
a clear signal that we should revisit the design of the template system.


[1] http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

> Rather than forcing authors to layer complex (and therefore
> error-prone) systems on top of a string-based template system, we
> should instead provide authors with an AST-based template system that
> avoids these security pitfalls.
> Adam
> On Thu, Mar 7, 2013 at 5:02 PM, Mike Samuel <mikesamuel@gmail.com> wrote:
>> Adam,
>> I wrote some of the string template proposal, rewrote the template
>> system that Google+ used to take the burden of XSS safety off app
>> developers' shoulders, and more generally work on programming-language
>> & tool approaches to software security.
>> On Thu, 7 Mar 2013 Adam Barth said
>>> The general problem with template strings is that they're an XSS risk.
>>>  Essentially, we're encouraging authors to mix untrusted data into
>>> strings that will later be parsed by the HTML parser.  If the attacker
>>> is clever in selecting these untrusted strings, he'll be able to cause
>>> the remainder of the string to be parsed differently than the author
>>> intends.
>> Are you familiar with
>> https://js-quasis-libraries-and-repl.googlecode.com/svn/trunk/index.html
>> ?
>> The "Safe HTML with bad inputs" example shows contextual auto-escaping
>> using string templates.
>>> var firstName = [...];
>>> var lastName = [...];
>>> header.innerHTML = `<h1>Welcome ${ firstName } ${ lastName }!</h1>`;
>>> If firstName and lastName are user-controlled (i.e., untrusted),
>>> the above is an XSS vulnerability.  For example, the attacker can set
>>> firstName to "<img onerror='alert(/pwned/)'>".
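
A minimal sketch of why this is exploitable, runnable outside a browser: an untagged template literal is just string interpolation, so the attacker's markup reaches the HTML parser intact. The value of `lastName` below is an arbitrary placeholder.

```javascript
// Attacker-chosen value from the example above; lastName is arbitrary.
const firstName = "<img onerror='alert(/pwned/)'>";
const lastName = 'Smith';

// An untagged template literal performs plain string interpolation.
const markup = `<h1>Welcome ${ firstName } ${ lastName }!</h1>`;

// The resulting string contains the attacker's <img> tag and its event
// handler attribute. Assigning it to innerHTML would hand that tag to the
// HTML parser as markup, not as text.
console.log(markup);
console.log(markup.includes('<img onerror=')); // true
```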
>> I strongly agree that safety should be the default.
>> I would very much like the default to be overridable, so that it can be
>> a late-binding producer of string-like values that distinguishes trusted
>> substrings so that they can be auto-escaped based on context as
>> described at http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html
>> I think targeting popular libraries is the best way to get this, since
>> one typically wants to be able to push a new version of
>> security-sensitive code more quickly than one pushes new language
>> specifications.
>>> We have lots of implementation experience with these sorts of
>>> string-based template systems because they're widely used in languages
>>> like PHP.  Our broad experience is that they lead to buggy, XSS-prone
>>> code.
>>> The general anti-pattern to avoid is the following:
>>>     template + input ->  string ->  HTML parser ->  DOM
>>> A more secure approach is to first parse the template into a DOM and
>>> then add the untrusted input into the DOM as text nodes.  In this
>>> approach, the attacker's maliciously crafted firstName would simply
>>> end up as a text node and would not execute as script.  (You might or
>>> might not like other aspects of E4H, but one of its virtues is that it
>>> follows this more secure pattern.)
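
The safer pattern can be sketched without a browser by using a tiny stand-in for the DOM. Everything here (`el`, `text`, `serialize`, `welcome`) is a hypothetical helper invented for illustration, not E4H or any real API. The point is only that untrusted input enters the tree as a text node and is never re-parsed as markup.

```javascript
// Hypothetical, minimal stand-in for DOM nodes.
const el = (tag, ...children) => ({ type: 'element', tag, children });
const text = (value) => ({ type: 'text', value: String(value) });

// Escape the characters that are significant in HTML text content.
const escapeText = (s) =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Serialize the tree; text nodes are always escaped, so their content
// cannot introduce new elements or attributes.
function serialize(node) {
  if (node.type === 'text') return escapeText(node.value);
  return `<${node.tag}>${node.children.map(serialize).join('')}</${node.tag}>`;
}

// The template structure is fixed by the author; inputs are only text nodes.
function welcome(firstName, lastName) {
  return el('h1', text('Welcome '), text(firstName), text(' '),
            text(lastName), text('!'));
}

const tree = welcome("<img onerror='alert(/pwned/)'>", 'Smith');
console.log(serialize(tree));
// <h1>Welcome &lt;img onerror='alert(/pwned/)'&gt; Smith!</h1>
```

In a browser the same idea corresponds to `document.createTextNode(firstName)` or assigning to `textContent` rather than `innerHTML`.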
>> The DOM approach suffers from several drawbacks:
>> 1. It's resistant to XSS but not robust since it doesn't deal with
>> embedded languages.  It trivially fails when substitutions appear
>> inside URI attributes, or text nodes inside a script or style
>> attribute.
>> 2. It's tied to a particular language.  If we wouldn't introduce new
>> syntax specifically for SQL prepared statements, we shouldn't do it
>> for the HTML equivalent and instead come up with a single syntactic
>> construct that allows safe composition in any language.
>> 3. It fails the ubiquitous <header><body><footer> pattern as described
>> at https://js-quasis-libraries-and-repl.googlecode.com/svn/trunk/safetemplate.html
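
Drawback 1 can be illustrated with a short sketch. HTML-escaping (or inserting through a DOM attribute setter) leaves a `javascript:` URL untouched, because nothing in it is special to HTML; the danger lies in the embedded URL language. The `escapeAttr` and `filterUrl` helpers and the scheme allowlist below are assumptions made for illustration only.

```javascript
// HTML attribute-value escaping: correct for HTML, blind to URL semantics.
const escapeAttr = (s) =>
  String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;')
           .replace(/>/g, '&gt;').replace(/"/g, '&quot;')
           .replace(/'/g, '&#39;');

// Assumed allowlist of URL schemes; a real policy may differ.
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];

// Hypothetical URL-context filter: if a ":" appears before any "/", "?" or
// "#", the text before it is a scheme and must be on the allowlist.
function filterUrl(url) {
  const s = String(url);
  const i = s.search(/[:/?#]/);            // first URL delimiter, if any
  if (i === -1 || s[i] !== ':') return s;  // no scheme: treat as relative
  const scheme = s.slice(0, i).toLowerCase();
  return ALLOWED_SCHEMES.includes(scheme) ? s : 'about:invalid';
}

const untrusted = 'javascript:alert(1)';
console.log(escapeAttr(untrusted));            // javascript:alert(1)
console.log(escapeAttr(filterUrl(untrusted))); // about:invalid
console.log(escapeAttr(filterUrl('https://example.com/?a=1&b=2')));
// https://example.com/?a=1&amp;b=2
```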
>> The DOM approach can be generalized to a parse-tree approach to solve
>> embedded languages as done by Yesod (
>> http://yannesposito.com/Scratch/en/blog/Yesod-tutorial-for-newbies/#bulletproof
>> ).
>> Yesod and similar approaches don't provide a good migration target for
>> existing ad-hoc composition methods and at the end of this email, I
>> include a mini-progress report on my attempt to comprehensively
>> address content-composition in a way that I believe is much easier to
>> use than Yesod.  I believe Yesod also requires significant
>> per-content-language work in the type-system and in hand-written
>> encoders, and would be impossible to port from Haskell to stringly
>> typed code.
>>> I understand that someone (either the author or the browser) could
>>> write an HTML tag for template strings that implements the more secure
>> Already done.  See link above.
>>> pattern, but most authors will simply use the default mode, which
>>> follows the insecure pattern.  As a result, this language feature will
>>> lead to many XSS vulnerabilities and general sadness in the world.
>> I disagree.  Without this, people will continue to use
>>    header.innerHTML = "<h1>Welcome " + firstName + " " +  lastName + "!</h1>";
>> leading to great sadness, or if templates are based on the HTML DOM,
>> we will just have other injection attacks instead still leading to
>> general sadness.
>> XSS is a special case of code injection, so to avoid "general sadness"
>> we need a principled approach to code injection that
>> 1. deals with embedded languages
>> 2. deals with multiple host languages, not just HTML
>> 3. involves language definers in safe composition without bloating
>> language specifications
>> 4. provides a path to provable safety from injection for those who
>> want to spend the time constructing the proofs
>> https://www.usenix.org/lets-parse-prevent-pwnage outlines Úlfar's and my
>> attempt to provide such a solution.  The basic idea is that we take a
>> language grammar like:
>>     HTMLTextNode := ([^<&] | CharacterReference)+;
>>     CharacterReference := "&lt;" | "&gt;" | ... | "&#" ([0-9]+) ";" | ...;
>> and add annotations that explain the relationship between substrings and data:
>>    HTMLTextNode := @String (@Char [^<&] | CharacterReference)+
>>    CharacterReference := @Char{"<"} "&lt;" | @Char{">"} "&gt;" | ... |
>>        "&#" (@ScalarCharValue [0-9]+) ";";
>> From such annotated grammars, we can generate code for encoders,
>> decoders, sanitizers, and template context functions in library
>> languages.
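
As a rough illustration of what generated code might do for the `HTMLTextNode` grammar above, here is a hand-written encoder and decoder pair. This is not output of the generator described in this message; the function names are invented, and `&amp;` is assumed to be among the character references elided by `...` in the grammar.

```javascript
// Encoder: string -> HTMLTextNode. The grammar forbids raw "<" and "&"
// in a text node, so both must become character references (">" is
// encoded as well, matching the "&gt;" alternative in the grammar).
function encodeHtmlTextNode(s) {
  return String(s).replace(/[<>&]/g, (c) =>
    c === '<' ? '&lt;' : c === '>' ? '&gt;' : '&amp;');
}

// Decoder: HTMLTextNode -> string. Handles only the references shown in
// the grammar plus the assumed "&amp;"; anything else is left unchanged.
function decodeHtmlTextNode(s) {
  return String(s).replace(/&(lt|gt|amp|#([0-9]+));/g, (m, name, digits) => {
    if (digits !== undefined) {
      const cp = Number(digits);
      // Leave out-of-range references untouched rather than throwing.
      return cp <= 0x10FFFF ? String.fromCodePoint(cp) : m;
    }
    return { lt: '<', gt: '>', amp: '&' }[name];
  });
}

const raw = 'a < b && c > d';
const encoded = encodeHtmlTextNode(raw);
console.log(encoded);                             // a &lt; b &amp;&amp; c &gt; d
console.log(decodeHtmlTextNode(encoded) === raw); // true
console.log(decodeHtmlTextNode('&#65;&#233;'));   // Aé
```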
>> I've got the encoder generator stuff done, have implemented VMs for
>> the decoders and sanitizers, and am finishing up the template context
>> functions.
>> I have some experience writing, maintaining and debugging such
>> grammars and am confident that the basic approach is workable.
>> Once I've done that, I hope to write code-generator backends for JS,
>> Java, Rust, Python.
>> Then, using a combination of syntactic plug-in points like JS string
>> templates, and Python-style % operator overloading, I hope to make
>> syntactically sugary and safe composition ubiquitously available so
>> that the app-developer community will have an answer to code injection
>> that is as easy as the "just use prepared statements" advice that is
>> widely dispensed for ad-hoc SQL query creation.
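
To show what such a syntactic plug-in point could look like, here is a minimal sketch of a template tag function. The name `safeHtml` is invented here, and the tag escapes every substitution for HTML text and quoted attribute-value context only. It is far simpler than the contextual auto-escaping discussed above and does nothing about URLs, scripts, or styles.

```javascript
// Escape a value for HTML text or a quoted attribute value.
const escapeHtml = (v) =>
  String(v).replace(/[&<>"']/g, (c) => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;',
  }[c]));

// Hypothetical tag function: literal chunks come from the author and are
// trusted; substituted values are untrusted and always escaped.
function safeHtml(strings, ...values) {
  return strings.reduce(
    (out, chunk, i) => out + escapeHtml(values[i - 1]) + chunk);
}

const firstName = "<img onerror='alert(/pwned/)'>";
const lastName = 'Smith';
console.log(safeHtml`<h1>Welcome ${ firstName } ${ lastName }!</h1>`);
// <h1>Welcome &lt;img onerror=&#39;alert(/pwned/)&#39;&gt; Smith!</h1>
```

The call site differs from the vulnerable version only by the tag name, which is the sense in which it resembles "just use prepared statements": the author keeps writing the template inline, and the separation of trusted structure from untrusted data happens in the tag.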
Received on Friday, 8 March 2013 01:38:08 UTC