Re: [webcomponents] More backward-compatible templates from Hajime Morrita on 2012-11-01 (public-webapps@w3.org from October to December 2012)

From: Hajime Morrita <morrita@google.com>
Date: Thu, 1 Nov 2012 16:42:45 +0100
To: Adam Barth <w3c@adambarth.com>
Cc: Maciej Stachowiak <mjs@apple.com>, Anne van Kesteren <annevk@annevk.nl>, "public-webapps@w3.org WG" <public-webapps@w3.org>
Message-ID: <CALzNm5qXuVWX+5GqZ6fFAvOEO0A-pX0BRAbe4iKvFS7ZivY_3g@mail.gmail.com>
A naive proposal: Can we introduce an alias of <script> element, like
<scr>, and ask template authors to use <scr> instead of <script> inside
<script template>?  Since <scr> is just an alias of <script>, authors can
use it even outside <script template>.

I guess it will confuse neither the tokenizer nor the existing parser, and it
will be polyfill-friendly. It isn't as clean as <template>, but not as ugly
as <+script> IMO. Obviously "scr" sounds ugly, so it would be great if there
were some unused, sensible name for it.
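
To make the polyfill-friendly part a bit more concrete, here is a minimal
sketch in JavaScript. It assumes a polyfill reads the raw text of a
<script template> element and rewrites the alias back to <script> before
parsing that text. The function name expandScrAlias is made up for
illustration, "scr" is only the placeholder name from above, and the regex
is deliberately naive, so treat this as a sketch rather than a real
implementation.

```javascript
// Hypothetical helper: rewrite the proposed <scr> alias to <script> in raw
// template text, as a polyfill might do before parsing the template.
// Naive on purpose: a regex cannot tell real tags from look-alike text
// inside attribute values or comments.
function expandScrAlias(templateSource) {
  return templateSource
    // Opening tags: "<scr" followed by whitespace, "/" or ">".
    .replace(/<scr(?=[\s\/>])/gi, "<script")
    // Closing tags: "</scr" followed by whitespace or ">".
    .replace(/<\/scr(?=[\s>])/gi, "</script");
}

const aliasedSource = '<h1>Inbox</h1><scr type="text/javascript">init();</scr>';
console.log(expandScrAlias(aliasedSource));
```

Existing <script> tags are left alone because the lookahead requires the tag
name to end right after "scr".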




On Thu, Nov 1, 2012 at 3:14 PM, Adam Barth <w3c@adambarth.com> wrote:

>
>
>
> On Thu, Nov 1, 2012 at 6:33 AM, Maciej Stachowiak <mjs@apple.com> wrote:
>
>>
>> On Nov 1, 2012, at 1:57 PM, Adam Barth <w3c@adambarth.com> wrote:
>>
>>
>>>
>>> (5) The nested template fragment parser operates like the template
>>> fragment parser, but with the following additional difference:
>>>      (a) When a close tag named "+script" is encountered which does not
>>> match any currently open script tag:
>>>
>>
>> Let me try to understand what you've written here concretely:
>>
>> 1) We need to change the "end tag open" state to somehow recognize
>> "</+script>" as an end tag rather than as a bogus comment.
>> 2) When the tree builder encounters such an end tag in the ???? state(s),
>> we execute the substeps you've outlined below.
>>
>> The problem with this approach is that nested templates parse differently
>> than top-level templates.  Consider the following example:
>>
>> <script type=template>
>>  <b
>> </script>
>>
>> In this case, none of the nested template parser modifications apply and
>> we'll parse this as normal for HTML.  That means the contents of the
>> template will be "<b" (let's ignore whitespace for simplicity).
>>
>> <script type=template>
>>   <h1>Inbox</h1>
>>   <script type=template>
>>     <b
>>   </+script>
>>  </script>
>>
>> Unfortunately, the nested template in this example parses differently
>> than it did when it was a top-level template.  The problem is that the
>> characters "</+script>" are not recognized by the tokenizer as an end tag
>> because they are encountered by the nested template fragment parser in the
>> "before attribute name" state.  That means they get treated as some sort of
>> bogus attributes of the <b> tag rather than as an end tag.
>>
>>
>> OK. Do you believe this to be a serious problem? I feel like
>> inconsistency in the case of a malformed tag is not a very important
>> problem, but perhaps there are cases that would be more obviously
>> problematic, or reasons not obvious to me to be very concerned about cases
>> exactly like this one.
>>
>
> It's going to lead to subtle parsing bugs in web sites, which usually
> means security vulnerabilities.  :(
>
>> Also: can you think of a way to fix this problem? Or alternately, do you
>> believe it's fundamentally not fixable? I've only spent a short amount of
>> time thinking about this approach, and I am not nearly as much an expert on
>> HTML parsing as you are.
>>
>
> I definitely see the appeal of trying to re-use <script> for templates.
>  Unfortunately, I couldn't figure out how to make it work sensibly with
> nested templates, which is why I ended up recommending that we use the
> <template> element.
>
> Another approach we considered was to separate out the "hide from legacy
> user agents" and the "define a template" operations.  That approach pushes
> you towards a design like
>
> <xmp>
>   <template>
>     <h1>Inbox</h1>
>     <template>
>       <h2>Folder</h2>
>     </template>
>   </template>
> </xmp>
>
> You could do the same thing with <script type=something>, but <xmp> is
> shorter (and currently unused).  This approach has a bunch of
> disadvantages, including being verbose and having some unexpected parsing:
>
> <xmp>
>   <template>
>     <div data-foo="<xmp>bar</xmp>">
>       This text is actually outside the template!
>     </div>
>   </template>
> </xmp>
>
> The <script type=template> has similar problems, of course:
>
> <script type=template>
>   <div data-foo="<script>bar</script>">
>     This text is actually outside the template!
>   </div>
> </script>
>
> Perhaps developers have a clearer understanding of such problems from
> having to escape </script> in JavaScript?
>
> All this goofiness eventually convinced me that if we want to support
> nested templates, we ought to use the usual nesting mechanics of HTML,
> which leads to a design like <template> that nests like a normal tag.
>
>>>          (a.i) Consume the token for the close tag named "+script".
>>>          (a.ii) Create a DocumentFragment containing the parsed contents
>>> of the fragment.
>>>          (a.iii) [return to the parent template fragment parser] with
>>> the result of step (a.ii) with the parent parser to resume after the
>>> "+script" close tag.
>>>
>>>
>>> This is pretty rough and I'm sure I got some details wrong. But I
>>> believe it demonstrates the following properties:
>>> (B) Allows for perfect fidelity polyfills, because it will manifestly
>>> end the template in the same place that an unaware browser would close the
>>> <script> element.
>>> (C) Does not require multiple levels of escaping.
>>> (A) Can be implemented without changes to the core HTML parser (though
>>> you'd need to introduce a new fragment parsing mode).
>>>
>>
>> I suspect we're quibbling over "no true Scotsman" semantics here, but you
>> obviously need to modify both the HTML tokenizer and tree builder for this
>> approach to work.
>>
>>
>> In principle you could create a whole separate tokenizer and tree
>> builder. But obviously that would probably be a poor choice for a native
>> implementation compared to adding some flags and variable behavior. I'm not
>> even necessarily claiming that all the above properties are advantages, I
>> just wanted to show that there need not be a multi-escaping problem nor
>> necessarily scary complicated changes to the tokenizer states for <script>.
>>
>> I think the biggest advantage to this kind of approach is that it can be
>> polyfilled with full fidelity. But I am in no way wedded to this solution
>> and I am intrigued at the mention of other approaches with this property.
>> The others I know of (external source only, srcdoc like on iframe) seem
>> clearly worse, but there might be other bigger ones.
>>
>
> The xmp-like wrapper also can be polyfilled, as can approaches based on
> HTML comments or attributes.  There's a trade-off, however.  In the long
> view, it's not clear to me how important polyfillability is for a feature.
>  It certainly makes adoption easier in the short term, but if we constrain
> ourselves to designing only features that can be polyfilled at each step,
> we'll end up with a contorted platform.
>
>
>>> (D) Can be implemented with near-identical behavior for XHTML, except
>>> that you'd need an XML fragment parser.
>>>
>>
>> The downside is that nested templates don't parse the same as top-level
>> templates.
>>
>>
>> Indeed. That is in addition to the previously conceded downsides that the
>> syntax is somewhat less congenial.
>>
>> Another issue is that you've also introduced the following security risk:
>>
>> Today, the following line of JavaScript is safe to include in an inline
>> script tag:
>>
>> var x = "</+script><img onerror=alert(1)>";
>>
>> Because that line does not contain "</script>", the string "alert(1)"
>> will be treated as the contents of a string.  However, if that line is
>> included in an inline script inside of a template, the modifications to
>> the parser above will mean that alert(1) will execute as JavaScript rather
>> than being treated as a string, introducing an XSS vector.
>>
>>
>> I don't follow. Can you give a full example of how this would be included
>> in a template and therefore be executed?
>>
>
> <script>
> x =  "</+script><img onerror=alert(1)>"; // This is safe, there is no script execution.
> </script>
>
> <script type=template id=a>
>    <script>
>     x =  "</+script><img onerror=alert(1)>"; // This is not safe, the alert(1) executes as script.
>   </+script>
> </script>
>
> <script>
> document.body.appendChild(document.getTemplateById("a").instantiate());
> </script>
>
> You should imagine, of course, the string not being written literally by
> the developer but instead generated on the server side by some code that
> knows how to escape strings for use in inline script tags (e.g., by
> escaping "\" and "</script").  It's certainly possible to defend against
> this XSS vector on the server, but it's one more XSS vector to worry about.
>
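
A rough JavaScript sketch of the server-side escaping point quoted above.
Both function names are hypothetical, and the "broad" variant is my own
assumption about one possible defence, not something proposed in this
thread. The first function stands in for an escaper that only knows about
"\" and "</script"; the second escapes every "<", so neither "</script" nor
"</+script" can appear in the emitted string literal.

```javascript
// Hypothetical narrow escaper: JSON.stringify handles quotes and "\",
// then only the literal "</script" sequence is broken up.
function escapeNarrow(value) {
  return JSON.stringify(value).replace(/<\/script/gi, "<\\/script");
}

// Hypothetical broad escaper (an assumption, not from the thread):
// replace every "<" with its \u003c escape inside the string literal.
function escapeBroad(value) {
  return JSON.stringify(value).replace(/</g, "\\u003c");
}

const payload = "</+script><img onerror=alert(1)>";

// The narrow escaper leaves "</+script" in the output untouched.
console.log(escapeNarrow(payload).includes("</+script"));
// The broad escaper removes it, and the literal still decodes to the payload.
console.log(escapeBroad(payload).includes("</+script"));
console.log(JSON.parse(escapeBroad(payload)) === payload);
```

The narrow version is fine for today's parsers, which is exactly why a new
"</+script" end tag would add one more case for such code to handle.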
>>> I hope this clarifies the proposal.
>>>
>>> Notes:
>>> - Just because it's described this way doesn't mean it has to be
>>> implemented this way - implementations could do template parsing in a
>>> single pass with HTML parsing if desired. I wrote it this way mainly to
>>> demonstrate the desired properties.
>>>
>>
>> I'm not sure how we'd be able to do that without running multiple copies of
>> the tokenizer state machine in parallel.  The tokenizer states for the
>> template fragment parser aren't going to line up in any meaningful way with
>> the top-level tokenizer's search for an appropriate end tag to escape from
>> the script data states.
>>
>>
>> I'm pretty confident it's *possible* to do a one-pass version of this
>> algorithm, but I am not sure if it is easy, or if it is desirable.
>>
>> What I actually imagined (knowing much less about HTML parsing than you)
>> was that you'd enter a different tokenizer state after encountering a
>> <script template> than a <script>. But defining that state would be
>> challenging.
>>
>
> Sure, but you're going to need a 17x bigger tokenizer state machine
> because you'll need to track all 17 script data states for the top-level
> tokenizer at the same time as you're tracking all the states for the
> template fragment tokenizer.
>
> Adam
>
>


-- 
morrita
Received on Thursday, 1 November 2012 15:43:14 UTC