Re: Proposal: Document.parse() [AKA: Implied Context Parsing] from Ian Hickson on 2012-06-04 (public-webapps@w3.org from April to June 2012)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 4 Jun 2012 22:47:43 +0000 (UTC)
To: Rafael Weinstein <rafaelw@google.com>
cc: Webapps WG <public-webapps@w3.org>
Message-ID: <Pine.LNX.4.64.1206042145240.378@ps20323.dreamhostps.com>

On Fri, 25 May 2012, Rafael Weinstein wrote:
>
> Now's the time to raise objections to UA's adding support for this 
> feature.

For the record, I very much object to Document.parse(). I think it's a 
terrible API. We should IMHO resolve the use case of "generate a DOM tree 
from script" using a much more robust solution that has compile-time 
syntax checking and so forth, rather than relying on the super-hacky 
"concatenate a bunch of strings and then parse them" solution that authors 
are forced to use today.

innerHTML and document.write() are abominations unto computer science, and 
we are doing nobody any favours by continuing the platform down this road. 
They lead to programming styles that are rife with injection bugs (XSS), 
they are extremely difficult to debug and maintain, and they are terribly 
complicated to implement compared to more structured alternatives. The 
core reasons for these problems, IMHO, are two-fold:

 1. Lack of compile-time syntax checking, which leads to typos not being 
    caught and thus programmer intent not being faithfully represented, 
    and
 2. Putting markup syntax and data at the same level, instead of
    separating them as with other features in JS.

For example, this kind of bug is easy to introduce and hard to spot or 
debug:

   var heading = '<h1>Hello</h1>';
   // ...
   div.innerHTML = '<h1>' + heading + '</h1>';

Even worse are things like typos:

   tr.innerHTML = '<td>' + c1 + '</td><td>' + c2 + '</td><dt>' + c3 + '</td>';

Compile-time syntax checking makes this a non-issue. Making data variables 
be qualitatively different than the syntax also solves problems, e.g.:

   var title = "I hate </p> tags.";
   // ...
   div.innerHTML = "<p>Today's topic is: " + title + "</p>"; // oops, not escaped
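
To make the failure mode concrete, here is a minimal sketch that evaluates
the three concatenations above as plain strings, without a DOM. The cell
values 'a', 'b' and 'c' are placeholders assumed for illustration. Nothing
here is flagged by the JS engine; each statement just builds a string.

```javascript
// Bug 1: heading already contains markup, so the <h1> is doubled.
var heading = '<h1>Hello</h1>';
var doubled = '<h1>' + heading + '</h1>';

// Bug 2: the <dt> typo is just characters inside a string literal.
var c1 = 'a', c2 = 'b', c3 = 'c'; // placeholder data, assumed
var row = '<td>' + c1 + '</td><td>' + c2 + '</td><dt>' + c3 + '</td>';

// Bug 3: the data contains a </p>, which ends the paragraph early.
var title = 'I hate </p> tags.';
var para = "<p>Today's topic is: " + title + '</p>';

console.log(doubled); // <h1><h1>Hello</h1></h1>
console.log(row);     // <td>a</td><td>b</td><dt>c</td>
console.log(para);    // <p>Today's topic is: I hate </p> tags.</p>
```

The problems only surface later, when the HTML parser tries to make sense
of the resulting string.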

There have been several alternative proposals; my personal favourite is 
Anne's E4H solution, basically E4X but simplified just for HTML, which 
I've written a strawman spec for here:

   http://www.hixie.ch/specs/e4h/strawman

I'm happy to write a more serious spec for this if this is something 
anyone is interested in implementing. With E4H, the above examples become 
much easier to debug. In the first one, the stray markup shows up as very 
ugly literal text in the output of the page, rather than merely as weird 
spacing:

   var heading = '<h1>Hello</h1>';
   // ...
   div.appendChild(<h1>{heading}</h1>);

The second results in a compile-time syntax error so would be caught even 
before the code is reviewed:

   tr.appendChild(<><td>{c1}</td><td>{c2}</td><dt>{c3}</td></>);

The third becomes a non-issue because you don't need to escape text to 
prevent it from being mistaken for markup [1]:

   var title = "I hate </p> tags.";
   // ...
   div.appendChild(<p>Today's topic is: {title}</p>);
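
E4H literals cannot be run in today's engines, but the behaviour of the
first and third examples can be approximated in plain JS. The sketch below
models a literal as a tree in which every interpolated string is a text
node, and only the serializer decides how text is written out. The names
el() and serialize() are made up for this illustration and are not taken
from the strawman; real E4H would produce DOM nodes, not plain objects.

```javascript
// Hypothetical stand-in for an E4H literal: el('p', 'text', child, ...).
// String children are stored as text, never parsed as markup.
function el(tag) {
  var children = Array.prototype.slice.call(arguments, 1);
  return { tag: tag, children: children };
}

// Serialize the tree, escaping text nodes so they cannot become markup.
function serialize(node) {
  if (typeof node === 'string') {
    return node.replace(/&/g, '&amp;')
               .replace(/</g, '&lt;')
               .replace(/>/g, '&gt;');
  }
  return '<' + node.tag + '>' +
         node.children.map(serialize).join('') +
         '</' + node.tag + '>';
}

// First example: the stray markup becomes visible text.
var heading = '<h1>Hello</h1>';
var first = serialize(el('h1', heading));

// Third example: the </p> inside the data stays data.
var title = 'I hate </p> tags.';
var third = serialize(el('p', "Today's topic is: ", title));

console.log(first); // <h1>&lt;h1&gt;Hello&lt;/h1&gt;</h1>
console.log(third); // <p>Today's topic is: I hate &lt;/p&gt; tags.</p>
```

In this model a mismatched pair such as <dt>...</td> cannot be expressed
at all, since each el() call opens and closes the same tag. That is only
a loose analogue of the compile-time error E4H would give for the second
example.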

Other proposed solutions include Element.create(), which is less verbose 
than the DOM but still more verbose than innerHTML or E4H; and 
quasistrings, which still suffer from lack of compile-time checking and 
mix markup with data, but at least would be more structured than raw 
strings and could offer better injection protection.
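
As a rough illustration of that trade-off, here is a sketch assuming the
quasistring idea in the form that later landed in JavaScript as tagged
template literals. The html tag function and the escapeText helper are
hypothetical names, not part of any proposal discussed here, and the
escaping covers element text only.

```javascript
// Escape text for use as element content (not attributes, CSS, or script).
function escapeText(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical tag function: literal parts pass through unchanged,
// substituted values are escaped before being spliced in.
function html(strings) {
  var out = strings[0];
  for (var i = 1; i < arguments.length; i++) {
    out += escapeText(arguments[i]) + strings[i];
  }
  return out;
}

// Injection protection: the data can no longer close the paragraph.
var title = 'I hate </p> tags.';
var para = html`<p>Today's topic is: ${title}</p>`;

// No syntax checking: the markup is still an unchecked string,
// so the <dt> typo survives untouched.
var c3 = 'c'; // placeholder data, assumed
var cell = html`<dt>${c3}</td>`;

console.log(para); // <p>Today's topic is: I hate &lt;/p&gt; tags.</p>
console.log(cell); // <dt>c</td>
```

The data is kept apart from the markup at the point of substitution, but
the markup itself is never validated before it reaches the HTML parser.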

[1] (This is not the same as auto-escaping strings in other contexts. For 
example, E4H doesn't propose to have CSS literals, so a string embedded in 
a style="" attribute wouldn't be automagically safe.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 4 June 2012 22:48:07 UTC