Re: Issues arising from not reparsing

On Thu, 13 Aug 2009, Henri Sivonen wrote:
> 
> The current magic without all the magic that current browsers implement 
> leads to some incompatibilities with existing content. I don't know how 
> often a user would hit these issues, but when the problems do occur, 
> they wreck the whole page. Therefore, I think we should seriously try to 
> improve the magic so that it substitutes the current browser magic 
> better in practice while still not doing reparsing.
> 
> Here are points that need research, in my opinion:
> 
>  1) Would removing the escape flag from xmp, title and textarea improve 
> or degrade Web compat given no reparsing? To research this, I suggest 
> parsing a substantial body of Web content with the current parsing 
> algorithm and then grepping the text content of every xmp element for 
> |<!--.*</xmp| (ignoring case and letting . match over line breaks). 
> (Likewise for textarea and title, except rejecting hits where any part 
> of "<!--" or "</title" has been entity-escaped.) Basically, if there are 
> almost no hits, it would be safer to zap the escape flag from these 
> elements, because accidentally having <!-- eat up the rest of the page 
> is worse than terminating one of these elements prematurely very rarely.
> 
>  2) Would making comments and escape runs close on --\s+!> improve or 
> degrade Web compat given no reparsing? To research this, I suggest 
> grepping a substantial body of Web content for |--\s+!>| and analyzing 
> the hits.
> 
>  3) Would making --!> and --\s+> close escapes improve or degrade Web 
> compat given no reparsing? To research this, I suggest parsing a 
> substantial body of Web content with the current parsing algorithm and 
> then grepping the text content of every script and style element for 
> |--!>| and |--\s+>| and analyzing the hits.
> 
>  4) Would making <!-- not open an escape when there's non-whitespace on 
> the line before it improve or degrade Web compat given no reparsing? To 
> research this, I suggest parsing a substantial body of Web content with 
> the current parsing algorithm and then grepping the text content of 
> every script and style element for |^.*\S.*<!--| and analyzing the hits.
> 
> Hixie, have you already run these analyses? If not, it would be awesome 
> if someone who already maintains the capability to run these searches 
> could run them. (I volunteer to perform the "analyze the hits" parts, 
> but I don't currently have the readiness to run the searches.)

Unfortunately, my own parser to do these studies is woefully out of date, 
and I don't have the time to bring it up to date or to adapt an existing 
implementation into the framework.

An alternative option, however, would be to instrument the Mozilla parser 
to perform these tests, and then report the data so collected by nightly 
build users back to a central server. Would that work?
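
For concreteness, the four searches quoted above might be sketched in 
Python roughly as follows. This is only an illustration of the patterns, 
not anything taken from an actual parser: the labels and the find_hits 
helper are made up for the example, and it assumes the text content of 
the relevant element (or the raw page, for point 2) has already been 
extracted by whatever parser is being instrumented. Only the xmp form of 
pattern 1 is shown; textarea and title would follow the same shape.

```python
import re

# Patterns transcribed from the four research points quoted above.
# The labels are hypothetical names chosen for this sketch.
PATTERNS = {
    # 1) "<!--" followed later by "</xmp": case-insensitive, "." spans lines.
    "1-escape-then-end-tag": re.compile(r"<!--.*</xmp", re.IGNORECASE | re.DOTALL),
    # 2) "--", whitespace, then "!>".
    "2-dash-dash-space-bang": re.compile(r"--\s+!>"),
    # 3) "--!>" and "--", whitespace, then ">".
    "3a-dash-dash-bang": re.compile(r"--!>"),
    "3b-dash-dash-space-gt": re.compile(r"--\s+>"),
    # 4) "<!--" preceded by non-whitespace on the same line.
    "4-text-before-escape": re.compile(r"^.*\S.*<!--", re.MULTILINE),
}


def find_hits(text):
    """Return the sorted labels of every pattern that matches text."""
    return sorted(label for label, rx in PATTERNS.items() if rx.search(text))
```

An instrumented build would presumably only need to report which labels 
matched for a given page, which keeps the data sent back small; the 
"analyze the hits" step would still need the matching pages themselves.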

I would love to be able to make this stuff simpler.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 4 September 2009 01:15:43 UTC