Re: An HTML language specification vs. a browser specification from Jonas Sicking on 2008-11-16 (public-html@w3.org from November 2008)

From: Jonas Sicking <jonas@sicking.cc>
Date: Sun, 16 Nov 2008 00:58:36 -0800
To: "Roy T. Fielding" <fielding@gbiv.com>
CC: HTML WG <public-html@w3.org>
Message-ID: <491FE0BC.7080402@sicking.cc>

Roy T. Fielding wrote:
> On Nov 14, 2008, at 11:24 PM, Jonas Sicking wrote:
>> How browsers parse HTML *is* how HTML must be parsed. Or at least that
>> is the case if you presume that current HTML has been written for
>> browsers and by testing what works in current browsers.
> 
> Which is obviously false.  Most content is written programatically or
> for tools that existed in the distant past

This is an interesting assertion. If you are correct then I stand
corrected and we need to seriously revise how we define HTML5. However,
before we do that I think we should try to get some data, as I'm
(obviously :) ) less convinced than you are.

What non-browser tools of the distant past was this content created for?
Do you have any idea how much content was created for these tools? Or
any ideas for how we would measure that?

I think we can get some estimates of how much content has been created
for browsers by examining the number of pages in the index of the
various search engines. I bet Hixie could get at least an approximation
of the number of pages in Google's index and how many of those look like
they were intended to be consumed by a browser.

> (none of my content, for
> example, has ever been written by testing what works in current
> browsers even back in the days when current actually meant something).

My data shows that your pattern is an exception. Many, many pages on the
web break if you don't use the complex parsing algorithm that we use today.

When Netscape decided to rewrite their browser engine and use what has
become Gecko (the engine used by Firefox), one of the biggest problems
with gaining market share was compatibility with existing pages, even
though the new engine was perfectly able to parse HTML 4 by spec.

In fact, we can still see this today. While Firefox now has a worldwide
market share of about 20%, our market share in many countries in Asia is
tiny. Our market research data has shown that the main reason for that
is website compatibility, even though Firefox parses valid HTML4 very
well.

So while I'm thankful that you used better development strategies than 
simply testing what works in current browsers, our data shows that most 
people don't. Unfortunately.

> That's why my content doesn't have to be regenerated every six months.

I don't understand this statement. No content needs to be regenerated
every six months as far as I know. Browsers don't change their parsing
algorithms significantly.

Any time we do change our parsing algorithm we do it in order to support
more pages out there. Most of those pages are very old, and all of them
work in other browsers. And any time we change our parsing algorithm we
are very worried about it breaking other pages on the web.

Browser developers more than anyone have a reason to dislike the state of
HTML parsing. We are the ones who have to write and debug the complex
code to do so.

The reason we parse HTML the way we do is because our customers ask for 
it. They have clearly told us several times that the reason they use our 
products is to view pages on the web. If these pages do not work, the 
browser is useless to them and they go seek other options.

/ Jonas

Received on Sunday, 16 November 2008 09:00:34 UTC