- From: Jonas Sicking <jonas@sicking.cc>
- Date: Sun, 16 Nov 2008 00:58:36 -0800
- To: "Roy T. Fielding" <fielding@gbiv.com>
- CC: HTML WG <public-html@w3.org>
Roy T. Fielding wrote:

> On Nov 14, 2008, at 11:24 PM, Jonas Sicking wrote:
>
>> How browsers parse HTML *is* how HTML must be parsed. Or at least that
>> is the case if you presume that current HTML has been written for
>> browsers and by testing what works in current browsers.
>
> Which is obviously false. Most content is written programmatically or
> for tools that existed in the distant past

This is an interesting assertion. If you are correct, then I stand corrected and we need to seriously revise how we define HTML5. Before we do that, though, I think we should try to get some data, as I'm (obviously :) ) less convinced than you are.

What non-browser tools of the distant past was this content created for? Do you have any idea how much content was created for those tools? Or any ideas for how we could measure that?

I think we can estimate how much content has been created for browsers by examining the number of pages in the indexes of the various search engines. I bet Hixie could get at least an approximation of the number of pages in Google's index, and how many of those look like they were intended to be consumed by a browser.

> (none of my content, for
> example, has ever been written by testing what works in current
> browsers even back in the days when current actually meant something).

My data shows that your pattern is the exception. Many, many pages on the web break if you don't use the complex, error-tolerant parsing algorithm browsers use today (a small illustration follows this message). When Netscape decided to rewrite their browser engine into what became Gecko (the engine used by Firefox), one of the biggest problems with gaining market share was compatibility with existing pages, even though the new engine was perfectly able to parse HTML 4 per spec.

In fact, we can still see this today. While Firefox now has a worldwide market share of about 20%, our market share in many Asian countries is tiny. Our market research shows that the main reason for that is website compatibility, even though Firefox parses valid HTML 4 very well.

So while I'm thankful that you used better development strategies than simply testing what works in current browsers, our data shows that most people don't. Unfortunately.

> That's why my content doesn't have to be regenerated every six months.

I don't understand this statement. No content needs to be regenerated every six months as far as I know. Browsers don't change their parsing algorithms significantly, and any time we do change our parsing algorithm, we do it in order to support more of the pages out there. Most of those pages are very old, and all of them work in other browsers. Whenever we change the algorithm, we worry intensely about breaking other pages on the web.

Browser developers, more than anyone, have reason to dislike the state of HTML parsing: we are the ones who have to write and debug the complex code that implements it. The reason we parse HTML the way we do is that our customers ask for it. They have told us clearly, several times, that the reason they use our products is to view pages on the web. If those pages do not work, the browser is useless to them and they seek other options.

/ Jonas
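As a concrete illustration of the error recovery the message describes, here is a minimal sketch assuming Python's html5lib library, one implementation of the HTML5 parsing algorithm. The mis-nested markup is a hypothetical example of the kind of broken HTML real pages commonly contain, not taken from any specific page.

    # Minimal sketch: html5lib implements the HTML5 parsing algorithm,
    # including its error recovery. The markup below closes </b> before
    # closing </i>, which a parser written strictly to the HTML 4 spec
    # would have to treat as an error rather than render.
    import html5lib
    from xml.etree import ElementTree

    broken = "<p><b>bold <i>bold italic</b> italic?</i></p>"
    tree = html5lib.parse(broken, namespaceHTMLElements=False)
    print(ElementTree.tostring(tree, encoding="unicode"))
    # The parser silently repairs the tree, splitting the <i> element so
    # the result matches what legacy browsers rendered, roughly:
    #   <p><b>bold <i>bold italic</i></b><i> italic?</i></p>

That repair behavior, specified step by step for every malformed input, is exactly the "complex parsing algorithm" at issue in this thread.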
Received on Sunday, 16 November 2008 09:00:34 UTC