- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Mon, 17 Nov 2008 15:23:48 -0800
- To: Jonas Sicking <jonas@sicking.cc>
- Cc: HTML WG <public-html@w3.org>
On Nov 16, 2008, at 12:58 AM, Jonas Sicking wrote:

> Roy T. Fielding wrote:
>> On Nov 14, 2008, at 11:24 PM, Jonas Sicking wrote:
>>> How browsers parse HTML *is* how HTML must be parsed. Or at least
>>> that is the case if you presume that current HTML has been written
>>> for browsers and by testing what works in current browsers.
>>
>> Which is obviously false. Most content is written programmatically or
>> for tools that existed in the distant past.
>
> This is an interesting assertion. If you are correct then I stand
> corrected and we need to seriously revise how we define HTML5.
> However, before we do that I think we should try to get some data,
> as I'm (obviously :) ) less convinced than you are.
>
> What non-browser tools of the distant past was this content created
> for? Do you have any idea how much content was created for these
> tools? Or any ideas for how we would measure that?

Did I say non-browser tools of the distant past? No. MSIE6 is a tool
of the distant past. Firefox 1.0 is as well. Yet the vast majority of
non-program content out there was authored long before those two
browsers existed. Go ahead and check the last-modified timestamps.

You can bet that the content authored a year ago wasn't designed for
the browsers of a year ago either -- it was written using an
HTML-generating tool that was designed according to some old HTML
spec, hand-authored using old snippets of HTML knowledge gleaned from
any of a hundred books on the topic (none of which are "current"), or
cut and pasted from older sites. And, no, it wasn't done by "testing
what works in current browsers" -- most of it wasn't tested at all,
because the author has no control over the software used on the
publisher's website.
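[Checking a page's Last-Modified timestamp, as suggested above, takes one HEAD request; the sketch below parses a header value offline with Python's stdlib. The header string and dates are hypothetical, and a real check would read the header from `urllib.request.urlopen(Request(url, method="HEAD"))`.]

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

# Hypothetical Last-Modified header, as a HEAD request might return it.
header = "Mon, 17 Nov 2003 15:23:48 GMT"

modified = parsedate_to_datetime(header)  # timezone-aware datetime
now = datetime(2008, 11, 17, tzinfo=timezone.utc)  # fixed for illustration
age_years = (now - modified).days / 365
print(f"authored {age_years:.0f} years before this message")
# → authored 5 years before this message
```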
The occasions on which someone actually crafts an HTML file these days
(not HTML generated from a Word document or a paragraph within a
content management system) and then tests that complete file on a
"current" browser are extremely rare outside the protocol development
community. Most content management systems deployed today were
developed with small sets of content tested on ancient browsers that
nobody in their right mind would install on their system today.

Successful standards efforts define a scope they intend to work within
and then reach agreement within that scope. HTML is a mark-up
language, yet HTML5's scope appears to be a whole pile of idiotic
features which have no basis for implementation whatsoever: ping,
SQL storage, websockets, workers, ... the list goes on.

You aren't defining a mark-up language, so stop calling this effort
HTML. It just ends up confusing the folks who work on the Web
architecture (the protocols for communicating between independent
implementations). The architecture is what must work across all
implementations, not just browsers. A mark-up language is a standard
that authoring tools need to adhere to, far more so than browsers,
and none of the new features in HTML5 are going to be implemented by
authoring tools.

Call it an implementation spec for browser developers. I find it odd
that folks want to define such a thing, thereby eliminating
competition from lightweight clients that don't implement all that
crap, but at least it sets the scope to something on which you might
be able to reach agreement and the rest of us can simply ignore.

>> (none of my content, for example, has ever been written by testing
>> what works in current browsers, even back in the days when
>> "current" actually meant something).
>
> My data shows that your pattern is an exception. Many, many pages on
> the web break if you don't use the complex parsing algorithm that we
> use today.

That is irrelevant to the mark-up language.
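[The distinction at issue -- what a lenient parsing algorithm accepts versus what is actually in the language -- can be seen with any recovering parser. A minimal sketch using Python's stdlib html.parser (illustrative names; this is not the HTML5 algorithm):]

```python
from html.parser import HTMLParser

# A lenient parser, like a browser, records what it can recover from
# tag soup rather than rejecting the input as invalid.
class TagLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

# Invalid per the mark-up language: neither <p> nor <b> is ever closed.
parser = TagLogger()
parser.feed("<p>still <b>renders")
parser.close()  # flush any buffered text
print(parser.events)
# → [('start', 'p'), ('data', 'still '), ('start', 'b'), ('data', 'renders')]
```

[The parser still produces a usable event stream from broken input; that recovery behavior belongs in a parser spec, not in the definition of the language itself.]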
People don't call it English when dictionary entries are randomly
selected and spit out as a stream of unrelated words. Parsing
algorithms have to deal with random streams, and only a small subset
of that has to be in the language. The rest of the algorithm can be
defined by a browser spec.

> When Netscape decided to rewrite their browser engine and use what
> has become Gecko (the engine used by Firefox), one of the biggest
> problems with taking marketshare was compatibility with existing
> pages, even though the new engine was perfectly able to parse HTML 4
> by spec.
>
> In fact, we can still see this today. While Firefox now has a
> worldwide marketshare of about 20%, our marketshare in many
> countries in Asia is tiny. Our market research data has shown that
> the main reason for that is website compatibility, even though
> Firefox parses valid HTML4 very well.
>
> So while I'm thankful that you used better development strategies
> than simply testing what works in current browsers, our data shows
> that most people don't. Unfortunately.

I disagree with your logic. When Firefox came out with a more
standards-based parser, a lot of our customers were happy to switch
to it. But now that Firefox is getting just as buggy and complex as
the other major browsers, they have no reason to switch at all.
Firefox usage hasn't increased since it decided to be no better than
the others. Instead, the original Firefox team has moved on and, in a
year or two, there will be other fresh ideas on browsing
implementations. Such has been the case for over 15 years now, and I
see no reason for it to stop just because the big four say that HTML5
is what they want to implement.

>> That's why my content doesn't have to be regenerated every six
>> months.
>
> I don't understand this statement. No content needs to be regenerated
> every six months as far as I know. Browsers don't change their
> parsing algorithms significantly.

Because "current browsers" change every six months.
In order for me to design my content by testing on current browsers,
I'd have to regenerate it every six months (more frequently during
the cycles when competition between browser vendors is relevant).

> Anytime we do change our parsing algorithm, we do it in order to
> support more pages out there. Most of those pages are very old, and
> all of them work in other browsers. And any time we change our
> parsing algorithm, we are worried about it breaking other pages on
> the web.
>
> Browser developers, more than anyone, have a reason to dislike the
> state of HTML parsing. We are the ones who have to write and debug
> the complex code to do so.
>
> The reason we parse HTML the way we do is because our customers ask
> for it. They have clearly told us several times that the reason they
> use our products is to view pages on the web. If these pages do not
> work, the browser is useless to them and they go seek other options.

If there is nothing to differentiate your software from others, then
there is no reason to build the software in the first place.

....Roy
Received on Monday, 17 November 2008 23:24:12 UTC