
Re: An HTML language specification vs. a browser specification

From: Roy T. Fielding <fielding@gbiv.com>
Date: Mon, 17 Nov 2008 15:23:48 -0800
Message-Id: <6B36CA83-4373-4724-85BE-DFC5D580F741@gbiv.com>
Cc: HTML WG <public-html@w3.org>
To: Jonas Sicking <jonas@sicking.cc>

On Nov 16, 2008, at 12:58 AM, Jonas Sicking wrote:
> Roy T. Fielding wrote:
>> On Nov 14, 2008, at 11:24 PM, Jonas Sicking wrote:
>>> How browsers parse HTML *is* how HTML must be parsed. Or at least that
>>> is the case if you presume that current HTML has been written for
>>> browsers and by testing what works in current browsers.
>> Which is obviously false.  Most content is written programmatically or
>> for tools that existed in the distant past.
>
> This is an interesting assertion. If you are correct then I stand
> corrected and we need to seriously revise how we define HTML5.
> However, before we do that I think we should try to get some data,
> as I'm (obviously :) ) less convinced than you are.
>
> What non-browser tools of the distant past was this content created
> for? Do you have any idea how much content was created for these
> tools? Or any ideas for how we would measure that?

Did I say non-browser tools of the distant past?  No.  MSIE6 is a
tool of the distant past.  Firefox 1.0 is as well.  Yet the vast
majority of non-program content out there was authored long before
those two browsers existed.  Go ahead and check the last-modified
timestamps.

You can bet that the content authored a year ago wasn't designed
for the browsers a year ago either -- it was written using an
HTML-generating tool that was designed according to some old HTML
spec, hand-authored using old snippets of HTML knowledge gleaned
from any of a hundred books on the topic (none of which are "current"),
or cut and pasted from older sites.  And, no, it wasn't done by
"testing what works in current browsers" -- most of it wasn't tested
at all because the author has no control over the software used
on the publisher's website.

The occasions in which someone actually crafts an HTML file these
days (not some generated HTML from a Word document or some paragraph
within a content management system) and then tests that complete file
on a "current" browser are extremely rare outside the protocol
development community.  Most content management systems deployed
today were developed with small sets of content tested on ancient
browsers that nobody in their right mind would install on their
system today.

Successful standards efforts define a scope they intend to work within
and then reach agreement within that scope.  HTML is a mark-up language,
yet HTML5's scope appears to be a whole pile of idiotic features which
have no basis for implementation whatsoever: ping, SQL-storage,
websockets, workers, ... the list goes on.  You aren't defining a
mark-up language, so stop calling this effort HTML.  It just ends up
confusing the folks who work on the Web architecture (the protocols
for communicating between independent implementations).  The
architecture is what must work across
all implementations, not just browsers. A mark-up language is a standard
that authoring tools need to adhere to, far more so than browsers, and
none of the new features in HTML5 are going to be implemented by
authoring tools.

Call it an implementation spec for browser developers.  I find it odd
that folks want to define such a thing, thereby eliminating competition
from lightweight clients that don't implement all that crap, but at
least it sets the scope to something on which you might be able to
reach agreement and the rest of us can simply ignore.

>> (none of my content, for
>> example, has ever been written by testing what works in current
>> browsers even back in the days when current actually meant  
>> something).
>
> My data shows that your pattern is an exception. Many many pages on
> the web break if you don't use the complex parsing algorithm that
> we use today.

That is irrelevant to the mark-up language.  People don't call it
English when dictionary entries are randomly selected and spit out
as a stream of unrelated words.  Parsing algorithms have to deal with
random streams, and only a small subset of that has to be in the
language.  The rest of the algorithm can be defined by a browser spec.
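That distinction can be made concrete with a minimal sketch (in Python, using the standard library's deliberately lenient `html.parser`; the class name and input string are made up for illustration).  A lenient parser recovers what it can from an arbitrary stream, even though only a small subset of such streams is conforming markup:

```python
from html.parser import HTMLParser

# Hypothetical illustration: a lenient parser accepts an arbitrary
# stream without error, recovering whatever structure it can.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Record every start tag the parser manages to recognize.
        self.tags.append(tag)

# Malformed input: three elements opened, none of them ever closed.
collector = TagCollector()
collector.feed("<p>unclosed <b>bold <i>nested")
print(collector.tags)  # → ['p', 'b', 'i']
```

What to do about the unclosed elements (implied end tags, reparenting, and so on) is exactly the error-recovery behavior that belongs in a browser spec, not in the definition of the language itself.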

> When Netscape decided to rewrite their browser engine and use what
> has become Gecko (the engine used by Firefox), one of the biggest
> problems with taking marketshare was compatibility with existing
> pages, even though the new engine was perfectly able to parse
> HTML 4 by spec.
>
> In fact, we can still see this today. While firefox now has a  
> worldwide marketshare of about 20%, our marketshare in many  
> countries in Asia is tiny. Our market research data has shown that  
> the main reason for that is website compatibility. Even though  
> Firefox parses valid HTML4 very well.
>
> So while I'm thankful that you used better development strategies  
> than simply testing what works in current browsers, our data shows  
> that most people don't. Unfortunately.

I disagree with your logic.  When Firefox came out with a more
standards-based parser, a lot of our customers were happy to switch
to it.  But now that Firefox is getting just as buggy and complex as
the other major browsers, they have no reason to switch at all.
Firefox usage hasn't increased since it decided to be no better
than the others.  Instead, the original Firefox team has moved on
and, in a year or two, there will be other fresh ideas on browsing
implementations.  Such has been the case for over 15 years now and
I see no reason for it to stop just because the big four say that
HTML5 is what they want to implement.

>> That's why my content doesn't have to be regenerated every six  
>> months.
>
> I don't understand this statement. No content needs to be regenerated
> every six months as far as I know. Browsers don't change their parsing
> algorithms significantly.

Because "current browsers" change every six months.  In order for me
to design my content for testing on current browsers, I'd have to
regenerate it every six months (more frequently during the cycles
when competition between browser vendors is relevant).

> Anytime we do change our parsing algorithm we do it in order to
> support more pages out there. Most of those pages are very old, and
> all of them work in other browsers. And any time we change our
> parsing algorithm we freak out, worried about it breaking other
> pages on the web.
>
> Browser developers more than anyone have reason to dislike the
> state of HTML parsing. We are the ones who have to write and debug
> the complex code to do so.
>
> The reason we parse HTML the way we do is because our customers ask  
> for it. They have clearly told us several times that the reason  
> they use our products is to view pages on the web. If these pages  
> do not work, the browser is useless to them and they go seek other  
> options.

If there is nothing to differentiate your software from others,
then there is no reason to build the software in the first place.

....Roy
Received on Monday, 17 November 2008 23:24:12 UTC
