Re: An HTML language specification vs. a browser specification from Maciej Stachowiak on 2008-11-17 (public-html@w3.org from November 2008)

From: Maciej Stachowiak <mjs@apple.com>
Date: Sun, 16 Nov 2008 16:40:05 -0800
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: Jonas Sicking <jonas@sicking.cc>, HTML WG <public-html@w3.org>
Message-id: <C2729376-BD9E-40F8-8249-CF2A47F9B97A@apple.com>

On Nov 15, 2008, at 12:14 PM, Roy T. Fielding wrote:

>
> On Nov 14, 2008, at 11:24 PM, Jonas Sicking wrote:
>> How browsers parse HTML *is* how HTML must be parsed. Or at least  
>> that
>> is the case if you presume that current HTML has been written for
>> browsers and by testing what works in current browsers.
>
> Which is obviously false.  Most content is written programatically or
> for tools that existed in the distant past

Since there is disagreement on factual premises, it seems we should  
first try to reach agreement on the facts, perhaps by performing some  
experiments.

I hold the following hypotheses, which I believe are in  
principle testable:

1) The vast majority of HTTP traffic on the Internet to public Web  
servers (counting by request or by byte transferred) has a browser as  
the client. You could test this by sniffing traffic or by surveying  
the logs of some representative servers.

2) Same claim as above, specifically as to HTTP traffic transferring  
text/html documents. Arguably, this claim as well as claim #1 have  
already been tested by any browser market share study - these also  
include non-browser user agents and generally show only a small  
traffic share for them.

3) Most text/html content on the Web displays in a way that is useful  
and meaningful to humans in a Web browser. This could be tested by  
taking a random selection of URLs from a search engine and observing  
how they display in one or more Web browsers.

4) Most text/html content on the public Web (weighted by  
popularity or, if it must be measured by document, excluding unbounded  
programmatically generated URL spaces to avoid just comparing two  
infinities) does not validate according to its declared doctype. This  
one has already been proven true by every study done on the matter.

If all of 1-4 are true, then I think the only reasonable theory to  
explain them is that most HTML on the Web is authored with the intent  
of being viewed by users in a browser, and that for most content  
authors correct appearance and behavior in browsers seems to matter  
more than compliance with the relevant specifications.

(Note: this theory consists of positive claims, not normative ones; I  
am not claiming it is a good thing that the Web operates this way. But I  
believe that it does, and that without agreement on this premise one  
way or the other we cannot have a constructive discussion.)


> (none of my content, for
> example, has ever been written by testing what works in current
> browsers even back in the days when current actually meant something).
> That's why my content doesn't have to be regenerated every six months.

HTML parsing rules don't change every 6 months, so no one has to do  
that. In fact, the reason HTML parsing rules in browsers are so weird  
is so that older content does not have to be regenerated.

> Quite frankly, the only people who hold that view of a browser-centric
> Web are the browser vendors,

Only a small minority of the HTML Working Group consists of browser  
vendors, yet a significant (indeed overwhelming) majority voted to  
adopt the HTML5 Design Principles, which establish error handling and  
backwards-compatible behavior even in the face of errors as Design  
Principles for this group. This seems to disprove your claim.

> which is why everyone else complains so much about their crappy  
> software.

I am not aware of widespread complaints about the Web content  
processing capabilities of Safari - some people complain about  
specific bugs, but if you do a Google search for WebKit you will find  
far more positive than negative comments in this regard. When there are  
complaints (or, more constructively, bug reports), they are almost  
never about the behavior of HTML parsing.


It seems to me that your evaluation of the facts is colored by a  
personal distaste for browsers and browser vendors. Browsers are a  
critical part of the Web ecosystem, and indeed among the key pieces  
of software that made the Web such a popular medium. Without  
WorldWideWeb, Mosaic, or the original Netscape, it is difficult to  
imagine the Web being anything but a research curiosity.


Regards,
Maciej
Received on Monday, 17 November 2008 00:40:47 UTC