Re: An HTML language specification vs. a browser specification

On Sun, 16 Nov 2008, Jonas Sicking wrote:
> 
> I think we can get some estimates on how much content has been created 
> for browsers by examining the number of pages in the index of the 
> various search engines. I bet Hixie could get at least an approximation of 
> the number of pages in Google's index and how many of those look like 
> they were intended to be consumed by a browser.

Based on various sources I would estimate that there are on the order of 
hundreds of billions of publicly available distinct documents intended for 
Web browsers hosted on servers on the Internet.

As far as I'm aware, based on what I've seen at Google, documents in 
Google's index are uniformly intended either for Web browsers or for 
Web search engines (the latter being pages from spam sites attempting to 
fraudulently influence search engine rankings).

Web search engines need to act as much like browsers as possible, because 
otherwise it would be possible to trick a search engine into thinking that 
the page contained one payload while browsers rendered a different set of 
content. So insofar as the HTML5 spec is concerned, search engines are 
basically equivalent to browsers, and it doesn't matter whether a page is 
aimed at the former or the latter: both should be treated as being 
targeted at the latter. (Google has found HTML5's parsing spec to be very 
useful in improving our ability to act more like browsers.)

I would be very, very interested to find out about the HTML documents that 
aren't written for browsers. If documentation on these vast repositories 
of documents that aren't targeted primarily at browsers could be made 
available, ideally with examples, I would be happy to adjust the spec's 
priorities accordingly. I'm trying to base the spec on an objective 
viewpoint, and so far the bias towards browsers, tools aimed at augmenting 
browsers, tools that act like browsers, and authors writing documents and 
applications aimed at people using browsers is there purely because, to my 
knowledge, the overwhelming majority of HTML content on the Web in fact 
falls into those categories.

Information to the contrary would be hugely helpful. Roy, if you could 
enlighten us here I would be very grateful.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 16 November 2008 23:51:16 UTC