
Re: Researching HTML usage in existing content

From: Philip Taylor <philip@zaynar.demon.co.uk>
Date: Wed, 18 Jul 2007 00:55:26 +0100
Message-ID: <469D56EE.6080001@zaynar.demon.co.uk>
To: public-html@w3.org

Henri Sivonen wrote:
> On Jul 15, 2007, at 06:28, Jon Barnett wrote:
>> Is there a convenient way for me to search the code of existing  
>> pages on the web.
> As far as I know, there isn't. Some pointers and ideas, though:
> So far, Hixie has been doing research using Google-internal  
> facilities that others can't use and Philip Taylor (the one of the  
> lazyilluminati/canvex fame) has been doing small-scale (top 500 front  
> pages and the like) research with (at the moment) private facilities.

I've put my current data at 
(The graphs are a bit broken in Firefox 2, but should work fine 
elsewhere). It may be useful for finding examples of sites that use 
certain features, since the existing surveys seem to only provide 
aggregate data and don't give any way to trace back to the source.

The code for collecting the data is at 
<http://canvex.lazyilluminati.com/svn/survey/trunk/>. It's not 
particularly easy to use, nor particularly efficient or well-designed 
(and I discovered too late that SQLite is really not good for this), but 
at least it's there. It makes use of the C++ tokeniser that I wrote 
in OCaml at <http://canvex.lazyilluminati.com/svn/tokeniser/>.

I looked at about 8000 pages (randomly selected from dmoz.org) - they 
only took 15 minutes to download and analyse, using two computers in 
parallel, so it should be pretty easy to get data about a much bigger 
number without unreasonable resource usage. I didn't do more this time, 
mainly because it was my first attempt and I just wanted to be sure it 
worked properly, and also I didn't want to worry too much about 
scalability (especially for providing interactive access to the data, 
when I don't have a decent web server to run it on), and also because 
for a lot of the data there is negligible statistical value in having a 
larger sample.

(For rare features with frequency below 1% or so, like the 'headers' 
attribute, my data is worthless - a much wider survey (like the ones 
Hixie is doing) would be necessary for that.)

dmoz.org certainly isn't the best possible source of URLs - it appears 
to be strongly biased towards English and (to a lesser extent) European 
sites, and CNN.com makes up 220K of the 4.5M links (though at least 
they're mostly old archived pages and so a single change by CNN.com's 
developers won't affect the statistics from all those pages at once). It 
would be good to gather similar results for differently-biased sets of 
pages for comparison. But it's an easy source to get access to, and it 
allows some comparisons with Rene Saarsoo's work at 
<http://triin.net/2006/06/12/HTML> from the same population slightly 
over a year ago:

Of the most common tags, some have gone up significantly: meta, script, 
div, link, span, ...
Others have gone down significantly: table, tr, td, p, font, b, center, ...
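
To make the "percentage of pages" measure concrete, here is a minimal 
sketch of how per-tag page counts could be computed. It is not the survey 
code linked above: it assumes Python's standard html.parser in place of a 
real HTML5 tokeniser, and the names TagCollector, tags_in_page and 
page_percentages, along with the two-page sample, are made up for 
illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the set of distinct element names seen in one page."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.tags.add(tag)


def tags_in_page(html):
    """Return the set of tag names that occur at least once in a page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def page_percentages(pages):
    """Map each tag to the percentage of pages that use it at least once."""
    counts = Counter()
    for html in pages:
        # A page contributes at most 1 per tag, however often it uses it.
        counts.update(tags_in_page(html))
    return {tag: 100.0 * n / len(pages) for tag, n in counts.items()}


# Hypothetical miniature sample, not survey data.
sample = [
    "<html><body><table><tr><td><font>a</font></td></tr></table></body></html>",
    "<html><head><meta name='x'></head><body><div><span>b</span></div></body></html>",
]

for tag, pct in sorted(page_percentages(sample).items()):
    print(f"{tag}: {pct:.0f}% of pages")
```

Counting each tag once per page, rather than once per occurrence, is what 
lets a single table-heavy page count the same as any other page.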

(By "significantly", I mean a few percent of the number of pages. If I'm 
not misremembering how to do statistics [please correct me if I'm 
wrong], the error at 95% confidence should be around +/- 1% of the 
number of pages, so these appear to indicate real differences within the 
population.)
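
For anyone wanting to check the arithmetic, here is a minimal sketch using 
the usual normal approximation for a sampled proportion. It assumes a 
simple random sample of 8000 pages, and the function name margin_of_error 
is made up for illustration. Under those assumptions the interval is widest 
for a feature used on half of the pages, at roughly 1.1 percentage points, 
and narrower for rarer features.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    p is the observed fraction of pages using a feature, n is the number
    of pages sampled, and z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


n = 8000  # sample size mentioned in the message
for p in (0.5, 0.1, 0.01):
    moe = 100 * margin_of_error(p, n)
    print(f"feature on {100 * p:.0f}% of pages: +/- {moe:.2f} percentage points")
```

This covers one survey only. Comparing two surveys widens the interval, 
because the sampling error of both samples contributes to the difference.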

Slightly under 80% are rendered in quirks mode now (though that's 
perhaps an underestimate since I wasn't checking that the doctype was 
close enough to the beginning of the document). The XHTML Transitional 
doctype has grown from 5% to 11%, but that's 'almost standards' in most 
browsers and only ~3% of pages use real standards mode.
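
As a rough illustration of how pages can be sorted into these three modes, 
here is a simplified sketch of doctype sniffing. It is not the survey's 
actual logic. The rules are an abridged version of the doctype switching 
that browsers apply, the list of legacy public identifiers is deliberately 
incomplete, and the names DOCTYPE_RE and rendering_mode are made up. Unlike 
the survey described above, it only accepts a doctype at the very start of 
the document (after optional whitespace), which is stricter than most 
browsers.

```python
import re

# Matches <!DOCTYPE html>, optionally with PUBLIC and system identifiers.
DOCTYPE_RE = re.compile(
    r"""<!doctype\s+html
        (?:\s+public\s+(?P<q1>["'])(?P<public>.*?)(?P=q1)
           (?:\s*(?P<q2>["'])(?P<system>.*?)(?P=q2))?
        )?
        \s*>""",
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

# Public identifier prefixes, lowercased. Abridged, not a complete list.
ALWAYS_ALMOST = (
    "-//w3c//dtd xhtml 1.0 transitional//",
    "-//w3c//dtd xhtml 1.0 frameset//",
)
HTML401_LOOSE = (
    "-//w3c//dtd html 4.01 transitional//",
    "-//w3c//dtd html 4.01 frameset//",
)
LEGACY_QUIRKS = (
    "-//w3c//dtd html 4.0 transitional//",
    "-//w3c//dtd html 4.0 frameset//",
    "-//w3c//dtd html 3.2",
    "-//ietf//dtd html",
)


def rendering_mode(html):
    """Return 'quirks', 'almost standards' or 'standards' for a page."""
    # Simplification: only a BOM or whitespace may precede the doctype.
    match = DOCTYPE_RE.match(html.lstrip("\ufeff \t\r\n"))
    if match is None:
        return "quirks"
    public = (match.group("public") or "").lower()
    has_system = match.group("system") is not None
    if public.startswith(ALWAYS_ALMOST):
        return "almost standards"
    if public.startswith(HTML401_LOOSE):
        # With a system identifier these get almost standards mode,
        # without one they fall back to quirks mode.
        return "almost standards" if has_system else "quirks"
    if public.startswith(LEGACY_QUIRKS):
        return "quirks"
    return "standards"


print(rendering_mode("<html><body>no doctype</body></html>"))
print(rendering_mode(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html></html>'
))
```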

I didn't find anyone using the HTML5 doctype, though there was one 
<canvas>. Relating to some earlier discussion about <image>, 
http://imdb.com/ is a good example of why it has to be a synonym of 
<img> for compatibility with existing content. There seems to be plenty 
of other interesting information that can come from this kind of survey, 
though it's far from perfect and it would be good to find improved or 
alternate ways to collect useful data.

Philip Taylor
Received on Tuesday, 17 July 2007 23:55:32 UTC
