- From: Philip Taylor <philip@zaynar.demon.co.uk>
- Date: Wed, 18 Jul 2007 00:55:26 +0100
- To: public-html@w3.org
Henri Sivonen wrote:
> On Jul 15, 2007, at 06:28, Jon Barnett wrote:
>
>> Is there a convenient way for me to search the code of existing
>> pages on the web.
>
> As far as I know, there isn't. Some pointers and ideas, though:
>
> So far, Hixie has been doing research using Google-internal
> facilities that others can't use and Philip Taylor (the one of the
> lazyilluminati/canvex fame) has been doing small-scale (top 500 front
> pages and the like) research with (at the moment) private facilities.

I've put my current data at <http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index>. (The graphs are a bit broken in Firefox 2, but should work fine elsewhere.) It may be useful for finding examples of sites that use certain features, since the existing surveys seem to only provide aggregate data and don't give any way to trace back to the source.

The code for collecting the data is at <http://canvex.lazyilluminati.com/svn/survey/trunk/>. It's not particularly easy to use, nor particularly efficient or well-designed (and I discovered too late that SQLite is really not good for this), but at least it's there. It makes use of the C++ tokeniser that I wrote in OCaml at <http://canvex.lazyilluminati.com/svn/tokeniser/>.

I looked at about 8000 pages (randomly selected from dmoz.org). They only took 15 minutes to download and analyse, using two computers in parallel, so it should be pretty easy to get data about a much bigger number without unreasonable resource usage. I didn't do more this time for three reasons: it was my first attempt and I just wanted to be sure it worked properly; I didn't want to worry too much about scalability (especially for providing interactive access to the data, when I don't have a decent web server to run it on); and for a lot of the data there is negligible statistical value in having a larger sample.
(For rare features with frequency below 1% or so, like the 'headers' attribute, my data is worthless - a much wider survey (like the ones Hixie is doing) would be necessary for that.)

dmoz.org certainly isn't the best possible source of URLs - it appears to be strongly biased towards English and (to a lesser extent) European sites, and CNN.com makes up 220K of the 4.5M links (though at least they're mostly old archived pages and so a single change by CNN.com's developers won't affect the statistics from all those pages at once). It would be good to gather similar results for differently-biased sets of pages for comparison. But it's an easy source to get access to, and it allows some comparisons with Rene Saarsoo's work at <http://triin.net/2006/06/12/HTML> from the same population slightly over a year ago:

Of the most common tags, some have gone up significantly: meta, script, div, link, span, ...
Others have gone down significantly: table, tr, td, p, font, b, center, ...

(By "significantly", I mean a few percent of the number of pages. If I'm not misremembering how to do statistics [please correct me if I'm wrong], the error at 95% confidence should be around +/- 0.5% of the number of pages, so these appear to indicate real differences within the population.)

Slightly under 80% are rendered in quirks mode now (though that's perhaps an underestimate since I wasn't checking that the doctype was close enough to the beginning of the document). The XHTML Transitional doctype has grown from 5% to 11%, but that's 'almost standards' in most browsers and only ~3% of pages use real standards mode. I didn't find anyone using the HTML5 doctype, though there was one <canvas>.

Relating to some earlier discussion about <image>, http://imdb.com/ is a good example of why it has to be a synonym of <img> for compatibility with existing content.
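
On the 95% confidence estimate mentioned above: the figure can be checked with the usual normal-approximation (Wald) interval for a proportion. The sketch below is not part of the survey code; the function name `margin_of_error` is made up for illustration, and it assumes a simple random sample of 8000 pages, which the dmoz.org bias described above may not strictly satisfy. Under those assumptions, +/- 0.5 percentage points holds for features seen on about 5% of pages, while the worst case (features on about half the pages) is closer to +/- 1.1 points. A difference between two separate surveys carries the uncertainty of both.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed proportion of pages using a feature (0..1)
    n: number of pages sampled
    z: critical value; 1.96 corresponds to 95% confidence
    """
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    n = 8000  # approximate sample size quoted in the message
    for p in (0.05, 0.11, 0.50, 0.80):
        moe = 100 * margin_of_error(p, n)
        print(f"p = {p:.2f}: +/- {moe:.2f} percentage points")
```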

There seems to be plenty of other interesting information that can come from this kind of survey, though it's far from perfect and it would be good to find improved or alternate ways to collect useful data.

--
Philip Taylor
philip@zaynar.demon.co.uk
Received on Tuesday, 17 July 2007 23:55:32 UTC