
Researching HTML usage in existing content (was: Re: unifying alternate content across embedded content element types)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 15 Jul 2007 23:57:57 +0300
Message-Id: <B6A628E0-9619-4A36-8B10-EF5CBF349E91@iki.fi>
Cc: "public-html@w3.org WG" <public-html@w3.org>
To: Jon Barnett <jonbarnett@gmail.com>

On Jul 15, 2007, at 06:28, Jon Barnett wrote:

> Is there a convenient way for me to search the code of existing  
> pages on the web?

As far as I know, there isn't. Some pointers and ideas, though:

So far, Hixie has been doing research using Google-internal  
facilities that others can't use, and Philip Taylor (the one of  
lazyilluminati/canvex fame) has been doing small-scale (top 500 front  
pages and the like) research with (at the moment) private facilities.

As far as I am aware, implementations of the HTML5 parsing algorithm  
are publicly available in Python, Ruby and Java. A tokenizer is  
available in C++. (Hixie uses a Google-private implementation in  
Sawzall.)
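To make the survey idea concrete, here is a minimal sketch of the kind of statistic such research produces: element-frequency counts over fetched markup. It uses only the Python standard library's html.parser as a stand-in tokenizer; note that html.parser does *not* implement the HTML5 parsing algorithm, so a real survey should swap in one of the conformant implementations mentioned above (e.g. html5lib for Python).

```python
# Sketch: tally start-tag frequencies in a page. html.parser is a
# stand-in here, NOT a conformant HTML5 parser; a real survey would
# use an implementation of the HTML5 parsing algorithm instead.
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; count each occurrence.
        self.counts[tag] += 1

def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    return parser.counts

counts = count_tags("<p><b>bold</b> and <b>bold</b></p>")
```

Aggregating such Counters across millions of pages is the core of the "HTML usage in existing content" question; the hard parts are the crawl and a parser that recovers from real-world tag soup the same way browsers do.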

For large-scale research, for performance reasons, it is a good idea  
to use an implementation that is written in a non-dynamic language  
and compiles down to native code (either ahead-of-time or  
just-in-time). According to testing by the aforementioned Philip  
Taylor, Java and C++ implementations of the tokenizer are *much*  
faster than the Python impl (which isn't at all surprising).

Considering availability outside Google, performance and  
completeness, it seems to me that my Java implementation[1] fits the  
survey use case well (the Python and [probably] Ruby impls being  
slower, the C++ impl lacking the tree builder and the Sawzall impl  
being Google-private).

I've taken a quick look at spidering frameworks that are available  
for Java. I wouldn't trust anything that uses the JDK HTTP client.  
Heritrix[2], the Internet Archive crawler, looks the most promising  
so far. It uses the Commons HttpClient.
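For anyone rolling their own spider rather than adopting Heritrix, the per-page step is link extraction and URL resolution. Below is a hedged sketch of just that step, in modern Python with stdlib modules only; the actual fetching (which Heritrix or the Commons HttpClient would handle, with robots.txt handling, politeness delays and redirect support) is deliberately out of scope.

```python
# Sketch: extract and absolutize <a href> links from one fetched page,
# the step a crawler runs to grow its frontier. Fetching, robots.txt
# and politeness are omitted; a real crawler framework handles those.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative URLs against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links

links = extract_links("http://example.org/a/",
                      '<a href="b.html">b</a> <a href="/c">c</a>')
```

The same caveat as before applies: html.parser is not an HTML5-conformant parser, so a survey crawler should extract links from the tree built by a real HTML5 parsing algorithm implementation.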

Another approach that seems interesting would be relying on the Alexa  
crawl and not running a crawler of one's own. A survey process  
running in Amazon EC2 could access the Alexa crawl in S3. I'm not  
sure if the crawl results can be read directly or if one has to get a  
handle from an Alexa search query.[3]

[1] http://www.w3.org/mid/C8358BCA-AD14-41A7-A0B4-2D9F21C1E055@iki.fi
[2] http://crawler.archive.org/
[3] http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Sunday, 15 July 2007 20:58:08 GMT
