- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 15 Jul 2007 23:57:57 +0300
- To: Jon Barnett <jonbarnett@gmail.com>
- Cc: "public-html@w3.org WG" <public-html@w3.org>
On Jul 15, 2007, at 06:28, Jon Barnett wrote:
> Is there a convenient way for me to search the code of existing
> pages on the web.

As far as I know, there isn't. Some pointers and ideas, though:

So far, Hixie has been doing research using Google-internal facilities that others can't use, and Philip Taylor (of lazyilluminati/canvex fame) has been doing small-scale research (top 500 front pages and the like) with, at the moment, private facilities.

As far as I am aware, implementations of the HTML5 parsing algorithm are publicly available in Python, Ruby and Java. A tokenizer is available in C++. (Hixie uses a Google-private implementation in Sawzall.)

For large-scale research, performance matters, so it is a good idea to use an implementation that is written in a non-dynamic language and compiles down to native code (either ahead-of-time or just-in-time). According to testing by the aforementioned Philip Taylor, the Java and C++ implementations of the tokenizer are *much* faster than the Python impl (which isn't at all surprising).

Considering availability outside Google, performance and completeness, it seems to me that my Java implementation[1] fits the survey use case well (Python and [probably] Ruby being slower, the C++ impl lacking the tree builder and the Sawzall impl being Google-private).

I've taken a quick look at spidering frameworks that are available for Java. I wouldn't trust anything that uses the JDK HTTP client. Heritrix[2], the Internet Archive crawler, looks the most promising so far. It uses the Commons HttpClient.

Another approach that seems interesting would be relying on the Alexa crawl instead of running a crawler of one's own. A survey process running in Amazon EC2 could access the Alexa crawl in S3. I'm not sure if the crawl results can be read directly or if one has to get a handle from an Alexa search query.[3]

[1] http://www.w3.org/mid/C8358BCA-AD14-41A7-A0B4-2D9F21C1E055@iki.fi
[2] http://crawler.archive.org/
[3] http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
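[Editorial note: a minimal sketch of the kind of survey harness described above, fetching one page with Commons HttpClient 3.x and feeding it to a Java HTML5 parser through the SAX XMLReader interface. The parser package/class name (nu.validator.htmlparser.sax.HtmlParser) and the thing being counted are assumptions for illustration, not part of the original mail.]

    // Hypothetical survey sketch: fetch a page with Commons HttpClient 3.x and
    // hand the response stream to an HTML5 parser exposing org.xml.sax.XMLReader.
    // The parser class name below is an assumed packaging of the Java
    // implementation mentioned in the mail; adjust it to the actual API.
    import java.io.InputStream;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    import nu.validator.htmlparser.sax.HtmlParser; // assumed class name

    public class SurveyOnePage {

        /** Counts <font> start tags as a stand-in for whatever the survey measures. */
        static class FontCounter extends DefaultHandler {
            int fontCount = 0;

            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                if ("font".equals(localName)) {
                    fontCount++;
                }
            }
        }

        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            GetMethod get = new GetMethod("http://example.org/");
            try {
                client.executeMethod(get);
                InputStream body = get.getResponseBodyAsStream();

                // Parse with the HTML5 parsing algorithm and collect statistics
                // via a plain SAX ContentHandler.
                HtmlParser parser = new HtmlParser();
                FontCounter counter = new FontCounter();
                parser.setContentHandler(counter);
                parser.parse(new InputSource(body));

                System.out.println("font elements: " + counter.fontCount);
            } finally {
                get.releaseConnection();
            }
        }
    }

A real survey would drive this from a crawler such as Heritrix (or from pages pulled out of an existing crawl) rather than fetching single URLs by hand.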
Received on Sunday, 15 July 2007 20:58:08 UTC