Plan and Infrastructure from Marcos Caceres on 2012-11-15 (public-webdevdata@w3.org from November 2012)

From: Marcos Caceres <w3c@marcosc.com>
Date: Thu, 15 Nov 2012 12:48:35 +0000
To: public-webdevdata@w3.org
Message-ID: <6C83A68A5B6040AD80D4A0AADC4355ED@marcosc.com>

Hi,
Thanks for helping get the group going. I've set up a repo on GitHub:

https://github.com/Webdevdata

And registered the domain webdevdata.org (not active yet).

=== The Plan ===
Basically, we want to recreate Opera's MAMA [1] (or similar), but extend that a little bit:

1. For each country in the world, we want to know its top 500 websites. I'm thinking we get this from Alexa. My rough guess is that there will be about 50K URLs to download * 30 = ~1,500,000 requests.

2. The above will contain duplicates when compared across countries (e.g., most western countries will list google.com, facebook.com, etc.). So we chuck duplicates out and keep one.

3. Where permitted by robots.txt, we download each webpage (we identify ourselves as IE8 or whatever).

4. We create a DOM from the data, and once the page is fully active, we then crawl the DOM to check the structure and attribute values.

5. Where images are referenced, we do a HEAD request as needed; where a file extension is given for an image, we assume the image type is what it says it is.

6. We put what we find in a database and throw away the downloaded HTML data (or we can keep it, but it's gonna be a lot).

7. From the data, we generate reports about tags, what tags are inside tags, images, and other resources, etc.

8. Rinse and repeat once a month.
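
As a rough illustration of steps 2 and 5 above, here is a minimal sketch in JavaScript. The function names, the shape of the input (an object mapping a country code to an array of hostnames), the "www." normalisation, and the extension table are all assumptions made for the example; none of them come from the prototypes repo.

```javascript
// Hypothetical helpers sketching steps 2 and 5 of the plan.

// Step 2: merge the per-country lists and keep one entry per hostname.
// Assumption: each list is an array of hostnames such as "google.com".
function dedupeSites(listsByCountry) {
  const seen = new Set();
  for (const country of Object.keys(listsByCountry)) {
    for (const host of listsByCountry[country]) {
      // Normalise case and a leading "www." so trivial variants collapse.
      seen.add(host.trim().toLowerCase().replace(/^www\./, ""));
    }
  }
  // A Set keeps insertion order, so the first occurrence wins.
  return Array.from(seen);
}

// Step 5: trust the file extension when one is given. This table is an
// illustrative assumption, not a complete list of image types.
const IMAGE_TYPES = new Map([
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["gif", "image/gif"],
  ["svg", "image/svg+xml"],
]);

// Returns a MIME type, or null when a HEAD request would be needed.
function guessImageType(src) {
  // Drop the query string and fragment before looking at the extension.
  const path = src.split(/[?#]/)[0];
  const match = /\.([a-z0-9]+)$/i.exec(path);
  if (!match) return null;
  return IMAGE_TYPES.get(match[1].toLowerCase()) || null;
}

const sites = dedupeSites({
  us: ["google.com", "facebook.com"],
  uk: ["www.google.com", "bbc.co.uk"],
});
```

With the sample input, `sites` holds three hostnames rather than four, because "www.google.com" collapses into "google.com". A `null` from `guessImageType` is the signal that the crawler would fall back to a HEAD request.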

The important distinction in our approach is that we don't just look at text files of HTML, we actually see what the DOM looks like and analyse on DOMContentLoaded or onload. The good thing there is that we can use CSS selectors or just walk the tree manually.
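
To make the "walk the tree manually" option concrete, here is a minimal sketch that counts tags and parent > child pairs (the "what tags are inside tags" report from step 7). It only assumes each node exposes `tagName` and `children`, as a DOM Element does; `countTags`, `el`, and the tiny stand-in tree are hypothetical and exist only so the example runs outside a browser.

```javascript
// Count how often each tag occurs, and how often each parent > child
// pairing occurs, by walking the tree depth-first.
// Assumption: nodes expose `tagName` and `children`, like DOM Elements.
function countTags(root) {
  const tags = new Map();
  const nesting = new Map();
  const bump = (map, key) => map.set(key, (map.get(key) || 0) + 1);

  (function walk(node) {
    const tag = node.tagName.toLowerCase();
    bump(tags, tag);
    // Array.from also accepts a live HTMLCollection in a real DOM.
    for (const child of Array.from(node.children)) {
      bump(nesting, tag + " > " + child.tagName.toLowerCase());
      walk(child);
    }
  })(root);

  return { tags, nesting };
}

// Stand-in tree (plain objects) so the sketch runs without a browser.
const el = (tagName, ...children) => ({ tagName, children });
const page = el(
  "HTML",
  el("HEAD", el("TITLE")),
  el("BODY", el("P", el("IMG")), el("P"))
);

const report = countTags(page);
console.log(report.tags.get("p"), report.nesting.get("body > p"));
```

In a browser or PhantomJS page context, the same function could presumably be called as `countTags(document.documentElement)` from a DOMContentLoaded or load handler, so the counts reflect the DOM as built rather than the raw HTML text.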

To get started, we basically just need a prototype crawler. To maximise participation, I think we should use JavaScript-based solutions (e.g., Node, PhantomJS) because that's likely what most people in this group will know (i.e., I don't know any other programming languages well enough, nor really care to learn them right now :)).

We should keep everything really simple at first: take a prototype approach and refine as we go. I've made a repo for that:
https://github.com/Webdevdata/prototypes

Lots of things we need to work out. Looking forward to working with you guys on this.

If you can, please try to get more coders involved! Let's make awesome stuff :)

[1] http://dev.opera.com/articles/view/mama/

--
Marcos Caceres
http://datadriven.com.au

Received on Thursday, 15 November 2012 12:49:05 UTC