Re: Data in HTML Crawl

Marcos, so sorry - I had missed your e-mail until just now!

On 11/21/2011 03:28 AM, Marcos Caceres wrote:
>> http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
>
> I think what we need to do is get a complete map of all the tags and
> their attributes. I don't think we need to crawl every single page
> in the index (that would probably distort the results), so I think
> what we need to do is get an idea of how many domains are in the
> index, and then do random selection of pages from those: some domains
> hold more pages than others, so if one domain holds 1,000,000+ pages
> all of which use the same PHP template (e.g., wikipedia), then that
> would screw up the results because it would be overly
> representative.

The data is in ARC format; it's not a randomly-accessible file format...
at least, not in the traditional sense. We could fudge it, but we're
talking about doing quite a bit of text-file processing.
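
To give you an idea of what fudging it would look like, here's a rough
sketch of walking an ARC file sequentially (this assumes the standard
ARC v1 record layout and gzipped segments; untested, and the names are
mine):

import gzip

def arc_records(path):
    """Yield (url, payload) pairs from a gzipped ARC v1 file, in order."""
    with gzip.open(path, 'rb') as stream:
        while True:
            header = stream.readline()
            if not header:
                break                      # end of file
            if not header.strip():
                continue                   # blank separator between records
            fields = header.split()
            if len(fields) < 5:
                continue                   # not a record header; skip it
            length = int(fields[-1])       # payload length from the header
            payload = stream.read(length)  # read exactly that many bytes
            if fields[0].startswith(b'filedesc://'):
                continue                   # the ARC file's own header record
            yield fields[0], payload

# usage: for url, body in arc_records('some-segment.arc.gz'): ...

There's no way to seek to "record number N" without reading everything
before it, which is why random sampling is awkward.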

> 1. Create a database that will hold the elements, attributes, and the
> frequency of each occurrence (element and attribute).

We get that more or less for free from map/reduce, depending on what we
end up emitting as keys.
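
Concretely: if the mapper emits one key per element (or element +
attribute) occurrence, the frequencies fall out of a plain
word-count-style reducer. A Hadoop-streaming-flavoured sketch (nothing
here is final):

# reducer.py -- sum the 1s emitted by the mapper, one line per key.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, count = line.rstrip('\n').rpartition('\t')
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, total))
        current_key, total = key, 0
    total += int(count)
if current_key is not None:
    print('%s\t%d' % (current_key, total))

Streaming hands the reducer its keys in sorted order, which is why the
simple "current key" loop is all we need.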

> 2. Pick random page from random domain.

I don't think we can do this based on the file format, but I could be wrong.

> 3. Parse page with HTML5 Lib: this will build a correct DOM for every
> document.

Doing this could increase the cost from $150 to $1000+; don't
underestimate the processing overhead required to build an HTML5 DOM
for every document. From a cost perspective, we may be better off
running regexes over the data at first.
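
For instance, a crude streaming mapper that pulls element and attribute
names out with regexes - it will miscount comments, scripts and other
edge cases, but it's cheap, and it pairs with the reducer sketched above
(the patterns are a first stab, not a spec):

# mapper.py -- emit "element" and "element attribute" keys, one per
# occurrence, for the reducer to sum.
import sys
import re

TAG_RE  = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')
ATTR_RE = re.compile(r'([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=')

for line in sys.stdin:
    for tag, rest in TAG_RE.findall(line):
        print('%s\t1' % tag.lower())                        # element
        for attr in ATTR_RE.findall(rest):
            print('%s %s\t1' % (tag.lower(), attr.lower())) # element+attr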

> 4. For each element in the document, populate the database with
> the name of the element, and each attribute. 5. If attribute has name
> found in the Wiki (about, contents, datatype, class etc.), record its
> attribute value(s).

Yep, we could do this... we've started a wiki page; have you seen it yet?

http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
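
If we do end up going the html5lib route for your items 3-5, the
per-document walk would look roughly like this (the attribute list is a
placeholder - the real one should come from that wiki page):

import html5lib

# Placeholder list of attributes whose values we want to record.
ATTRS_OF_INTEREST = set(['about', 'content', 'datatype', 'class',
                         'property', 'typeof', 'rel', 'itemprop'])

def tally(markup, counts, values):
    """Parse one document and update the element/attribute tallies in place."""
    tree = html5lib.parse(markup, treebuilder='etree')  # proper HTML5 DOM
    for element in tree.iter():
        if not isinstance(element.tag, str):
            continue                                    # skip comments etc.
        tag = element.tag.rsplit('}', 1)[-1].lower()    # drop XHTML namespace
        counts[tag] = counts.get(tag, 0) + 1
        for name, value in element.attrib.items():
            name = name.rsplit('}', 1)[-1].lower()
            counts[(tag, name)] = counts.get((tag, name), 0) + 1
            if name in ATTRS_OF_INTEREST:
                values.setdefault(name, []).append(value)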

> Round 2 could be more focused.
>
> 1. Search all domains for use of [RDFa|Microdata|Microformat] and
> recording also how many times they are not encountered (for
> balance).

We may want to do this in round 2, since crawling over the data is the
most expensive part... not necessarily doing the kind of analysis you
describe in the item above.
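
When we do get there, the per-document check itself is cheap; something
along these lines would do for a first pass (the marker patterns are
rough and will need tuning against real pages):

import re

# One pattern per syntax; a document can match several, or none at all.
MARKERS = {
    'rdfa':        re.compile(r'\b(?:property|typeof|about|vocab)\s*=', re.I),
    'microdata':   re.compile(r'\bitem(?:scope|type|prop)\b', re.I),
    'microformat': re.compile(r'class\s*=\s*["\'][^"\']*\b(?:vcard|hcard|'
                              r'vevent|hentry|hreview)\b', re.I),
}

def classify(markup):
    """Return the set of syntaxes detected in a document ('none' if empty)."""
    found = set(name for name, pattern in MARKERS.items()
                if pattern.search(markup))
    return found or set(['none'])

Emitting each member of that set as a map/reduce key would give us both
the positive counts and the "none" count for balance.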

> 2. Record the fragment structure to a given depth (<foo>
> x<bar>bah<baz>  zzz</bar>  qqq</foo>).

Hrm... multiple terabytes' worth of data, possibly... it would give us
an interesting set of data to pore through... but I'm afraid that unless
we automate the analysis, it'll be very difficult to make good
discoveries. Most of what we see will end up being effectively anecdotal
evidence.
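
If we did want to try it, one way to keep the output from exploding is
to serialise only the element skeleton down to a fixed depth and emit
that as a map/reduce key, so we count shapes rather than store raw
markup. A sketch on top of the html5lib tree from the earlier snippet
(the depth of 3 is arbitrary):

def skeleton(element, depth=3):
    """Return the tag-only structure of a subtree, down to `depth` levels."""
    tag = element.tag.rsplit('}', 1)[-1].lower()
    if depth <= 1 or len(element) == 0:
        return '<%s/>' % tag
    children = ''.join(skeleton(child, depth - 1)
                       for child in element
                       if isinstance(child.tag, str))   # skip comments
    return '<%s>%s</%s>' % (tag, children, tag)

# e.g. emit skeleton(tree, 3) as a key and let the reducer count how
# often each shallow fragment shape shows up across the crawl.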

> 3. Analyse the common usage patterns (e.g., is an address, person,
> or event marked up in a valid way?)

Also difficult... we need to figure out a way to automate this process.

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: The Need for Data-Driven Standards
http://manu.sporny.org/2011/data-driven-standards/

Received on Monday, 19 December 2011 04:41:31 UTC