Re: Crawl of RDFa/Microdata/Microformats

Hi Manu,

as I said to you personally: great job! Thanks for doing this!

I'd like to mention a regexp trick I just learnt, which may open up a
way to do what I previously thought too cumbersome: writing legible
regexps that match a combination of many attributes given in
*arbitrary order*.

The trick is to use one lookahead for each attribute, where the
lookahead matches anything up to and including that specific
attribute. Since a lookahead consumes no input, each one scans the
element from the same starting point, so the pattern matches if all
of the attributes are present, regardless of order.

Here is a working example in JavaScript (run with e.g. node):

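  // One lookahead per attribute; [^>]*? keeps each scan inside the tag.
  // The tag name is [^\s>]+ rather than \S+, so that it cannot swallow
  // a '>' and let the match begin at a preceding closing tag.
  var hrefRelTypeofPattern =
    /<[^\s>]+(?=[^>]*?\shref="(.*?)")(?=[^>]*?\srel="(.*?)")(?=[^>]*?\stypeof(?:="(.*?)")?)/;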

  var m = ' <a href="path/0">0</a> <a rel="related" \
    typeof="Item" \
    href="path/1" \
    class="info" \
    >1</a> <a href="path/2">'.match(hrefRelTypeofPattern);

  console.log({href: m[1], rel: m[2], type: m[3]});

That should print out { href: 'path/1', rel: 'related', type: 'Item' },
i.e. it only matches the link where all of @rel, @href and @typeof
are present.
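
Since the pattern is just one lookahead repeated per attribute, it
could also be generated from a list of names. The following is only a
sketch: buildAttrPattern and matchAttrs are hypothetical helpers, and
it assumes the attribute names contain no regexp metacharacters and
that every attribute has a double-quoted value. It uses [^\s>]+ for
the tag name so that a match cannot begin at a closing tag.

```javascript
// Hypothetical helper: builds a pattern with one lookahead per
// required attribute name (names assumed to be plain word characters).
function buildAttrPattern(attrNames) {
  var lookaheads = attrNames.map(function (name) {
    // Scan within the tag (no '>') up to and including name="value".
    return '(?=[^>]*?\\s' + name + '="(.*?)")';
  });
  return new RegExp('<[^\\s>]+' + lookaheads.join(''));
}

// Hypothetical helper: returns an object keyed by attribute name for the
// first element carrying all of the attributes, or null if there is none.
function matchAttrs(html, attrNames) {
  var m = html.match(buildAttrPattern(attrNames));
  if (!m) return null;
  var result = {};
  attrNames.forEach(function (name, i) {
    result[name] = m[i + 1]; // capture groups follow the order of attrNames
  });
  return result;
}

var html = '<a href="path/0">0</a> ' +
  '<a rel="related" typeof="Item" href="path/1">1</a>';

console.log(matchAttrs(html, ['href', 'rel', 'typeof']));
// prints { href: 'path/1', rel: 'related', typeof: 'Item' }
```

The capture groups come out in the order the names were listed, not
the order in which the attributes appear in the markup.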

Granted, this doesn't seem to allow for matching optional values
(since it then stops when finding *any* of the attributes, not all).
But our use cases are for matching a specific set of attributes
combined, so I now think we may actually go down this path...
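
One possible way around the optional case, offered only as a sketch
that was not part of the experiment above: give the lookahead for the
optional attribute an empty alternative, so that it still succeeds
when the attribute is missing and simply leaves its group unset. The
names hrefOptionalRelPattern and hrefAndRel are made up for this
illustration, with @href required and @rel optional.

```javascript
// Sketch only: @href is required, @rel is optional.
// The trailing "|" gives the second lookahead an empty alternative, so
// it succeeds (without capturing anything) when rel is absent.
var hrefOptionalRelPattern =
  /<[^\s>]+(?=[^>]*?\shref="(.*?)")(?=[^>]*?\srel="(.*?)"|)/;

// Hypothetical helper: returns {href, rel} for the first element that
// has an href, or null if no element does.
function hrefAndRel(tag) {
  var m = tag.match(hrefOptionalRelPattern);
  return m && {href: m[1], rel: m[2]};
}

console.log(hrefAndRel('<a rel="related" href="path/1">'));
console.log(hrefAndRel('<a href="path/2">'));
console.log(hrefAndRel('<a rel="related">'));
```

The first call should give both values, the second should leave rel
undefined, and the third should give null since the required @href is
missing.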

(And to think I believed I already had a solid knowledge of regexps.
:) Great to know there are still secrets to unravel!)

Best regards,
Niklas


On Mon, Feb 6, 2012 at 6:02 AM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> As a part of the research to see how RDFa is currently being used in the
> wild, we had a plan to use the Common Crawl data set to analyze RDFa,
> Microdata and Microformats usage. I took some time last week to start
> that work, here are the findings:
>
> http://manu.sporny.org/2012/structured-data-searching/
>
> -- manu
>
> --
> Manu Sporny (skype: msporny, twitter: manusporny)
> Founder/CEO - Digital Bazaar, Inc.
> blog: PaySwarm vs. OpenTransact Shootout
> http://manu.sporny.org/2011/web-payments-comparison/
>

Received on Friday, 17 February 2012 16:33:50 UTC