Re: Web Crawl Regexes for RDFa


2011/11/13 Niklas Lindström <>

> Hi!
> I've been thinking a bit about this. While we might get somewhere
> using regexps, they have to get quite complex to handle the random
> order in which attributes appear combined with our needs of matching
> *missing* attributes (such as "@typeof and @property on the same
> element, but not any other RDFa property"). Also the engine must treat
> them as multiline to handle elements with linebreaks between or within
> attributes.

(That's taken care of by the multiline regex mode.)
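
To make the discussion concrete, here is a minimal sketch in Python (my choice for illustration, nothing in this thread prescribes a language) of a regex that uses lookaheads so that attribute order does not matter, plus a negative lookahead for the "missing" attributes. The list of excluded attributes is only an assumed subset of the RDFa attributes, and the pattern assumes that no attribute value contains a literal '>'. In Python the negated class [^>] matches line breaks on its own, so this particular pattern needs no mode flag for tags that span lines.

```python
import re

# Assumption: a partial list of "other" RDFa attributes, for illustration only.
OTHER_RDFA = ("about", "rel", "rev", "resource")

TYPEOF_AND_PROPERTY_ONLY = re.compile(
    r"""
    <(?P<tag>[A-Za-z][\w:-]*)\b      # opening tag name
    (?=[^>]*\stypeof\s*=)            # typeof appears somewhere in the tag
    (?=[^>]*\sproperty\s*=)          # property appears somewhere in the tag
    (?![^>]*\s(?:%s)\s*=)            # none of the other attributes appear
    [^>]*>                           # rest of the tag, may span lines
    """ % "|".join(OTHER_RDFA),
    re.VERBOSE | re.IGNORECASE,
)


def matching_tags(html):
    """Return the names of tags with typeof and property but no other listed attribute."""
    return [m.group("tag") for m in TYPEOF_AND_PROPERTY_ONLY.finditer(html)]


# Hypothetical sample: attributes in "reverse" order, with a line break in the tag.
sample = (
    '<div\n   property="foaf:name" typeof="foaf:Person">x</div>'
    '<span typeof="a" property="b" about="#me">y</span>'
)
print(matching_tags(sample))
```

Only the div should be reported here, since the span also carries @about. Every extra condition adds another lookahead, which is where the complexity comes from.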

I agree that with more than two attributes per tag, the regular expressions
get complex (though the permutations could be scripted). I think XPath is a
good language for abstracting these regexes, especially when several
attributes are involved. We use XPath in the Drupal 7 tests, for example [1].
This expression:

'//a[@typeof="sioc:UserAccount" and @about=:account-uri and

matches 'a' elements that have certain values in @typeof, @about and
@property. The other benefit of XPath is that you can match beyond the tag,
for example to find all tags matching a certain condition that are nested in
another tag matching some other condition.

I know at some point I said XPath might be too much overhead compared to
plain regexes when parsing lots of HTML documents, but as the regexes get
more complicated, I've changed my mind :) I do not know the actual overhead
of XPath compared to plain regex matching, but maybe the pipeline could
include a first regex pass, followed by a second XPath pass only for
documents where the first pass matches certain regex criteria.
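
A sketch of what such a two-pass filter might look like, again in Python with only the standard library. The attribute list in the first pass is an assumed subset, the function name is hypothetical, and documents that fail to parse as XML are simply skipped here, which a real crawl over tag-soup HTML could not afford to do.

```python
import re
import xml.etree.ElementTree as ET

# First pass: cheap textual check for any RDFa-looking attribute.
# Assumption: this attribute list is an illustrative subset.
RDFA_HINT = re.compile(
    r"\s(?:typeof|property|about|rel|rev|resource)\s*=", re.IGNORECASE
)


def count_typed_properties(doc):
    """Count descendant elements that carry both typeof and property."""
    if not RDFA_HINT.search(doc):
        return 0  # first pass failed: skip the expensive parse entirely
    try:
        root = ET.fromstring(doc)
    except ET.ParseError:
        return 0  # assumption: malformed documents are skipped in this sketch
    # Second pass: structural query, independent of attribute order.
    # Note that ".//*" covers descendants only, not the root element itself.
    return len(root.findall(".//*[@typeof][@property]"))


plain = "<html><body><p>hello</p></body></html>"
rdfa = (
    '<html><body><span property="foaf:knows" typeof="foaf:Person">x</span>'
    "</body></html>"
)
print(count_typed_properties(plain), count_typed_properties(rdfa))
```

Since @rel is common in ordinary HTML, the first pass as written would let many non-RDFa pages through; how selective it needs to be depends on the real cost of the second pass, which I have not measured.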



> I'm not saying it can't be done, but I'm wondering if the EC2 Hadoop
> setup can be leveraged to do something a bit more structured.
> The Amazon Elastic MapReduce tutorials mention means for running
> Python, Ruby or PHP in the map step, so I expect it might be. Perhaps
> using xsltproc (with the "--html" option, or with a tidy in front of
> it) is possible as well. I chose that (since it is very fast) to make a
> simple example. The result is an XSLT which at the moment creates TSV
> lines with statistics for each element using RDFa (attributes used, is
> there an active hanging rel, etc.). This could be piped to a reduce
> algorithm for computing answers to the questions we need, or be
> adapted to something more directly usable.
> I put this as a gist here:
> (I've run the script against a local copy of the RDFa testsuite,
> downloaded using the RDFLib test script [1].)
> Just a thought.
> Best regards,
> Niklas
> [1]:
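
For comparison, here is how I picture the same map/reduce shape without XSLT. This is not Niklas's stylesheet: it swaps in Python's standard-library html.parser, and the attribute set, the TSV layout (document id, tag, attributes used) and the function names are all assumptions of mine. The map step emits one TSV line per element that uses RDFa attributes; the reduce step counts how often each attribute combination occurs.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: an illustrative set of RDFa attribute names.
RDFA_ATTRS = {"about", "typeof", "property", "rel", "rev",
              "resource", "content", "datatype"}


class RdfaMapper(HTMLParser):
    """Collect one TSV line per start tag that uses any RDFa attribute."""

    def __init__(self, doc_id):
        super().__init__()
        self.doc_id = doc_id
        self.lines = []

    def handle_starttag(self, tag, attrs):
        used = sorted({name for name, _ in attrs} & RDFA_ATTRS)
        if used:
            self.lines.append(f"{self.doc_id}\t{tag}\t{','.join(used)}")


def map_document(doc_id, html):
    """Map step: turn one document into TSV lines."""
    mapper = RdfaMapper(doc_id)
    mapper.feed(html)
    mapper.close()
    return mapper.lines


def reduce_lines(lines):
    """Reduce step: count occurrences of each attribute combination."""
    counts = Counter()
    for line in lines:
        _doc_id, _tag, used = line.split("\t")
        counts[used] += 1
    return counts


# Hypothetical sample document.
lines = map_document(
    "doc1",
    '<div typeof="foaf:Person" about="#me">'
    '<span property="foaf:name">A</span><p>no rdfa</p></div>',
)
print(lines)
print(reduce_lines(lines))
```

Because @rel and @content also appear in plain HTML (link and meta elements), raw counts like these would overstate RDFa usage unless the reduce step filters further.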
> On Tue, Nov 8, 2011 at 5:40 PM, Manu Sporny <> wrote:
> > I started a page for the new Web Crawl Regexes that will measure RDFa
> > usage in the wild, and give us a better idea if the RDFa Lite changes
> > we're thinking of making will break existing content out there:
> >
> > The page is hosted in the Data Driven Standards WG wiki, so you'll have
> > to join that group if you want to edit the wiki:
> >
> >
> >
> > There isn't much there right now, but it's a start. The plan is to turn
> > these regexes into a Hadoop map/reduce job and run it on the Amazon
> > Elastic Map Reduce infrastructure on the Common Crawl dataset (5 billion
> > web pages, tens of terabytes of web page data).
> >
> > -- manu
> >
> > --
> > Manu Sporny (skype: msporny, twitter: manusporny)
> > Founder/CEO - Digital Bazaar, Inc.
> > blog: Standardizing Payment Links - Why Online Tipping has Failed
> >
> >
> >

Received on Sunday, 13 November 2011 21:47:09 UTC