
Re: Web Crawl Regexes for RDFa

From: Stéphane Corlosquet <scorlosquet@gmail.com>
Date: Sun, 13 Nov 2011 16:46:31 -0500
Message-ID: <CAGR+nnE2qAujcKLegSQYGCGpq=+BzHgPZJxkCHZgDc0DH_iSAg@mail.gmail.com>
To: Niklas Lindström <lindstream@gmail.com>
Cc: Manu Sporny <msporny@digitalbazaar.com>, RDFa WG <public-rdfa-wg@w3.org>

2011/11/13 Niklas Lindström <lindstream@gmail.com>

> Hi!
> I've been thinking a bit about this. While we might get somewhere
> using regexps, they have to get quite complex to handle the random
> order in which attributes appear combined with our needs of matching
> *missing* attributes (such as "@typeof and @property on the same
> element, but not any other RDFa property"). Also the engine must treat
> them as multiline to handle elements with linebreaks between or within
> attributes.

(That's taken care of by the multiline regex mode.)
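
To make the attribute-order and missing-attribute problem concrete, here is a rough Python sketch of one such regex. It uses lookaheads so that attribute order does not matter, and a negative lookahead for the attributes that must be absent. The OTHER_RDFA_ATTRS list and the function name are my own assumptions for illustration, not something agreed on in this thread. The pattern is also naive: it will be fooled by a ">" or a stray "typeof=" inside an attribute value.

```python
import re

# Hypothetical list of the "other" RDFa attributes that must be absent.
OTHER_RDFA_ATTRS = ("about", "rel", "rev", "resource", "datatype", "content")

# Lookaheads make the match independent of attribute order; the negative
# lookahead rejects tags carrying any attribute from OTHER_RDFA_ATTRS.
# [^>] and \s already match newlines in Python, so start tags that span
# several lines are covered by this particular pattern.
TYPEOF_AND_PROPERTY_ONLY = re.compile(
    r"<[A-Za-z][A-Za-z0-9]*"
    r"(?=[^>]*\stypeof\s*=)"
    r"(?=[^>]*\sproperty\s*=)"
    r"(?![^>]*\s(?:" + "|".join(OTHER_RDFA_ATTRS) + r")\s*=)"
    r"[^>]*>",
    re.IGNORECASE,
)


def find_typeof_property_tags(html):
    """Return start tags with @typeof and @property but none of the other listed attributes."""
    return TYPEOF_AND_PROPERTY_ONLY.findall(html)
```

Every extra required or forbidden attribute adds another lookahead, which is where the complexity comes from.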

I agree that with more than two attributes per tag, the regular expressions
get complex (though the permutations could be scripted). I think XPath is a
good language for abstracting these regexes, especially when several
attributes are involved. We use XPath in the Drupal 7 tests, for example [1].
This expression, for instance:

'//a[@typeof="sioc:UserAccount" and @about=:account-uri and

matches 'a' elements which have certain values in the @typeof, @about and
@property attributes. The other benefit of XPath is that you can match beyond
the tag, for example find all tags matching a certain condition nested in
another tag matching some other condition.
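
As a rough illustration of both points, here is a small Python sketch. It does not reproduce the Drupal expression above, which is cut off in this message. It uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath and has no "and" operator, so the conditions are written as chained predicates instead. The sample markup and URIs are made up, and the input has to be well-formed XML for this to work.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment; not taken from Drupal's tests.
SAMPLE = """
<div>
  <div typeof="sioc:Post" about="/node/1">
    <a typeof="sioc:UserAccount" about="/user/1" property="foaf:name">alice</a>
    <a href="/user/2">bob</a>
  </div>
  <a typeof="sioc:UserAccount" about="/user/3" property="foaf:name">carol</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# Chained predicates act as a logical AND over the attributes.
accounts = root.findall(
    ".//a[@typeof='sioc:UserAccount'][@about='/user/1'][@property]"
)

# Matching beyond a single tag: 'a' elements with @property nested
# anywhere inside a 'div' that carries @typeof.
nested = root.findall(".//div[@typeof]//a[@property]")
```

A full XPath 1.0 engine would let you write the same conditions with "and" and also test for missing attributes with not(@attr), which this subset cannot express.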

I know at some point I said XPath might be too much overhead when parsing
lots of HTML documents compared to plain regexes, but as the regexes get more
complicated, I've changed my mind :) I do not know the actual overhead of
XPath compared to plain regex matching, but maybe the pipeline could
include a first regex pass, and a second XPath pass only if the first pass
matches certain regex criteria.
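
A minimal sketch of that two-pass idea in Python follows. Note the substitution: for the second pass it uses the standard library's HTMLParser as a stand-in for a real XPath engine, because it copes with non-well-formed HTML without extra dependencies. The attribute set, the prefilter pattern and all names are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical set of attribute names treated as RDFa markers.
RDFA_ATTRS = {"about", "typeof", "property", "rel", "rev", "resource"}

# First pass: cheap check on the raw text. Only 'typeof' and 'property'
# are tested (an assumption), since 'rel' is common in non-RDFa pages.
PREFILTER = re.compile(r"\s(?:typeof|property)\s*=", re.IGNORECASE)


class RdfaAttrCollector(HTMLParser):
    """Second pass: record which RDFa attributes each start tag carries."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag, sorted RDFa attribute names)

    def handle_starttag(self, tag, attrs):
        used = sorted({name for name, _ in attrs if name in RDFA_ATTRS})
        if used:
            self.elements.append((tag, used))


def analyze(html):
    """Return per-element RDFa attribute usage, or [] if the prefilter misses."""
    if not PREFILTER.search(html):
        return []  # skip the more expensive parse entirely
    collector = RdfaAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```

Whether the prefilter pays off depends on how many crawled pages contain no RDFa at all, which is something only a measurement on the real data can tell.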



> I'm not saying it can't be done, but I'm wondering if the EC2 Hadoop
> setup can be leveraged to do something a bit more structured.
> The Amazon Elastic MapReduce tutorials mention means for running
> Python, Ruby or PHP in the map step, so I expect it might be. Perhaps
> using xsltproc (with the "--html" option, or with a tidy in front of
> it) is possible as well. I chose that (since it is very fast) to make a
> simple example. The result is an XSLT which at the moment creates TSV
> lines with statistics for each element using RDFa (attributes used, is
> there an active hanging rel, etc.). This could be piped to a reduce
> algorithm for computing answers to the questions we need, or be
> adapted to something more directly usable.
> I put this as a gist here:
>    https://gist.github.com/1362314
> (I've run the script against a local copy of the RDFa testsuite,
> downloaded using the RDFLib test script [1].)
> Just a thought.
> Best regards,
> Niklas
> [1]: http://code.google.com/p/rdflib/source/browse/test/rdfa/run_w3c_rdfa_testsuite.py
> On Tue, Nov 8, 2011 at 5:40 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> > I started a page for the new Web Crawl Regexes that will measure RDFa usage
> > in the wild, and give us a better idea if the RDFa Lite changes we're
> > thinking of making will break existing content out there:
> >
> > The page is hosted in the Data Driven Standards WG wiki, so you'll have to
> > join that group if you want to edit the wiki:
> >
> > http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
> >
> > There isn't much there right now, but it's a start. The plan is to turn
> > these regexes into a Hadoop map/reduce job and run it on the Amazon Elastic
> > Map Reduce infrastructure on the Common Crawl dataset (5 billion web pages,
> > tens of terabytes of web page data).
> >
> > -- manu
> >
> > --
> > Manu Sporny (skype: msporny, twitter: manusporny)
> > Founder/CEO - Digital Bazaar, Inc.
> > blog: Standardizing Payment Links - Why Online Tipping has Failed
> > http://manu.sporny.org/2011/payment-links/
> >
> >
Received on Sunday, 13 November 2011 21:47:09 UTC
