- From: Stéphane Corlosquet <scorlosquet@gmail.com>
- Date: Sun, 13 Nov 2011 16:46:31 -0500
- To: Niklas Lindström <lindstream@gmail.com>
- Cc: Manu Sporny <msporny@digitalbazaar.com>, RDFa WG <public-rdfa-wg@w3.org>
- Message-ID: <CAGR+nnE2qAujcKLegSQYGCGpq=+BzHgPZJxkCHZgDc0DH_iSAg@mail.gmail.com>
Hi,

2011/11/13 Niklas Lindström <lindstream@gmail.com>

> Hi!
>
> I've been thinking a bit about this. While we might get somewhere
> using regexps, they have to get quite complex to handle the random
> order in which attributes appear combined with our needs of matching
> *missing* attributes (such as "@typeof and @property on the same
> element, but not any other RDFa property"). Also the engine must treat
> them as multiline to handle elements with linebreaks between or within
> attributes.
>

(That's taken care of by the multiline regex mode.)

I agree that with more than two attributes per tag, the regular
expressions get complex (though the permutations could be scripted). I
think XPath is a good language to abstract these regexes, especially for
handling several attributes. We use XPath in Drupal 7 for the tests, for
example [1]. This expression, for example:

  '//a[@typeof="sioc:UserAccount" and @about=:account-uri and @property="foaf:name"]'

matches 'a' elements which have certain values in @typeof, @about and
@property. The other benefit of XPath is that you can match beyond the
tag, for example find all tags matching a certain condition nested in
another tag matching some other condition.

I know at some point I said XPath might be too much overhead when parsing
lots of HTML documents compared to plain regex, but as the regexes get
more complicated, I've changed my mind :) I do not know the actual
overhead of XPath compared to plain regex matching, but maybe the
pipeline could include a first regex pass, and a second XPath pass if the
first pass matches certain regex criteria.

Steph.

[1] http://drupalcode.org/project/drupal.git/blob/refs/heads/7.x:/modules/rdf/rdf.test

>
> I'm not saying it can't be done, but I'm wondering if the EC2 Hadoop
> setup can be leveraged to do something a bit more structured.
>
> The Amazon Elastic MapReduce tutorials mention means for running
> Python, Ruby or PHP in the map step, so I expect it might be. Perhaps
> using xsltproc (with the "--html" option, or with a tidy in front of
> it) is possible as well. I chose that (since it is very fast) to make a
> simple example. The result is an XSLT which at the moment creates TSV
> lines with statistics for each element using RDFa (attributes used, is
> there an active hanging rel, etc.). This could be piped to a reduce
> algorithm for computing answers to the questions we need, or be
> adapted to something more directly usable.
>
> I put this as a gist here:
>
> https://gist.github.com/1362314
>
> (I've run the script against a local copy of the RDFa testsuite,
> downloaded using the RDFLib test script [1].)
>
> Just a thought.
>
> Best regards,
> Niklas
>
> [1]: http://code.google.com/p/rdflib/source/browse/test/rdfa/run_w3c_rdfa_testsuite.py
>
>
> On Tue, Nov 8, 2011 at 5:40 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> > I started a page for the new Web Crawl Regexes that will measure RDFa usage
> > in the wild, and give us a better idea if the RDFa Lite changes we're
> > thinking of making will break existing content out there:
> >
> > The page is hosted in the Data Driven Standards WG wiki, so you'll have to
> > join that group if you want to edit the wiki:
> >
> > http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
> >
> > There isn't much there right now, but it's a start. The plan is to turn
> > these regexes into a Hadoop map/reduce job and run it on the Amazon Elastic
> > Map Reduce infrastructure on the Common Crawl dataset (5 billion web pages,
> > tens of terabytes of web page data).
> >
> > -- manu
> >
> > --
> > Manu Sporny (skype: msporny, twitter: manusporny)
> > Founder/CEO - Digital Bazaar, Inc.
> > blog: Standardizing Payment Links - Why Online Tipping has Failed
> > http://manu.sporny.org/2011/payment-links/
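
For illustration only, here is a rough Python sketch of the two-pass idea
from the reply above: a cheap regex pre-filter, then attribute matching
with XPath-style queries only on documents that pass. It is not the
Drupal test code nor the xsltproc gist. The names RDFA_HINT,
OTHER_RDFA_ATTRS, scan() and the sample markup are all made up for this
example, and the list of "other RDFa attributes" is an assumption. It
uses the standard library's ElementTree, which supports only a subset of
XPath (no "and", so predicates are chained) and needs well-formed input,
so real crawl data would require an HTML-tolerant parser in front of it.

```python
import re
import xml.etree.ElementTree as ET

# First pass (hypothetical criterion): does the text mention any RDFa
# attribute at all? Line breaks inside a tag do not matter here because
# only the attribute name followed by "=" is searched for.
RDFA_HINT = re.compile(r"\b(?:typeof|property|about|rel|rev|resource)\s*=")

# Assumed list of attributes that must be absent for the
# "@typeof and @property, but no other RDFa attribute" case.
OTHER_RDFA_ATTRS = ("about", "rel", "rev", "resource", "content", "datatype")


def find_user_accounts(root, account_uri):
    # Chained predicates stand in for the "and" of full XPath.
    # account_uri is assumed to contain no single quotes.
    path = (".//a[@typeof='sioc:UserAccount']"
            "[@about='%s'][@property='foaf:name']" % account_uri)
    return root.findall(path)


def typeof_and_property_only(root):
    # XPath finds elements carrying both attributes; the check for
    # *missing* attributes is done in Python on each candidate.
    hits = root.findall(".//*[@typeof][@property]")
    return [el for el in hits
            if not any(name in el.attrib for name in OTHER_RDFA_ATTRS)]


def scan(document, account_uri):
    # Second pass runs only when the regex pre-filter matches.
    if not RDFA_HINT.search(document):
        return [], []
    root = ET.fromstring(document)
    return find_user_accounts(root, account_uri), typeof_and_property_only(root)


SAMPLE = """<div>
  <a typeof="sioc:UserAccount"
     about="/user/1"
     property="foaf:name" href="/user/1">alice</a>
  <span typeof="foaf:Person" property="foaf:knows">bob</span>
  <span typeof="foaf:Person" property="foaf:knows" rel="foaf:made">carol</span>
</div>"""

if __name__ == "__main__":
    accounts, bare = scan(SAMPLE, "/user/1")
    print([el.text for el in accounts])
    print([el.text for el in bare])
```

Attribute order and line breaks inside the tag make no difference to the
second pass, which is the main point in favour of querying a parsed tree
rather than extending the regexes.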
Received on Sunday, 13 November 2011 21:47:09 UTC