XPath tips from the web scraping trenches from carmen on 2014-07-17 (public-webize@w3.org from July 2014)

From: carmen <_@whats-your.name>
Date: Thu, 17 Jul 2014 18:23:15 +0000
To: public-webize@w3.org
Message-ID: <20140717182315.GA2989@x.clearwire-wmx.net>

  >  http://blog.scrapinghub.com/

XPath vs. CSS selectors: often overlooked, a la XML vs. JSON.

on the other side of the fence, there is
 http://treesheets.org/   CSS selectors -> JSON

would you do a JSON-LD mapping-frame, to get RDF?
or fork treesheets to "graph sheets"

with perhaps a CSS selector for a resource's descriptive zone
and tuples of (CSSSelector, PredicateURI) to finish the triple
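A minimal sketch of what such a graph sheet might look like as plain data, plus a tiny interpreter for it. Everything here is an assumption for illustration: the sheet layout, the `sheet_triples` name, and the predicate URIs are made up for this example, and the selectors are borrowed from the Twitter code further down. The interpreter only expects a document object that answers `css`, `at_css`, `[]` and `text` the way Nokogiri nodes do.

```ruby
# Hypothetical graph sheet: one resource selector, a subject rule, and
# (CSS selector, predicate URI) pairs. The layout is illustrative only.
sheet = {
  resource: 'div.tweet',              # the resource's descriptive zone
  subject:  ['a.details', 'href'],    # selector + attribute holding the subject URI
  rules: [
    # predicate URIs below are placeholders; swap in your own vocabulary
    ['.fullname',   'http://xmlns.com/foaf/0.1/name'],
    ['.tweet-text', 'http://rdfs.org/sioc/ns#content']
  ]
}

# Walks a document that responds to #css / #at_css (as Nokogiri nodes do)
# and yields one subject, predicate, object triple per matching rule.
def sheet_triples(doc, sheet, base)
  return to_enum(:sheet_triples, doc, sheet, base) unless block_given?
  selector, attribute = sheet[:subject]
  doc.css(sheet[:resource]).each do |zone|
    link = zone.at_css(selector)
    next unless link                  # no subject link, skip this zone
    subject = base + link[attribute]
    sheet[:rules].each do |css, predicate|
      node = zone.at_css(css)
      next unless node                # rule did not match inside this zone
      yield subject, predicate, node.text.strip
    end
  end
end
```

With Nokogiri installed, something like `sheet_triples(Nokogiri::HTML(html), sheet, 'https://twitter.com') { |s, p, o| ... }` should work. Literal objects are all this sketch handles; URI-valued objects and dates would need an extra column in each rule.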

as some example fodder, convert this (Ruby) twitter-RDF-er to a graph-sheet (and share your graph-sheet github URI with the list? :)

[0] base = 'https://twitter.com' # base URI
    nokogiri.css('div.tweet').map{|t| # resource selector
      s = base + t.css('a.details').attr('href') # subject URI
      yield s,  Type,               R[SIOCt+'MicroblogPost']
      yield s,  Type,               R[SIOC+'Post']
      yield s,  Creator,            R(base+'/'+t.css('.username b')[0].inner_text)
      yield s,  Name,               t.css('.fullname')[0].inner_text
      yield s,  Atom+"/link/image", R(t.css('.avatar')[0].attr('src'))
      yield s,  Date,               Time.at(t.css('[data-time]')[0].attr('data-time').to_i).iso8601

      content = t.css('.tweet-text')[0]
      content.css('a').each{|a| a.set_attribute 'href', URI.join(base, a.attr('href')).to_s } # absolutize links
      yield s, Content, CleanHTML[content.inner_html]}
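For comparison, a rough sketch of the JSON-LD route asked about earlier: pull the same fields out as plain JSON, then attach a context that maps each key to a predicate URI. The sample values, the key names, the `to_jsonld` helper, and the choice of predicate URIs are all assumptions for illustration; the `Type`, `Creator`, `Name`, `Date` and `Content` constants in the code above are not defined in this message, so the URIs here are guesses at comparable terms.

```ruby
require 'json'

# Plain JSON as a CSS-selector extractor might emit it (made-up sample values).
tweet = {
  'id'      => 'https://twitter.com/example/status/1',
  'creator' => 'https://twitter.com/example',
  'name'    => 'Example User',
  'date'    => '2014-07-17T18:23:15Z',
  'content' => 'hello'
}

# Hypothetical JSON-LD context: each JSON key is mapped to a predicate URI.
context = {
  'id'      => '@id',   # alias the key that carries the subject URI
  'creator' => { '@id' => 'http://rdfs.org/sioc/ns#has_creator', '@type' => '@id' },
  'name'    => 'http://xmlns.com/foaf/0.1/name',
  'date'    => 'http://purl.org/dc/terms/date',
  'content' => 'http://rdfs.org/sioc/ns#content'
}

# Wraps a plain record with a context and a type so that a JSON-LD
# processor could expand it to RDF triples.
def to_jsonld(record, context, type)
  { '@context' => context, '@type' => type }.merge(record)
end

doc = to_jsonld(tweet, context, 'http://rdfs.org/sioc/types#MicroblogPost')
puts JSON.pretty_generate(doc)
```

The extractor stays ignorant of RDF in this variant; all vocabulary choices live in the context, which plays roughly the role of the (CSSSelector, PredicateURI) tuples' second column.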

Received on Thursday, 17 July 2014 18:23:45 UTC