Re: Extracting RDF from XHTML table with XSLT

From: Seth Russell <seth@robustai.net>
Date: Sun, 22 Sep 2002 09:50:53 -0700
Message-ID: <3D8DF4ED.8090705@robustai.net>
To: rdfig <www-rdf-interest@w3.org>

Masahide Kanzaki wrote:

 >I have several ideas on extracting (screen scraping) metadata from
 >ordinary XHTML documents, without requiring any special syntax. They would
 >be more useful for automated processing if we could tell agents, with a
 >media="meta" attribute, which XSLT should be used for metadata extraction.
 >
Good Idea :)

 >(If an XHTML author uses multi-word <th>, such as "Publishing Place", this
 >approach fails, but there will be some ways to manage these situations)
 >
May I suggest that there is a fairly obvious transformation of
multi-word phrases to RDF-compatible IDs: remove punctuation, lower-case
the phrase (which will produce more matches), and substitute '_' for
each space. Where a phrase contains a quotation, substitute '-' for each
space inside the quotation.

For example:

Publishing Place  --> publishing_place

"For Whom the Bell Tolls" by Ernest Hemingway  -->
for-whom-the-bell-tolls_by_ernest_hemingway
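
The transformation described above could be sketched roughly as follows
in Python. This is only an illustration under some assumptions of mine:
the function name phrase_to_id is hypothetical, "quotation" is taken to
mean text between double quotes, and "punctuation" is taken to mean any
character that is neither a word character nor whitespace.

```python
import re


def phrase_to_id(phrase: str) -> str:
    """Turn a multi-word phrase into an ID (hypothetical helper).

    Assumptions: quotations are delimited by double quotes, and
    punctuation is anything that is not a word character or whitespace.
    """
    # Splitting on '"' puts quoted text at the odd-numbered positions.
    parts = phrase.split('"')
    tokens = []
    for i, part in enumerate(parts):
        # Remove punctuation, then lower-case so more phrases match.
        cleaned = re.sub(r"[^\w\s]", "", part).lower()
        words = cleaned.split()
        if not words:
            continue
        if i % 2 == 1:
            # Inside a quotation: join the words with '-'.
            tokens.append("-".join(words))
        else:
            # Outside a quotation: each word is joined later with '_'.
            tokens.extend(words)
    return "_".join(tokens)


print(phrase_to_id("Publishing Place"))
print(phrase_to_id('"For Whom the Bell Tolls" by Ernest Hemingway'))
```

An unbalanced quote is treated here as opening a quotation that runs to
the end of the phrase; that choice is mine and may not suit every case.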

IDs of this nature have the advantage of being round-trippable between
RDF, N3, Semenglish, and, I believe, KIF.  Yet they are very
human-friendly too!

Seth Russell
http://robustai.net/sailor/
Received on Sunday, 22 September 2002 12:51:34 GMT
