Extracting RDF from XHTML table with XSLT

Hello,

I have several ideas on extracting (screen scraping) meta data from
ordinary XHTML documents, without requiring any special syntax. They would
be more useful for automated processing if we could tell agents, with a
media="meta" attribute, which XSLT stylesheet should be used for meta data
extraction.


1. Meta data extraction from XHTML table
----------------------------------------

There have been similar approaches, such as Dan Connolly's 'HyperRDF' [1],
Sean B. Palmer's 'XSLT XHTML to RDF Extractor' [2] and Edd Dumbill's
article on XML.com [3]. My idea is a sort of generalization of them, in the
sense that it does not require any special syntax or attribute to be
introduced. This means anyone can publish meta data simply by writing
valid XHTML.

Suppose we have an XHTML table like:

<table>
 <caption>Books on Whisky</caption>
 <tr><th>Title</th><th>Author</th><th>Place</th></tr>
 <tr><td>The Original Scotch</td>
     <td><a href="http://example.org/who/brander">Brander, Michael</a></td>
     <td>London</td></tr>
 ...

The idea is to use each <th> element as the property for the corresponding
<td> elements. If a <td> element contains an <a> element with a hyperlink,
the link target becomes a resource object. Otherwise, the cell content
becomes a literal object.

With an appropriate XSLT stylesheet (see sample [4]), the first data row
can be converted to:

<rdf:Description>
 <wn:Title>The Original Scotch</wn:Title>
 <wn:Author rdf:resource="http://example.org/who/brander"
            rdfs:label="Brander, Michael"/>
 <wn:Place>London</wn:Place>
</rdf:Description>

where the wn: prefix is bound to the WordNet namespace. If this <table>
element has, say, a 'class="Book"' attribute, then the <rdf:Description>
element is replaced by <wn:Book>, making it a typed node.
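
To make the mapping rules above concrete, here is a minimal sketch in
Python. It is not the XSLT stylesheet in [4], only an illustration of the
same rules under some assumptions: the table is in the XHTML namespace,
the first row holds single-word <th> headings, and the WordNet namespace
URI (WN below) is a placeholder, since this post does not give the real
one. The function name table_to_rdf is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
# Placeholder only: the real WordNet namespace URI is not given in this post.
WN = "http://example.org/wordnet#"

for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("wn", WN)):
    ET.register_namespace(prefix, uri)


def squeeze(text):
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return " ".join(text.split())


def table_to_rdf(table, vocab=WN):
    """Map each data row of an XHTML <table> element to one RDF node."""
    root = ET.Element(f"{{{RDF}}}RDF")
    rows = list(table.iter(f"{{{XHTML}}}tr"))
    if not rows:
        return root
    # The first row supplies the property names, one per <th>.
    props = [squeeze("".join(th.itertext()))
             for th in rows[0].findall(f"{{{XHTML}}}th")]
    # A class attribute on the table turns every row into a typed node.
    cls = table.get("class")
    node_tag = f"{{{vocab}}}{cls}" if cls else f"{{{RDF}}}Description"
    for tr in rows[1:]:
        cells = tr.findall(f"{{{XHTML}}}td")
        if not cells:
            continue
        node = ET.SubElement(root, node_tag)
        for name, td in zip(props, cells):
            prop = ET.SubElement(node, f"{{{vocab}}}{name}")
            link = td.find(f"{{{XHTML}}}a")
            if link is not None and link.get("href"):
                # A hyperlink makes the object a resource with a label.
                prop.set(f"{{{RDF}}}resource", link.get("href"))
                prop.set(f"{{{RDFS}}}label", squeeze("".join(link.itertext())))
            else:
                # Anything else is treated as a literal.
                prop.text = squeeze("".join(td.itertext()))
    return root


# The table from this post, closed and given class="Book" for the typed node.
SAMPLE = """<table xmlns="http://www.w3.org/1999/xhtml" class="Book">
 <caption>Books on Whisky</caption>
 <tr><th>Title</th><th>Author</th><th>Place</th></tr>
 <tr><td>The Original Scotch</td><td><a
href="http://example.org/who/brander">Brander,
Michael</a></td><td>London</td></tr>
</table>"""

rdf = table_to_rdf(ET.fromstring(SAMPLE))
print(ET.tostring(rdf, encoding="unicode"))
```

The sketch makes no attempt to handle multi-word headings or nested
tables; a heading containing a space would yield an invalid element name.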

In this example, the XHTML author does not have to do anything special for
metadata. The only requirement is that s/he writes well-formed XHTML and
writes a table in which the <th> elements properly correspond to the <td>
elements. By binding these headings (<th>) to the WordNet namespace, they
become machine-understandable properties.

(If an XHTML author uses a multi-word <th>, such as "Publishing Place",
this approach fails, but there should be ways to handle such cases.)

Of course, this approach could be applied to other XHTML constructs such
as lists, links, etc. Maybe the <link> element could be used to generate a
'samePropertyAs' element for more precise inference.


2. A proposal of media="meta" for xml-stylesheet PI
---------------------------------------------------

Once an XSLT stylesheet is prepared, we want any Semantic Web agent to know
how to associate it with the document in order to extract meta data.
Usually, an XML document can be associated with a stylesheet by a
processing instruction (PI) [5]. What we need here is a method to specify
that this stylesheet is not for visual presentation, but for meta data
extraction.

It seems natural to use the 'media' pseudo-attribute of this PI to specify
the role (target) of the stylesheet. Since there is a common understanding
that <link rel="meta" ... /> is used to associate external metadata [6], I
would propose applying the same value to this 'media' pseudo-attribute,
hence media="meta". It might be useful to include the 'alternate="yes"'
pseudo-attribute in case the document has another stylesheet for visual
presentation.
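
As an illustration of how an agent might look for such a PI, here is a
rough Python sketch using the standard xml.dom.minidom module. The
function name find_meta_stylesheets, the "screen.xsl" href and the
type="text/xsl" value are assumptions made for the example; the "meta"
href simply reuses the sample stylesheet [4]. The simple regular
expression does not resolve character or entity references inside
pseudo-attribute values.

```python
import re
from xml.dom import minidom

# name="value" or name='value' pairs inside the PI data.
PSEUDO_ATTR = re.compile(r"""(\w[\w.-]*)\s*=\s*(["'])(.*?)\2""")


def find_meta_stylesheets(xml_text, media="meta"):
    """Return the href of every xml-stylesheet PI with the given media."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    # Stylesheet PIs sit in the prolog, i.e. as children of the document.
    for node in doc.childNodes:
        if node.nodeType != node.PROCESSING_INSTRUCTION_NODE:
            continue
        if node.target != "xml-stylesheet":
            continue
        attrs = {m.group(1): m.group(3)
                 for m in PSEUDO_ATTR.finditer(node.data)}
        if attrs.get("media") == media and "href" in attrs:
            hrefs.append(attrs["href"])
    return hrefs


# Hypothetical document with one presentation and one meta stylesheet.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="screen.xsl" type="text/xsl"?>
<?xml-stylesheet href="http://kanzaki.com/docs/sw/table-meta.xsl" type="text/xsl" media="meta" alternate="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>Books on Whisky</title></head>
 <body/>
</html>"""

print(find_meta_stylesheets(DOC))
```

An agent could then fetch the returned stylesheet and apply it to the
document with whatever XSLT processor it has available.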

With this PI, things become quite easy for XHTML authors. All they need to
do is write appropriate XHTML and put the PI before the root element. If
someone provides a common stylesheet library, they do not have to write
their own stylesheets. I believe this will encourage many XHTML authors
who want to join the Semantic Web but do not know how.


[1] http://www.w3.org/2000/07/hs78/
[2] http://www.mysterylights.com/xhtmltordf/
[3] http://www.xml.com/pub/a/2000/11/01/semantwebic/
[4] http://kanzaki.com/docs/sw/table-meta.xsl
[5] http://www.w3.org/TR/xml-stylesheet/
[6] http://infomesh.net/2002/rdfinhtml/


Best regards,

Masahide Kanzaki
http://kanzaki.com/info/webwho.rdf

Received on Sunday, 22 September 2002 11:41:21 UTC