XSLT for screen-scraping RDF out of real-world data from Dan Connolly on 2000-03-21 (www-rdf-interest@w3.org from March 2000)

From: Dan Connolly <connolly@w3.org>
Date: Tue, 21 Mar 2000 12:39:40 -0600
To: www-rdf-interest@w3.org
Message-ID: <38D7C1EC.2062AB7E@w3.org>

Summary:

I believe that one of the best ways to transition into RDF,
if not a long-term deployment strategy for RDF, is to manage the
information in human-consumable form (XHTML) annotated with just
enough info to extract the RDF statements that the human info
is intended to convey. In other words: using a relational
database or some sort of native RDF data store, and spitting
out HTML dynamically, is a lot of infrastructure to operate
and probably not worth it for lots of interesting cases.

We all know that we have to produce a human-readable version
of the thing... why not use that as the primary source?

Details...

During the design of the key/keyref/unique stuff in the XML Schema
WG[1,2] (based on the match/use stuff in XSLT[3]), a lightbulb went on
in my head about the relationship of XML trees to relational tables.
The idiom:

	<key name="pNumKey">
		<selector>..per-row xpath..</selector>
		<field>..xpath to find 1st field in the row..</field>
		<field>..xpath for 2nd field..</field>
		<field>..xpath for 3rd field..</field>
	</key>

extracts a relational table from an XML tree: you get one row
in the table for each node matched by the selector, and one
field in that row for each node matched by a field xpath,
using the row node as the context. Cool, huh?
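
A rough sketch of that selector/field idiom in Python, for anyone who wants to poke at it outside a schema processor. The function name extract_table, the sample <parts> document and the "@attr" shorthand are all assumptions made for illustration; the standard library's ElementTree only understands a small XPath subset, so this is not a full XPath evaluator.

```python
import xml.etree.ElementTree as ET


def extract_table(root, selector, fields):
    """Return one row per node matched by `selector`, one cell per field path.

    Each field path is evaluated with the row node as context, mirroring
    the selector/field split of the <key> idiom. A field that matches
    nothing yields None.
    """
    rows = []
    for row_node in root.findall(selector):
        row = []
        for field in fields:
            if field.startswith("@"):
                # ElementTree paths cannot select attributes, so a
                # leading "@" is handled here as an attribute lookup.
                row.append(row_node.get(field[1:]))
            else:
                row.append(row_node.findtext(field))
        rows.append(tuple(row))
    return rows


# Hypothetical sample data, not taken from the schema drafts.
SAMPLE = """
<parts>
  <part num="p1"><name>bolt</name><price>0.10</price></part>
  <part num="p2"><name>nut</name><price>0.05</price></part>
</parts>
"""

table = extract_table(ET.fromstring(SAMPLE), "part", ["@num", "name", "price"])
for row in table:
    print(row)
```

Each <part> becomes a row and each field path becomes a column, which is all the "XML tree to relational table" observation amounts to.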

Then, exploiting the fact that RDF and relational tables are
pretty much isomorphic[4], it occurred to me that we can use this
idiom to extract RDF data from "real world"[5] stuff: meeting
records (attendee lists, decisions, actions), issue lists,
maybe even hypermail archive indexes, etc. And, of course,
to take this home to where I live, the W3C tech reports
index[6].

Last night, I finally managed to get enough development
tools installed etc. to do some XSLT hacking. I developed
a transformation from the /TR/ page[6] into RDF statements about
dublin core metadata. It's attached in full, but the
gist of it is:

=====
<template match="h:dl/h:dt[./h:b/h:i]">
  <element name="rdf:Description">
   <attribute name="about"><value-of select=".//h:a/@href"/></attribute>
   <dc:title><value-of select=".//h:a"/></dc:title>
   <dc:date><value-of select="substring-before(following-sibling::h:dd, ',')"/></dc:date>
  </element>
</template>
=====

i.e. find all the dt's in dl's that have b and i in them,
and spit out an RDF description of the tech report, giving
the dublin core title and date.
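
For readers who don't speak XSLT, here is roughly the same extraction as a Python sketch that emits (subject, predicate, object) triples instead of RDF/XML. The function names and the sample markup are assumptions for illustration (the sample is not the real /TR/ page), and the handling of the first link is only an approximation of what the XPath expressions select.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}
H = "{http://www.w3.org/1999/xhtml}"


def string_value(node):
    """Concatenate all text under a node, like the XPath string-value."""
    return "".join(node.itertext())


def scrape_reports(root):
    """Return (subject, predicate, object) triples for each matching <dt>."""
    triples = []
    for dl in root.iter(H + "dl"):
        children = list(dl)
        for i, dt in enumerate(children):
            # Same test as the match pattern h:dl/h:dt[./h:b/h:i].
            if dt.tag != H + "dt" or dt.find("h:b/h:i", NS) is None:
                continue
            link = dt.find(".//h:a", NS)
            subject = link.get("href", "") if link is not None else ""
            title = string_value(link) if link is not None else ""
            # First following <dd> sibling, then the text before the
            # first comma; no comma gives "" as substring-before does.
            dd = next((c for c in children[i + 1:] if c.tag == H + "dd"), None)
            text = string_value(dd) if dd is not None else ""
            before, comma, _ = text.partition(",")
            date = before if comma else ""
            triples.append((subject, "dc:title", title))
            triples.append((subject, "dc:date", date))
    return triples


# Hypothetical, hand-written stand-in for a tidied /TR/ page.
SAMPLE = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<dl>
  <dt><b><i><a href="ATAG10">Authoring Tool Accessibility Guidelines 1.0</a></i></b></dt>
  <dd>3 February 2000, editors omitted</dd>
  <dt><a href="other">No bold-italic here, so no match</a></dt>
  <dd>1 January 2000, nobody</dd>
</dl>
</body></html>
"""

triples = scrape_reports(ET.fromstring(SAMPLE))
for triple in triples:
    print(triple)
```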

(oh... I had to xhtml-ize the /TR/ page first... tidy[7] to the rescue!)

The result is:

===
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:h="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/DC"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <rdf:Description about="ATAG10">
      <dc:title>Authoring Tool
Accessibility Guidelines 1.0</dc:title>
      <dc:date>3 February 2000</dc:date>
   </rdf:Description>
...
</rdf:RDF>
===

I need some XSLT functions for relativizing and absolutizing URIs...
but that should be an easy hack.
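
Outside XSLT, the two URI helpers are only a few lines. This is a minimal sketch in Python, assuming the standard library's urljoin is an acceptable absolutizer; the names absolutize and relativize are made up here, and relativize only handles the case where both URIs share a scheme and host.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolutize(base, ref):
    """Resolve a possibly relative reference against a base URI."""
    return urljoin(base, ref)


def relativize(base, target):
    """Express `target` relative to `base` when both share scheme and host.

    Falls back to returning `target` unchanged otherwise.
    """
    b, t = urlsplit(base), urlsplit(target)
    if (b.scheme, b.netloc) != (t.scheme, t.netloc):
        return target
    # Relative references resolve against the directory of the base path.
    base_dir = posixpath.dirname(b.path or "/") or "/"
    target_path = t.path or "/"
    rel = posixpath.relpath(target_path, start=base_dir)
    # relpath drops a trailing slash; put it back so directories survive.
    if target_path.endswith("/") and not rel.endswith("/"):
        rel += "/"
    if t.query:
        rel += "?" + t.query
    if t.fragment:
        rel += "#" + t.fragment
    return rel


BASE = "http://www.w3.org/TR/"
print(absolutize(BASE, "ATAG10"))
print(relativize(BASE, "http://www.w3.org/TR/ATAG10"))
```

A cheap sanity check for this kind of helper is the round trip: absolutize(base, relativize(base, u)) should give back u.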

Anyway... SIRPAC seemed to agree that it was conforming RDF.

There are a couple of ideas I want to pursue:

Idea 1: semantic HTML

Take the information in my XSLT stylesheet, which says something
about the semantics implied by the XHTML stuff, and put it in
the XHTML in the first place. Something like:

	<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
	...
	</head>
	<body>
	...
	<dl>
	  <dt><b><i>spec title...</i></b></dt>
		<dd>1 Mar 2000, Fred and Bob</dd>
	...
	</dl>
	</body>
	<legend xmlns="http://www.w3.org/2000/03/xml-kb/#"
		xmlns:h="http://www.w3.org/1999/xhtml"
		xmlns:dc="http://purl.org/DC/">
	  <each select="h:dl/h:dt[./h:b/h:i]">
	   <asserts subjectRef=".//h:a/@href"
		predicateName="dc:title"
		objectLit=".//h:a" />

	   <asserts subjectRef=".//h:a/@href"
		predicateName="dc:date"
		objectLit="substring-before(following-sibling::h:dd, ',')" />
	  </each>
	</legend>
	</html>

The <legend/> stuff might seem more natural inside the <head>, but
for performance reasons, I think it's better to put it at the end.

hmm... I guess there are syntactic details...
pointing to resources by QName vs. URI, literals vs. URIs,
anonymous nodes, etc.
But I hope you get the idea.

The beauty of it is: you only need an RDF priest to set up the
<legend> in the first place. After that, anybody with basic
word processing skills (well... ok... a little more than that)
can maintain the <dl> with regular old HTML tools (well... ok...
XHTML tools that I expect are just around the corner-- all they
have to do is (a) maintain xml well-formedness and (b) leave
foreign elements alone. I think something like hotmetal
or XED or emacs/vi/notepad would work fine).

And it's not a matter of post-hoc, 3rd party interpretation of the HTML,
the way most screen-scraping is done. These semantics are 1st party
assertions. They can be digitally signed, managed, copied around,
versioned, etc. without jumping through hoops.

XSLT has its warts, but it works and it's getting deployed. It
almost makes me wonder "why bother?" regarding
a new RDF syntax.

I think I can generalize my explicit template-matching XSLT script
into a general purpose <legend> processing script.
Hmm... maybe not... maybe I would need to do two XSLT
transforms: one from <legend> into a concrete XSLT transform,
then another one to extract the RDF data. I'll have to noodle on
it some more...


Idea 2: the paper trail

Have any of you seen timbl's notes on the paper trail[8]?

The idea is, for example: given last month's credit card statement
and a set of transaction receipts, produce a new statement.

Or: given the current version of the W3C tech reports index,
and an approved request to publish, produce the new version
of the tech reports index... if this publication replaces
an existing one, elide the old one.

Or: given a calendar and an appointment request, (a) check
for existing conflicts, and (b) generate the new calendar.

In general: given state N and a transaction log, produce state N+1.
Reminds me of the M3 stableDB thingy[9], or qddb[10], or lots
of other similar hacks.
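
To make "state N plus a transaction log gives state N+1" concrete, here is a minimal Python sketch for the tech reports case. Everything in it is hypothetical: the Report and PublishRequest types, the newest-first ordering, and the sample entries are assumptions for illustration, not how the /TR/ index is actually maintained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    href: str
    title: str
    date: str


@dataclass(frozen=True)
class PublishRequest:
    report: Report
    replaces: Optional[str] = None  # href of the entry to elide, if any


def apply_request(index, request):
    """Return state N+1; the input tuple (state N) is left untouched."""
    if request.replaces is not None and all(
        r.href != request.replaces for r in index
    ):
        # Database-ish integrity: refuse to replace something that isn't there.
        raise ValueError("nothing to replace: " + request.replaces)
    kept = tuple(r for r in index if r.href != request.replaces)
    # Assumption: the index lists the newest publication first.
    return (request.report,) + kept


def replay(index, log):
    """Fold a whole transaction log over an initial state."""
    for request in log:
        index = apply_request(index, request)
    return index


# Hypothetical entries, invented for this example.
state0 = (Report("WD-foo-19991201", "Foo", "1 December 1999"),)
log = [
    PublishRequest(Report("WD-foo-20000301", "Foo", "1 March 2000"),
                   replaces="WD-foo-19991201"),
    PublishRequest(Report("NOTE-bar-20000310", "Bar", "10 March 2000")),
]
state2 = replay(state0, log)
print([r.href for r in state2])
```

Because each step returns a new tuple instead of mutating the old one, every intermediate state can be kept, signed, or diffed, which is the point of keeping a paper trail.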


The general idea here is that I think it's more efficient and robust
to store the XHTML representation of the W3C tech reports index
and serve it out of the filesystem than to generate it out of
a database dynamically. But I do want database-ish integrity.

I suppose this could be done just with XSLT, but I suspect
you'll get more bang-for-your-buck if you
	-- extract RDF using XSLT
		(which allows you to merge from multiple sources without
		thinking hard)
	-- manipulate the RDF in a prolog-ish way
		(hmm... XSLT implementations in java tend to be
		easily extensible... and I'm sure there are plenty
		of logic programming libraries in Java... I suppose
		we could write these manipulations right into
		(extended) XSLT scripts!)
	-- convert the result back to HTML
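
The last two steps of that pipeline can be sketched in Python, taking statements that have already been extracted as plain (subject, predicate, object) tuples. The function names, the ex:year predicate and the single toy rule are assumptions for illustration; a real setup would presumably use a proper logic engine, not a set comprehension.

```python
from html import escape


def merge(*sources):
    """Merging extracted statements is just a set union."""
    merged = set()
    for source in sources:
        merged |= set(source)
    return merged


def infer_years(triples):
    """Hypothetical rule: (X dc:date D) implies (X ex:year last word of D)."""
    return {
        (s, "ex:year", o.split()[-1])
        for (s, p, o) in triples
        if p == "dc:date" and o.split()
    }


def to_html(triples):
    """Render dc:title / dc:date statements back into a <dl>."""
    by_subject = {}
    for s, p, o in sorted(triples):
        # If a property has several values, the last one in sort order wins.
        by_subject.setdefault(s, {})[p] = o
    lines = ["<dl>"]
    for s, props in by_subject.items():
        lines.append('  <dt><b><i><a href="%s">%s</a></i></b></dt>'
                     % (escape(s, quote=True), escape(props.get("dc:title", s))))
        lines.append("  <dd>%s</dd>" % escape(props.get("dc:date", "")))
    lines.append("</dl>")
    return "\n".join(lines)


# Two overlapping sources; the duplicate statement collapses on merge.
source_a = [("ATAG10", "dc:title", "Authoring Tool Accessibility Guidelines 1.0"),
            ("ATAG10", "dc:date", "3 February 2000")]
source_b = [("ATAG10", "dc:date", "3 February 2000")]

merged = merge(source_a, source_b)
derived = infer_years(merged)
page = to_html(merged)
print(page)
```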


[1] 3.9 Identity-constraint Definition Details in the structures spec
http://www.w3.org/TR/xmlschema-1/#Identity-constraint_Definition_details

[2] 4.2 Defining Keys and their References in the primer
http://www.w3.org/TR/xmlschema-0/#specifyingKeys&theirRefs

[3] 12.2 Keys in XSLT
http://www.w3.org/TR/xslt#key

[4] Yang, Thu, 9 Mar 2000 11:59:12 -0800 
http://lists.w3.org/Archives/Public/www-rdf-interest/2000Mar/0074.html

[5] RDF in the real world Stallion, Jason (Cahners) (Mon, Mar 13 2000) 
http://lists.w3.org/Archives/Public/www-rdf-interest/2000Mar/thread.html

[6] W3C Technical Reports and Publications
http://www.w3.org/TR/

[7] Clean up your Web pages with HTML TIDY
http://www.w3.org/People/Raggett/tidy/

[8] TimBL, Feb 1999
http://www.w3.org/DesignIssues/PaperTrail.html

[9] Stable.ig in the Modula 3 library source
http://www.research.digital.com/SRC/m3sources/html/stable/src/Stable.ig.html

[10] The Official Qddb Home Page
http://www.hsdi.com/qddb/

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

<stylesheet 
    xmlns="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:h="http://www.w3.org/1999/xhtml"
    xmlns:dc="http://purl.org/DC"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<output method="xml" indent="yes"/>

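<!-- wrap everything extracted from the page in a single rdf:RDF element -->
<template match="h:html">
 <rdf:RDF>
 <apply-templates/>
 </rdf:RDF>
</template>

<!-- one rdf:Description per tech report: each dt in a dl that has b and i in it -->
<template match="h:dl/h:dt[./h:b/h:i]">
  <element name="rdf:Description">
   <!-- subject: the href of the link in the dt -->
   <attribute name="about"><value-of select=".//h:a/@href"/></attribute>
   <!-- title: the link text -->
   <dc:title><value-of select=".//h:a"/></dc:title>
   <!-- date: the text of the following dd, up to the first comma -->
   <dc:date><value-of select="substring-before(following-sibling::h:dd, ',')"/></dc:date>
  </element>
</template>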

<!-- don't pass text thru -->
<template match="text()|@*">
</template>
</stylesheet>

Received on Tuesday, 21 March 2000 13:42:25 UTC