Re: Extracting information from web pages

From: Danny Ayers <danny.ayers@gmail.com>
Date: Fri, 3 Dec 2004 16:14:06 +0100
Message-ID: <1f2ed5cd0412030714618185b6@mail.gmail.com>
To: John Fletcher <J.P.Fletcher@aston.ac.uk>
Cc: www-rdf-interest@w3.org

On Fri, 03 Dec 2004 14:08:20 -0000, John Fletcher
<J.P.Fletcher@aston.ac.uk> wrote:
> 
> Are there any tools which I could use to extract link information
> from web pages in the form of RDF?
> 
> I have a large personal wiki with over 1000 pages of information
> which I have built up over a long period.  The pages are produced
> as HTML "on the fly" when needed.

How are the pages (and links) currently stored? If it's in an RDB then
the output could be templated as RDF/XML rather than HTML. Rather than
doing any real conversion as such, you could effectively project an
RDF view of the existing store.
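A minimal sketch of that idea, assuming (hypothetically) the wiki keeps a links table in SQLite with `source` and `target` columns and pages live under a base URI - the schema, base URI, and choice of `dc:relation` are all illustrative assumptions, not John's actual setup:

```python
import sqlite3

# Assumed schema and sample data - stand-ins for the real wiki store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (source TEXT, target TEXT)")
conn.execute("INSERT INTO links VALUES ('FrontPage', 'RecentChanges')")

BASE = "http://wiki.example.org/"  # assumed base URI for page resources

header = ('<?xml version="1.0"?>\n'
          '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
          '         xmlns:dc="http://purl.org/dc/elements/1.1/">')
footer = '</rdf:RDF>'

# Template each row straight into RDF/XML - no intermediate model needed.
descriptions = []
for source, target in conn.execute("SELECT source, target FROM links"):
    descriptions.append(
        '  <rdf:Description rdf:about="%s%s">\n'
        '    <dc:relation rdf:resource="%s%s"/>\n'
        '  </rdf:Description>' % (BASE, source, BASE, target))

rdf_xml = "\n".join([header] + descriptions + [footer])
print(rdf_xml)
```

The point being that the same query that feeds the HTML templates can feed an RDF/XML template instead.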

An alternative would be to run through the pages, clean each in turn
to XHTML (e.g. with HTML Tidy) and then apply XSLT to that, again
getting RDF/XML.
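Roughly like this - once Tidy has produced well-formed XHTML (e.g. `tidy -asxhtml -numeric page.html`), the links can be pulled out and re-emitted. The transform step is sketched here with Python's stdlib ElementTree as a stand-in for XSLT; the sample page and output vocabulary are assumptions:

```python
import xml.etree.ElementTree as ET

# Stand-in for Tidy's output on one wiki page.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>FrontPage</title></head>
<body><p><a href="RecentChanges">Recent changes</a></p></body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"
root = ET.fromstring(xhtml)

# Extract every link target, as an XSLT template matching xhtml:a would.
links = [a.get("href") for a in root.iter(XHTML + "a")]

# Re-emit as minimal RDF/XML fragments.
for href in links:
    print('<rdf:Description rdf:about="%s"/>' % href)
```

A real XSLT stylesheet doing the same match on `xhtml:a/@href` would slot in identically, just driven by xsltproc or similar instead of a script.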

> I would like to update the tools to use an RDF aware wiki system,
> but would need to extract the information from the existing wiki.

Platypus Wiki is definitely worth a look in this space:

http://platypuswiki.sourceforge.net/

There's a ModWiki vocab (designed for use with RSS) at:

http://www.usemod.com/cgi-bin/mb.pl?ModWiki

I had a play with RDF+Wiki a while ago (got so far and got distracted,
though I still use the basic Wiki locally as a personal notepad -
strongly recommended), notes mostly around:

http://dannyayers.com/index.php?s=stiki+rdf&submit=Search+Archives

Cheers,
Danny.

-- 

http://dannyayers.com
Received on Friday, 3 December 2004 15:14:09 GMT