Research Topic -- Data Extraction and RDF

Hi, I'm new to the RDF arena but would like to find a research topic in this
area for my master's thesis.  My advisor and I have an idea, and we want to
run it by some other people who know what's going on in this area.  Here's
the idea:

Background:
Our research group (BYU's Data Extraction Group) has done a lot of work on
the automatic extraction of data from semistructured or unstructured
data sources (mainly web pages).  We first define a domain-dependent
extraction ontology that describes the target schema of the data, along with
some keyword and regular expression matching rules.  Then we can take a web
page with data in that domain and extract it automatically into a database.
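
To make the idea of an extraction ontology concrete, here is a minimal
sketch in Python.  It is not our group's actual system; the car-ad domain,
the field names, the patterns, and the extract() helper are all made up for
illustration.  Each field of the target schema carries a regular expression
for its values and an optional list of context keywords, and a field is only
extracted when one of its keywords appears in the text.

```python
import re

# Hypothetical toy "extraction ontology" for a made-up car-ad domain.
# Each field has a value pattern and optional context keywords.
CAR_AD_ONTOLOGY = {
    "year": {"pattern": r"\b(?:19|20)\d{2}\b", "keywords": []},
    "price": {"pattern": r"\$\d[\d,]*", "keywords": ["asking", "price"]},
    "mileage": {
        "pattern": r"\b\d[\d,]*\s*(?:miles|mi)\b",
        "keywords": ["miles", "mileage"],
    },
}


def extract(text, ontology):
    """Return a dict mapping field names to the first matching value."""
    lowered = text.lower()
    record = {}
    for field, rule in ontology.items():
        keywords = rule["keywords"]
        # Skip the field if it lists keywords and none occurs in the text.
        if keywords and not any(k in lowered for k in keywords):
            continue
        match = re.search(rule["pattern"], text, flags=re.IGNORECASE)
        if match:
            record[field] = match.group(0)
    return record


if __name__ == "__main__":
    ad = "1998 Honda Civic, 85,000 miles, asking $4,500 obo"
    print(extract(ad, CAR_AD_ONTOLOGY))
```

A real extraction ontology would also need to handle several records per
page and to choose among competing matches; this sketch simply takes the
first match for each field.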

Where RDF comes into it:
I'm thinking we could make a tool that takes an RDF Schema and
semi-automatically turns it into a data extraction ontology (itself also an
RDF Schema).  The tool would then use that ontology to automatically extract
data from web pages.  Finally, it would structure the data as RDF that could
be inserted into the header of the web page or kept in a repository
somewhere.
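
As a rough illustration of the last step, the sketch below writes one
extracted record out as N-Triples, one triple per field.  Everything in it is
an assumption made for the example: the to_ntriples() helper, the
example.org URIs, and the choice of N-Triples rather than RDF/XML.  It uses
only plain string formatting, and every value is emitted as a plain literal.

```python
def to_ntriples(subject_uri, record, vocab_base):
    """Serialize a flat dict of extracted values as N-Triples lines.

    subject_uri and vocab_base are hypothetical URIs chosen by the caller;
    each dict key becomes a property URI under vocab_base.
    """
    lines = []
    for field, value in sorted(record.items()):
        # Escape the characters that are special inside an N-Triples literal.
        literal = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject_uri}> <{vocab_base}{field}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    record = {"year": "1998", "price": "$4,500"}
    print(to_ntriples("http://example.org/ads/1", record,
                      "http://example.org/car#"))
```

The property URIs here would in practice come from the RDF Schema that the
extraction ontology was derived from, so the output would be described by
that same schema.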

The idea is that the Semantic Web may be prevalent enough sometime in the
future that lots of data will be machine readable by design (i.e. not just
thrown out on the web in HTML for human consumption).  Since that is clearly
not yet the case, we'd like to help it along a little by helping to automate
the conversion from human readable to machine readable.

Please comment on this idea.  Specifically:
	- Is anyone else doing anything similar?
	- Would this be a useful tool/technology?
	- Do you like it?
	- Our main concern is whether or not RDF is really meant to be used to
	  describe data in general.  I know that it has a fairly rich way of
	  creating conceptual models (Schemas), but most of the examples that are
	  prevalent on the web give me the impression that RDF is meant to be used
	  more for meta-data rather than the data itself.
	- Any other thoughts you have about this idea

Received on Wednesday, 30 January 2002 17:38:15 UTC