
RE: metadata processing

From: Seaborne, Andy <Andy_Seaborne@hplb.hpl.hp.com>
Date: Wed, 18 Jun 2003 17:25:01 +0100
Message-ID: <5E13A1874524D411A876006008CD059F06C240D4@0-mail-1.hpl.hp.com>
To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>, "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>

Mark,

That's a useful split.  Cleaning is a common need in data warehousing, so
it seems likely to be a requirement for at least some use cases.

It would fit with the proposed splitting of issues into
Content/Metadata/Vocabulary.  My reading of "3.9 Processing Models" is that
it was about vocabulary/schema level issues, as is, mainly, "3.3 Information
Lifecycle", although the lifecycles have a certain commonality.  We could have
a metadata lifecycle as well as content and vocabulary lifecycles in each of
the respective sections.

Whether this is the right lifecycle split, I will leave to people better
informed about the requirements for this domain, but I observe that section
"3 Metadata" of your original text comes close to the stages below, with the
additional idea that metadata may be further modified (described under
'augmentation') and is not restricted to the original creation/addition
stage.  Architecturally, we have a processing model for metadata coming in
and one for metadata services that change and manage metadata already in the
system, and it would be good if these were not disjoint.

	Andy

-----Original Message-----
From: Butler, Mark [mailto:Mark_Butler@hplb.hpl.hp.com] 
Sent: 18 June 2003 16:42
To: www-rdf-dspace@w3.org
Subject: metadata processing



Hi Team,

I found a description of metadata processing here
http://www.techquila.com/mdf.html

"The driving concept behind MDF is that the processing of metadata involves
a number of different stages. Depending on the source and eventual usage of
the metadata any one or all four of the following stages may be required:

Discovery: the act of trawling some resource set for metadata resources
(which may or may not be combined with the content the metadata describes).

Extraction: the retrieval of metadata from some set of resources.

Cleaning: the processing of metadata from its retrieved format into a format
which is consistent with the final application. This may include lexical
processing, reformatting of data and/or the combining of multiple diverse
metadata vocabularies into a single consistent vocabulary.

Aggregation: the storing of the cleaned metadata together with other
similarly processed metadata.

Within each of these stages, there are any number of different approaches
which could be taken. For example, discovery could be by web-crawling, by
executing searches or by recursing through file system directories.
Extraction may require processing specific to the format of the resource
retrieved. Cleaning could involve simple lexical processing (such as forcing
all strings to a single case or splitting a string on particular boundaries)
or complex extraction processing (such as named entity recognition on text).
Finally the aggregation step might write RDF; a topic map in the XTM
interchange syntax; a topic map in ISO 13250; or might be used to update a
database or other datastore.

MDF attempts to improve the reusability of the different processing
functions for each of these stages by defining a framework in which the
functions may be designed and implemented separately and then linked
together in any combination to provide the desired processing."


I thought this made a lot of sense, so I wondered whether it is worth
describing this discovery, extraction, cleaning and aggregation model in
section 3 of the "relevant technologies document"?
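
To make the four stages concrete, here is a minimal sketch in Python of how
they might be written as separate functions and then linked together. It is
not MDF's API: the function names, the in-memory RESOURCES set and the
"key: value" text format are all assumptions made purely for illustration.
Cleaning here is only the simple lexical kind (lower-casing keys and
splitting a keyword string on ";"), and aggregation just appends to a list
rather than writing RDF or a topic map.

```python
from typing import Dict, List

# Hypothetical resource set (name -> raw content); stands in for a crawl,
# a search result or a directory tree.
RESOURCES: Dict[str, str] = {
    "a.txt": "Title: Metadata Processing\nKeywords: RDF; Topic Maps",
    "b.txt": "title: DSpace History\nkeywords: rdf;dspace",
    "notes.bin": "",
}


def discover(resources: Dict[str, str]) -> List[str]:
    """Discovery: trawl the resource set for resources that may hold metadata."""
    return sorted(name for name in resources if name.endswith(".txt"))


def extract(resources: Dict[str, str], names: List[str]) -> List[dict]:
    """Extraction: retrieve raw 'key: value' pairs from each discovered resource."""
    records = []
    for name in names:
        record = {"source": name}
        for line in resources[name].splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines that are not 'key: value'
                record[key.strip()] = value.strip()
        records.append(record)
    return records


def clean(records: List[dict]) -> List[dict]:
    """Cleaning: force keys to one case and split keyword strings on ';'."""
    cleaned = []
    for record in records:
        out = {}
        for key, value in record.items():
            key = key.lower()
            if key == "keywords":
                out[key] = [k.strip().lower() for k in value.split(";") if k.strip()]
            else:
                out[key] = value
        cleaned.append(out)
    return cleaned


def aggregate(store: List[dict], records: List[dict]) -> List[dict]:
    """Aggregation: store cleaned records alongside previously processed ones."""
    store.extend(records)
    return store


def run_pipeline(resources: Dict[str, str], store: List[dict]) -> List[dict]:
    """Link the four stages together in order."""
    names = discover(resources)
    return aggregate(store, clean(extract(resources, names)))
```

Calling run_pipeline(RESOURCES, []) returns one cleaned record per ".txt"
resource. Because each stage is an ordinary function, clean() could equally
be applied to records already held in the store, which may be one way of
addressing Andy's point about not keeping incoming processing and later
augmentation disjoint.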

As Andy notes, my description of "processing models" really only
concentrates on the discovery section of metadata processing.

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Wednesday, 18 June 2003 12:25:22 EDT
