RE: notes on use cases

David, 

First, thanks for your thoughtful feedback - I have incorporated it into the
issues list.

As I authored some of the material in the document, some of the issues you
raise relate to things I wrote, so I want to explain them in a bit more detail.

The other thing, which caused me some frustration when Mick and I put
together the current document, is that we don't presently have a shared
terminology. I instigated the use of MindManager (a mind mapping tool)
in creating the document, but the way we used it wasn't ideal. If we had
all sat in the same room together during the knowledge capture, a shared
terminology would have started to fall out of that process. In practice,
however, we had several point interactions between subsets of the team, so
this didn't happen, and we may now need to do some work to ensure we
arrive at that shared terminology.

Here I try to give some background on why I'm using the terminology in a
certain way, in the hope this will help. Be warned, though: some of my views
are at odds with those of other people here, but I want to make them
explicit - hopefully we can do this without starting a flame war. Apologies
in advance for any bluntness.

> "Metadata Augmentation" seems odd to restrict to human beings.  Surely
> agents can also augment metadata.  

Okay, so I put this restriction in. First, however, another terminology
question: I'm afraid I'm not fond of the term "agents", because it has never
been clear to me what the difference is between "agents" and "programs". The
use of the term seems to be an unnecessary anthropomorphism, i.e. people are
much happier talking about things as if they were conscious entities, even
when they are clearly not. Now if the term "agent" were used solely to refer
to a particular programming paradigm, e.g. Belief-Desire-Intention, then it
would have more currency, but it is used for a whole range of approaches,
some of which are just standard procedural approaches, e.g. a program that
does a web search, a component in a blackboard architecture, or an
"agent-based model", i.e. a bottom-up, object-oriented model. This problem
with the terminology has been widely noted in the field, e.g. in the Jennings
and Wooldridge paper.

So when you say "agent", what exactly do you mean? Can we replace it with a
more precise term, or are we stuck with agent?

Now to the restriction. When I wrote this section, it wasn't clear to me how
"metadata augmentation" differed from standard data processing and analysis
techniques such as:
- feature extraction
- machine learning, e.g. clustering or categorisation
- rule processing
so I discussed it with Mick, and the one thing that didn't seem to fit into
the categories above was human annotation.

So my questions are:
i) do you think metadata augmentation is different from these processes? If
so, can you outline a definition?

ii) or do you think metadata augmentation subsumes these processes, i.e. that
metadata extraction, data mining and metadata generation are all types of
metadata augmentation? Then again, these may not be the best subheadings
here - perhaps we can agree on better ones?

> We should add the topic of "Metadata presentation"
> 
> "It is essential that instance validation be performed to guard
> against errors and inconsistencies".  I question essentialness, and
> have more of a "best effort" attitude.  It will be useful to validate
> instances, but at the scale we intend to work I doubt we can get it
> completely right.

Okay, again this comes from me. There are two points of view here: the "open
schemas" point of view that Dave Reynolds has detailed, and the "garbage in,
garbage out" point of view, which is where I am aligned.

I need to give some background here: over the past two years, I've been
working on CC/PP and UAProf, two early applications of RDF. The main
conclusion of that work is that these standards were very hard to deploy as a
result of using RDF/XML rather than XML. One of the stated aims of SIMILE is
to investigate whether the SW tools work, but to be honest I have a
frustration here: having already encountered several difficulties using these
tools and techniques on a much simpler use case than SIMILE, I'm afraid I'm
not hopeful about the outcome. I'd much prefer to see these issues addressed
first. However, as these are standards issues, resolving them is more a
political than a technical process.

My work on CC/PP is documented in technical reports available from my web
page, and I recently wrote a paper that summarises these results for the SW
community. However, the review comments I've had back on this paper state
that it is more about problems with CC/PP than with RDF, and that maybe the
real issue is that RDF was the wrong choice for CC/PP.

I disagree: I think the problems encountered in CC/PP are very likely to
occur in other applications of RDF. I don't want to go into all the issues in
detail here - I have a presentation that outlines them - but one lesson from
the CC/PP work is an old rule of thumb in CS: if we are entering data into
machines, it is essential to have some form of validation. In the RDF world,
this doesn't just mean validating that documents conform to RDF/XML; it also
means validating class and property names to guard against spelling mistakes
or capitalisation errors. Also, where possible, it is very desirable to check
that property values conform to a set vocabulary. However, this is at odds
with "open schemas" - in fact someone from the RDF community said that
validating in this way was "against the spirit of RDF". I don't agree, as I
think we need both: some kind of closed schema assumption at ingestion, but
an open schema assumption when we start to machine-process the schemas.
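
To make this concrete, here is a minimal sketch in Python of the kind of
ingestion-time check I mean, written against the rdflib API. The vocabulary,
namespace and instance document are hypothetical examples, not SIMILE data:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    # Hypothetical closed vocabulary: the only properties we accept at ingestion.
    EX = Namespace("http://example.org/vocab#")
    KNOWN_PROPERTIES = {EX.title, EX.creator, EX.date}

    # Hypothetical instance data containing a capitalisation error (ex:Creator).
    instance = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                           xmlns:ex="http://example.org/vocab#">
      <rdf:Description rdf:about="http://example.org/item/1">
        <ex:title>Some item</ex:title>
        <ex:Creator>J. Smith</ex:Creator>
      </rdf:Description>
    </rdf:RDF>"""

    g = Graph()
    g.parse(data=instance, format="xml")

    # Flag any predicate not declared in the vocabulary - this catches the
    # spelling and capitalisation errors that RDF itself silently accepts.
    for s, p, o in g:
        if p != RDF.type and p not in KNOWN_PROPERTIES:
            print("Unknown property %s on %s" % (p, s))

The point is that ex:Creator parses without complaint; only the extra
closed-schema check reports it.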

One problem here is that RDF doesn't provide any languages or tools for
validation, and as RDF/XML permits multiple XML serialisations of the same
graph, it is not possible to leverage existing XML validation work. In CC/PP
and UAProf, we have found that in the absence of validation, profile errors
are widespread and often result in profiles being unprocessable. We've done
some work building validating processors, which has resolved some of the
problems, but such solutions are ugly when the equivalent problems are so
easily solved in XML. I have a frustration here because the RDF community
really doesn't see this as a problem, and the text in the document probably
reflects that frustration, so we need something more balanced - perhaps this
is better?

"It is desirable that instance validation is performed on human entered data
to guard against errors and inconsistencies. However there are limits to the
type of validation that can be performed: for example with a controlled
vocabulary, we can validate a property value conforms to the vocabulary but
we cannot validate that it accurately reflects the real world."
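
As an aside, to make the multiple-serialisations point concrete: the two
hypothetical RDF/XML documents below differ as XML trees, so no single DTD or
XML Schema naturally covers both, yet they encode exactly the same triple. A
quick check with rdflib's graph comparison shows this:

    from rdflib import Graph
    from rdflib.compare import isomorphic

    # Element form: the property as a child element.
    doc_a = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                        xmlns:ex="http://example.org/vocab#">
      <rdf:Description rdf:about="http://example.org/item/1">
        <ex:title>Some item</ex:title>
      </rdf:Description>
    </rdf:RDF>"""

    # Attribute form: the same property as an XML attribute.
    doc_b = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                        xmlns:ex="http://example.org/vocab#">
      <rdf:Description rdf:about="http://example.org/item/1"
                       ex:title="Some item"/>
    </rdf:RDF>"""

    g_a, g_b = Graph(), Graph()
    g_a.parse(data=doc_a, format="xml")
    g_b.parse(data=doc_b, format="xml")
    print(isomorphic(g_a, g_b))  # True: same graph, different XML shapes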

> 3.2.6 
> 
> I am made nervous by the discussion of inferencing search.  This
> formulation takes a tractable database problem and turns it into an
> NP-hard or even undecidable computational nightmare.  

I'm not an expert on Description Logics, but I thought the point of DLs was
that they ensured the reasoning problems were not NP-hard (although Jeremy
Carroll, our OWL WG member here at HP, has recently expressed some concern
about whether this is true of the current version of OWL). Have I
misunderstood something here?

I came up with what I think is a helpful analogy here: we can consider
description logics and inferencing as a form of information compression, i.e.
we can reduce a set of triples to a compressed form (the ontology) and then
retrieve the uncompressed version (the ontology plus its entailments) via an
algorithm. We perform the compression for two reasons: firstly to simplify
the task of description, and secondly to reduce the amount of memory required
for representation. Of course, when we need that information we are faced
with a trade-off between processing cost and memory cost during
decompression, i.e. do we decompress once and keep the result in memory, or
do we decompress on demand? This is another reason for wanting to keep our
A-Box and T-Box concerns separate: if they are separate, then we can perform
this compression, but if some of the T-Box information is in the instance
data (for example datatype information), we cannot compress it in the same
way.
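
A toy sketch of the analogy in Python, using hypothetical class names: the
subclass axioms are the compressed form, and materialising the entailed type
statements is the decompression step.

    # T-Box: subClassOf axioms, the "compressed" form (child -> parent).
    subclass_of = {
        "JournalArticle": "Article",
        "Article": "Document",
    }

    # A-Box: only the asserted types are stored.
    asserted = [("item1", "JournalArticle")]

    def superclasses(cls):
        """Walk up the subclass chain, yielding every ancestor class."""
        while cls in subclass_of:
            cls = subclass_of[cls]
            yield cls

    # "Decompression": materialise entailed type statements on demand.
    entailed = [(item, sup)
                for item, cls in asserted
                for sup in superclasses(cls)]
    print(asserted + entailed)
    # [('item1', 'JournalArticle'), ('item1', 'Article'), ('item1', 'Document')]

The trade-off above is then whether we store the entailed statements (memory
cost) or recompute them per query (processing cost).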

However, I must admit the use of the term "inference" here also makes me
nervous, because in the machine learning community inference means something
quite different. OilEd uses the term "classification", which seems more
natural.

Again, my mention of inferencing search may be biased by my views: based on
my work on CC/PP, I think if you just compare RDF/XML and XML as data
formats, XML is much better because it is more mature, and there are few, if
any, reasons to use RDF. In my view, this is why RDF has not had the
widespread adoption the W3C would like. The problem could be solved by
reformulating RDF/XML so that it is compatible with XML tools: then users
wouldn't have to choose between XML and RDF, but could use a subset of XML
that is fully processable by both XML and RDF tools.

Therefore, if we are to find compelling reasons in SIMILE to use SW
techniques, I think we need to look further up the SW stack. Furthermore, it
seems to me that RDFS is largely redundant, as OWL can do everything RDFS
can, so this means looking at OWL (or, while OWL is under development,
DAML+OIL). However, others here may not agree with that viewpoint.

To return to your point though, would you be happier if we had a section
called "search" where "inferencing search" was one of several subsections?
If so what would the other subsections be?

> Equivalence does feel like a "safe" specialized form of inference
> worth giving attention to.

Yes, I'd agree with you here. In CC/PP we found equivalence was essential
once we started to use multiple versions of vocabularies. Again, this takes
me back to RDFS: if RDFS offered equivalence, then there would be a set of
tasks you could perform with just RDFS, so there might be a reason for
modularising OWL to have an RDFS subset. Without equivalence, RDFS is very
limited.
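
For illustration, here is a sketch of the vocabulary-versioning case, with
hypothetical v1/v2 property URIs: an equivalence map lets a processor
normalise profiles written against the old vocabulary before querying.

    # Hypothetical equivalence map between two vocabulary versions.
    EQUIVALENT = {
        "http://example.org/ccpp/v1#screenSize":
            "http://example.org/ccpp/v2#displaySize",
    }

    def normalise(triples):
        """Rewrite predicates from old vocabulary versions to the current one."""
        for s, p, o in triples:
            yield (s, EQUIVALENT.get(p, p), o)

    old_profile = [("http://example.org/device/1",
                    "http://example.org/ccpp/v1#screenSize", "320x240")]
    print(list(normalise(old_profile)))
    # [('http://example.org/device/1',
    #   'http://example.org/ccpp/v2#displaySize', '320x240')]

Without something like this map in the schema language itself, every
processor has to hard-code the correspondence.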
 
> 3.3.1 Dissemination of information to humans.

Thanks for the references in section 3.3.1. 
 
> 3.4
> 
> Distributed resources add a whole different scope.  It opens up a host
> of nasty problems of course.  We could avoid them by limiting our
> dealing with distributed metadata to devising a simple
> block-transfer protocol, getting all the metadata to a single
> location, and dealing with it there.  The metadata might not be fully
> up to date, but it avoids a lot of trouble.  Since even in the centralized
> scenario everything is hard, perhaps we defer distributed search?

Yes, I'd agree here, although Mick notes that for the Genesis folks this may
be their particular domain of interest, so it may be something we need to
discuss further at the PI meeting.
 
> 4.6
> 
> It is definitely important and interesting, but web site ingest feels
> less within SIMILE scope than all the other use cases---perhaps
> because of its lack of metadata issues.  Seems more like a straight
> preservation task?

Again this is something we need to discuss at the PI meeting, because
website ingest maps onto several of MacKenzie's use cases, but I'd agree
that it lacks metadata issues. 

> 4.7 
> 
> While human-opinion metadata is mentioned, the bulk of this item
> talks about "usage based" metadata like query history.  These are
> quite different - e.g., usage-based metadata is collected without the user
> noticing, while opinion metadata needs an interface that lets the user
> record it.

Thanks, your distinction between the two was helpful to me here.

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
