Re: notes on use cases from David R. Karger on 2003-04-07 (www-rdf-dspace@w3.org from April 2003)

From: David R. Karger <karger@theory.lcs.mit.edu>
Date: Mon, 7 Apr 2003 01:51:31 -0400
To: Mark_Butler@hplb.hpl.hp.com
CC: www-rdf-dspace@w3.org
Message-Id: <200304070551.h375pVO4001760@harrier.lcs.mit.edu>
   From: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
   Date: Fri, 4 Apr 2003 15:52:17 +0100 
   X-Mailing-List: <www-rdf-dspace@w3.org> archive/latest/114


   David, 

   Firstly thanks for your thoughtful feedback - I have incorporated it in the
   issues list.

   As I authored some of the material in the document, some of the issues here
   relate to things I wrote. So I want to explain them in a bit more detail.

   The other thing, which caused me some frustration when Mick and I put
   together the current document, is that presently we don't have a shared
   terminology here. I instigated the use of MindManager (a mind mapping tool)
   in creating the document, but the way we used this wasn't ideal. If we had
   all sat in the same room together during the knowledge capture, then a
   shared terminology would have started to fall out of this process. However
   in practice we had several point interactions between subsets of the team,
   so this didn't happen. So now we may need to do some work to ensure we
   arrive at that shared terminology. 

   Here I try to give some background about why I'm using the terminology in a
   certain way, in the hope this will help. Be warned though some of my views
   are at odds with those of some other people here, but I want to make them explicit -
   hopefully we can do this without starting a flame war. Apologies for
   bluntness in advance. 

   > "Metadata Augmentation" seems odd to restrict to human beings.  Surely
   > agents can also augment metadata.  

   Okay, so I put this restriction in. However first another terminology
   question: I'm afraid I'm not fond of the term "agents" because it has never
   been clear to me what the difference is between "agents" and "programs". The
   use of the term seems to be an unnecessary anthropomorphism i.e. people are
   much happier talking about things as if they were conscious entities, even
   when they are clearly not. Now if the term agent was used solely to refer to
   a particular programming paradigm e.g. Beliefs-Desires-Intentions then it
   would have more currency, but it is used for a whole range of approaches,
   some of which are just standard procedural approaches e.g. a program that
   does a web search, a component in a black-board architecture, or an
   "agent-based model" i.e. bottom-up, object oriented model. This problem with
   the terminology has been widely noted in the field e.g. the Jennings and
   Wooldridge paper. 

   So when you say "agent", what exactly do you mean? Can we replace it with a
   more precise term, or are we stuck with agent?

I'm quite agnostic.  For me agent is not a precise technical term---I
just say agent when I think of a piece of software that a user leaves
running to do things for them, acting on new information, without much
input from the user.

   Now to the restriction. When I wrote this section, it wasn't clear to me how
   "metadata augmentation" differed from standard data processing and analysis
   techniques such as
   - feature extraction 
   - machine learning e.g. clustering or categorisation
   - rule processing
   so I discussed it with Mick and the one thing that didn't seem to fit in the
   categories above was human annotation. 

   So my questions are:
   i) do you think metadata augmentation is different to these processes? If
   so, can you outline a definition?

I think that each of these processes can be used for metadata
augmentation, but that none of them IS metadata augmentation.  I read
metadata augmentation in plain English: it is the process of adding
metadata to an object.  I do sense a distinction between metadata
augmentation and extraction: extraction is introspective, looking at
the object itself in order to deduce metadata about it; augmentation
involves finding and attaching metadata from some other source.  Note
this is a fuzzy distinction, not formal, but I think that is
reasonable for the given goal of identifying research directions.
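
A minimal sketch of the extraction/augmentation distinction described
above. Everything in it is hypothetical: the function names, the sample
document, and the external catalogue are invented for illustration only.

```python
# Hypothetical illustration: all names and data here are invented.

def extract_metadata(document_text):
    """Extraction: derive metadata by inspecting the object itself."""
    lines = document_text.splitlines()
    return {
        "title": lines[0].strip() if lines else "",
        "word_count": len(document_text.split()),
    }


def augment_metadata(metadata, object_id, external_catalogue):
    """Augmentation: attach metadata found in some other source."""
    augmented = dict(metadata)  # leave the extracted metadata untouched
    augmented.update(external_catalogue.get(object_id, {}))
    return augmented


document = "Notes on use cases\nSome body text here."
catalogue = {"doc-1": {"subject": "digital libraries"}}

extracted = extract_metadata(document)
augmented = augment_metadata(extracted, "doc-1", catalogue)
```

Here the title and word count come from the object, while the subject
comes from a separate source, which is where the (fuzzy) line falls.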

   > We should add the topic of "Metadata presentation"
   > 
   > "It is essential that instance validation be performed to guard
   > against errors and inconsistencies".  I question essentialness, and
   > have more of a "best effort" attitude.  It will be useful to validate
   > instances, but at the scale we intend to work I doubt we can get it
   > completely right.

   Okay, again this comes from me. There are two points of view here: the "open
   schemas" point of view that Dave Reynolds has detailed, and the "garbage in,
   garbage out" point of view which is where I am aligned. 

   I need to give some background here: over the past two years, I've been
   working on CC/PP and UAProf, which are two early applications of RDF. The
   main conclusion of that work has been that these standards were very hard to
   deploy as a result of using RDF/XML rather than XML. One of the stated aims
   of SIMILE is to investigate whether the SW tools work, but to be honest I
   have a frustration here: as I've already demonstrated several difficulties
   with using these tools and techniques with a much simpler use case than
   SIMILE, I'm afraid I'm not hopeful about the outcome of SIMILE. I'd much
   prefer to see these issues addressed first. However as these are standards
   issues, resolving them is more a political than a technical process. 

   My work on CC/PP is documented in technical reports available from my web
   page and I recently wrote a paper that summarises these results for the SW
   community. However the review comments I've had back on this paper have
   stated that the paper is more about problems with CC/PP rather than RDF, and
   that maybe the real issue here is that RDF was the wrong choice for CC/PP. 

   I disagree with this and I think the problems encountered in CC/PP are very
   likely to occur in other applications of RDF. I don't want to go into all
   the issues in detail here - I have a presentation that outlines them - but
   one lesson from the CC/PP work is an old rule of thumb in CS: if we are
   entering data into machines it is essential to have some form of validation. In
   the RDF world, this doesn't just mean validating that documents conform to
   RDF/XML, it also means validating class and property names to guard against
   spelling mistakes or capitalization errors. Also where possible, it is very
   desirable to check that property values conform to a set vocabulary. However
   this is at odds with "open schemas" - in fact someone from the RDF community
   said that validating in this way was "against the spirit of RDF". However I
   don't agree with this as I think we need both - some kind of closed schema
   assumption at ingestion, but then an open schema assumption when we start to
   machine process the schemas. 
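
The kind of ingestion-time check described above could be sketched as
follows. This is only an illustration under stated assumptions: the
property names and the controlled vocabulary are invented, and a plain
dictionary stands in for a parsed profile or record.

```python
# Hypothetical schema: property names and vocabulary are invented.
KNOWN_PROPERTIES = {"title", "creator", "format"}
CONTROLLED_VALUES = {"format": {"text/html", "application/pdf"}}


def validate_record(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    for prop, value in record.items():
        if prop not in KNOWN_PROPERTIES:
            # A case-insensitive match suggests a capitalisation error.
            near = [k for k in KNOWN_PROPERTIES if k.lower() == prop.lower()]
            hint = f" (did you mean '{near[0]}'?)" if near else ""
            problems.append(f"unknown property '{prop}'{hint}")
        elif prop in CONTROLLED_VALUES and value not in CONTROLLED_VALUES[prop]:
            problems.append(f"value '{value}' not in vocabulary for '{prop}'")
    return problems


problems = validate_record({"Title": "Use cases", "format": "pdf"})
```

Such a check can catch a misspelt property name or an out-of-vocabulary
value, but not whether a valid value is actually true of the object.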

   One problem here is that RDF doesn't provide any languages or tools for doing
   validation, and as RDF/XML allows multiple XML serialisations it is not
   possible to leverage existing XML validation work. In CC/PP and UAProf, we
   have found that in the absence of validation, profile errors are widespread
   and often result in profiles being unprocessable. We've done some work
   building validating processors which have resolved some of the problems, but
   such solutions are ugly when they are so easily solvable in XML. However I
   have a frustration here because the RDF Community really don't see this as a
   problem. So perhaps the text in the document reflects my frustration and we
   need something more balanced - perhaps this is better?

   "It is desirable that instance validation is performed on human entered data
   to guard against errors and inconsistencies. However there are limits to the
   type of validation that can be performed: for example with a controlled
   vocabulary, we can validate a property value conforms to the vocabulary but
   we cannot validate that it accurately reflects the real world."

My position arises from my focus on human interaction with metadata.
Validation is great.  But any system that forbids human beings from entering
the data they want to enter because it isn't "valid" is a system
humans will be unwilling to use.  

I don't have any objection to validation being performed; my concern
is that when validation fails, rejecting the data is not a valid
response.  I see schemas as "useful representations of common cases to
which certain behaviors might apply".
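
One way to read the "validate but never reject" position is sketched
below. It is an assumption-laden illustration, not a proposal from the
thread: the expected property set, the record layout, and the `ingest`
function are all hypothetical.

```python
# Hypothetical sketch; the schema and record layout are assumptions.
EXPECTED_PROPERTIES = {"title", "creator", "date"}


def ingest(record, store):
    """Always store the record; attach warnings instead of rejecting it."""
    warnings = sorted(
        f"unrecognised property: {prop}"
        for prop in record
        if prop not in EXPECTED_PROPERTIES
    )
    store.append({"data": dict(record), "warnings": warnings})
    return warnings


store = []
ingest({"title": "Field notes", "mood": "optimistic"}, store)
```

The unexpected property is kept alongside a warning, so the schema
guides behavior for the common case without blocking the uncommon one.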

   > 3.2.6 
   > 
   > I am made nervous by the discussion of inferencing search.  This
   > formulation takes a tractable database problem and turns it into an
   > NP-hard or even undecidable computational nightmare.  

   I'm not an expert on Description Logics, but I thought the point of DLs
   was that they ensured the problems were not NP-hard (although Jeremy Carroll,
   our OWL WG member here at HP has expressed some concern recently about
   whether this is true in the current version of OWL). Have I misunderstood
   something here?

   I came up with what I think is a helpful analogy here: we can consider
   description logics and inferencing as a form of information compression i.e.
   we can reduce a set of triples to a compressed form (the ontology) and then
   we can retrieve the uncompressed version (the ontology and the entailments)
   via an algorithm. We perform the compression for two reasons: firstly to
   simplify the task of description, and secondly to reduce the amount of
   memory required for representation. Of course, when we need that
   information, this means we are faced with a trade-off between processing
   cost and memory cost during decompression i.e. do we do it once and keep it
   in memory or do we do it on demand. This is another reason for wanting to
   keep our A-Box and T-Box concerns separate: if they are separate, then we
   can perform this compression, but if some of the T-Box information is in the
   instance data (for example datatype information), we can't compress it in
   the same way. 

I've been running into a lot of trouble over the meaning of
"ontology", so I'm not even going to try to deal with it but just ask
you to define exactly what you mean by it.

In this age of infinite storage (or so we'd prefer to think) there
doesn't seem to be much reason to compress (discard information).
Rather, there's a question of how much expansion we want to do "in
advance" and how much "on demand".  Purely from a simplicity
perspective, I lean towards "in advance".  Much as with distributed
resources (discussed later), I think we have a huge number of
unsolved questions, even if we avoid all on demand inferencing and
settle for using the information available at query time.
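
The "in advance" versus "on demand" choice can be sketched with a toy
subclass hierarchy. The class names and instance data are invented, and
the sketch assumes single inheritance with no cycles; it is meant only
to show the trade-off, not any particular reasoner.

```python
# Hypothetical data: class names and instances are invented.
SUBCLASS_OF = {"Thesis": "Document", "Document": "Resource"}  # schema
INSTANCE_TYPES = {"item42": "Thesis"}                         # instance data


def superclasses(cls):
    """Walk up the subclass chain, yielding cls and every ancestor."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def expand_in_advance():
    """Materialise every entailed (instance, class) pair once, up front."""
    return {
        (inst, c)
        for inst, asserted in INSTANCE_TYPES.items()
        for c in superclasses(asserted)
    }


def has_type_on_demand(inst, cls):
    """Derive the answer at query time without storing entailments."""
    return cls in superclasses(INSTANCE_TYPES.get(inst))


materialised = expand_in_advance()
```

With the expansion done in advance, a query is a plain set lookup such
as `("item42", "Resource") in materialised`; the on-demand version
stores less but repeats the walk for every query.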

Again, I'm not fanatical on this; I recognize that occasionally
inference will be useful.  Just want to emphasize the alternative.
Indeed, my sense of the word "search" is that it focuses on the
already-present information.

   However I must admit I'm made nervous by the use of the term inference here
   also, because in the machine learning community inference means something
   quite different. OilEd uses the term classification which seems more
   natural. 

The ML and logic communities have distinct and equally valid notions
of inference.  I don't care too much about what term we pick.

   Again mentioning inferencing search may be biased by my views: based on my
   work on CC/PP, I think if you just compare RDF/XML and XML as data formats,
   XML is much better because it is more mature and there are few if any
   reasons to use RDF. In my view, this is why RDF has not had the widespread
   adoption the W3C would like. This problem can be solved by reformulating
   RDF/XML so that it is compatible with XML tools, so that users don't have to
   choose XML or RDF, they can choose a subset of XML that is fully processable
   by XML and RDF tools. 

From my perspective, the reason to use RDF is that it is a convenient
model in which to represent semantic networks.  Doing this in plain
XML, which is really targeted at hierarchical data representations, is
much more of a hassle.

   Therefore in SIMILE, if we are to find compelling reasons to use SW
   techniques, I think we need to look further up the SW stack. Furthermore it
   seems to me that RDFS is largely redundant as OWL can do everything RDFS
   can, so it seems this means looking at OWL (or, while it is under
   development, DAML+OIL). However others here may not agree with that
   viewpoint. 

Again, from a human perspective, there are certain issues (eg,
synonymous properties) that (i) are conceivably solvable without
complex inference and (ii) seem very likely to arise when we let
everyone coin their own ontologies.
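
As a rough sketch of handling synonymous properties without complex
inference, equivalent property names can be folded onto one
representative before matching. The property names and equivalence
pairs below are hypothetical, and union-find is just one convenient way
to do the folding.

```python
# Hypothetical property names; the equivalence pairs are assumptions.
EQUIVALENT = [("libA:author", "libB:creator"), ("libB:creator", "libC:writer")]


def build_canonical_map(pairs):
    """Map each property to one representative of its equivalence class."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller name as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {p: find(p) for p in list(parent)}


CANONICAL = build_canonical_map(EQUIVALENT)


def matches(record, prop, value):
    """True if the record holds `value` under `prop` or any synonym of it."""
    target = CANONICAL.get(prop, prop)
    return any(
        CANONICAL.get(k, k) == target and v == value for k, v in record.items()
    )
```

A query on `libA:author` then also finds records that used
`libC:writer`, with no reasoning beyond a dictionary lookup.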

   To return to your point though, would you be happier if we had a section
   called "search" where "inferencing search" was one of several subsections?
   If so what would the other subsections be?

   > Equivalence does feel like a "safe" specialized form of inference
   > worth giving attention to.

   Yes I'd agree with you here. In CC/PP we found equivalence was essential
   when we started to use multiple versions of vocabularies. Again this takes
   me back to RDFS. If RDFS offered equivalence, then there would be a set of
   tasks you could perform with just RDFS so there might be a reason for
   modularising OWL to have an RDFS subset. Without equivalence, RDFS is very
   limited. 

   > 3.4
   > 
   > Distributed resources adds whole different scope.  It opens up a host
   > of nasty problems of course.  We could avoid them by limiting our
   > dealing with distributed metadata to devising a simple
   > block-transfer protocol, getting all the metadata to a single
   > location, and dealing with it there.  The metadata might not be fully
   > up to date, but it avoids a lot of trouble.  Since even in a centralized
   > scenario everything is hard, perhaps we defer distributed search?

   Yes, I'd agree here although Mick notes that for the Genesis folks this may
   be their particular domain of interest. So this may be something we need to
   discuss further at the PI meeting. 

   > 4.6
   > 
   > It is definitely important and interesting, but web site ingest feels
   > less within SIMILE scope than all the other use cases---perhaps
   > because of its lack of metadata issues.  Seems more like a straight
   > preservation task?

   Again this is something we need to discuss at the PI meeting, because
   website ingest maps onto several of MacKenzie's use cases, but I'd agree
   that it lacks metadata issues. 

   > 4.7 
   > 
   > While human-opinion metadata is mentioned, the bulk of this item
   > talks about "usage based" metadata like query history.  These are
   > quite different---eg, usage based is collected without the user noticing,
   > while opinion-metadata needs an interface that lets the user record such.

   Thanks, your distinction between the two was helpful to me here.

   Dr Mark H. Butler
   Research Scientist                HP Labs Bristol
   mark-h_butler@hp.com
   Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 7 April 2003 01:47:26 UTC