- From: David R. Karger <karger@theory.lcs.mit.edu>
- Date: Mon, 7 Apr 2003 01:51:31 -0400
- To: Mark_Butler@hplb.hpl.hp.com
- CC: www-rdf-dspace@w3.org
From: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
Date: Fri, 4 Apr 2003 15:52:17 +0100

David,

Firstly thanks for your thoughtful feedback - I have incorporated it in the issues list. As I authored some of the material in the document, some of the issues here relate to things I wrote. So I want to explain them in a bit more detail.

The other thing, which caused me some frustration when Mick and I put together the current document, is that presently we don't have a shared terminology here. I instigated the use of MindManager (a mind mapping tool) in creating the document, but the way we used this wasn't ideal. If we had all sat in the same room together during the knowledge capture, then a shared terminology would have started to fall out of this process. However in practice we had several point interactions between subsets of the team, so this didn't happen. So now we may need to do some work to ensure we arrive at that shared terminology.

Here I try to give some background about why I'm using the terminology in a certain way, in the hope this will help. Be warned though: some of my views are at odds with those of some other people here, but I want to make them explicit - hopefully we can do this without starting a flame war. Apologies for bluntness in advance.

> "Metadata Augmentation" seems odd to restrict to human beings. Surely
> agents can also augment metadata.

Okay, so I put this restriction in. However first another terminology question: I'm afraid I'm not fond of the term "agents" because it has never been clear to me what the difference is between "agents" and "programs". The use of the term seems to be an unnecessary anthropomorphism i.e. people are much happier talking about things as if they were conscious entities, even when they are clearly not. Now if the term agent was used solely to refer to a particular programming paradigm e.g.
Beliefs-Desires-Intentions then it would have more currency, but it is used for a whole range of approaches, some of which are just standard procedural approaches e.g. a program that does a web search, a component in a black-board architecture, or an "agent-based model" i.e. a bottom-up, object-oriented model. This problem with the terminology has been widely noted in the field e.g. the Jennings and Wooldridge paper. So when you say "agent", what exactly do you mean? Can we replace it with a more precise term, or are we stuck with agent?

I'm quite agnostic. For me agent is not a precise technical term---I just say agent when I think of a piece of software that a user leaves running to do things for them, acting on new information, without much input from the user.

Now to the restriction. When I wrote this section, it wasn't clear to me how "metadata augmentation" differed from standard data processing and analysis techniques such as

- feature extraction
- machine learning e.g. clustering or categorisation
- rule processing

so I discussed it with Mick and the one thing that didn't seem to fit in the categories above was human annotation. So my questions are: i) do you think metadata augmentation is different to these processes? If so, can you outline a definition?

I think that each of these processes can be used for metadata augmentation, but that none of them IS metadata augmentation. I read metadata augmentation in plain English: it is the process of adding metadata to an object. I do sense a distinction between metadata augmentation and extraction: extraction is introspective, looking at the object itself in order to deduce metadata about it; augmentation involves finding and attaching metadata from some other source. Note this is a fuzzy distinction, not formal, but I think that is reasonable for the given goal of identifying research directions.
> We should add the topic of "Metadata presentation"
>
> "It is essential that instance validation be performed to guard
> against errors and inconsistencies". I question essentialness, and
> have more of a "best effort" attitude. It will be useful to validate
> instances, but at the scale we intend to work I doubt we can get it
> completely right.

Okay, again this comes from me. There are two points of view here: the "open schemas" point of view that Dave Reynolds has detailed, and the "garbage in, garbage out" point of view which is where I am aligned.

I need to give some background here: over the past two years, I've been working on CC/PP and UAProf, which are two early applications of RDF. The main conclusion of that work has been that these standards were very hard to deploy as a result of using RDF/XML rather than XML. One of the stated aims of SIMILE is to investigate whether the SW tools work, but to be honest I have a frustration here: as I've already demonstrated several difficulties with using these tools and techniques with a much simpler use case than SIMILE, I'm afraid I'm not hopeful about the outcome of SIMILE. I'd much prefer to see these issues addressed first. However as these are standards issues, resolving them is more a political than a technical process.

My work on CC/PP is documented in technical reports available from my web page and I recently wrote a paper that summarises these results for the SW community. However the review comments I've had back on this paper have stated that the paper is more about problems with CC/PP than with RDF, and that maybe the real issue here is that RDF was the wrong choice for CC/PP. I disagree with this and I think the problems encountered in CC/PP are very likely to occur in other applications of RDF.
I don't want to go into all the issues in detail here - I have a presentation that outlines them - but one lesson from the CC/PP work is an old rule of thumb in CS: if we are entering data into machines it is essential to have some form of validation. In the RDF world, this doesn't just mean validating that documents conform to RDF/XML, it also means validating class and property names to guard against spelling mistakes or capitalization errors. Also where possible, it is very desirable to check that property values conform to a set vocabulary. However this is at odds with "open schemas" - in fact someone from the RDF community said that validating in this way was "against the spirit of RDF". However I don't agree with this as I think we need both - some kind of closed schema assumption at ingestion, but then an open schema assumption when we start to machine process the schemas.

One problem here is that RDF doesn't provide any languages or tools for doing validation and, as RDF/XML allows multiple XML serialisations of the same data, it is not possible to leverage existing XML validation work. In CC/PP and UAProf, we have found that in the absence of validation, profile errors are widespread and often result in profiles being unprocessable. We've done some work building validating processors which have resolved some of the problems, but such solutions are ugly when the problems are so easily solvable in XML. However I have a frustration here because the RDF community really doesn't see this as a problem.

So perhaps the text in the document is reflecting my frustration here, and we need something more balanced - perhaps this is better?

"It is desirable that instance validation is performed on human entered data to guard against errors and inconsistencies. However there are limits to the type of validation that can be performed: for example with a controlled vocabulary, we can validate that a property value conforms to the vocabulary but we cannot validate that it accurately reflects the real world."
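As a rough illustration of the ingestion-time checks described above (property names checked for spelling and capitalization, values checked against a set vocabulary), here is a minimal Python sketch. The property names, the vocabulary, and the function name validate_profile are all hypothetical and are not taken from CC/PP, UAProf, or any real schema. The sketch reports problems as warnings and leaves the decision about what to do with them to the caller.

```python
# Hypothetical vocabulary: names are illustrative, not from any real schema.
KNOWN_PROPERTIES = {"screenWidth", "screenHeight", "colorCapable"}
ALLOWED_VALUES = {"colorCapable": {"Yes", "No"}}


def validate_profile(profile):
    """Return a list of warning strings for a dict of property -> value."""
    warnings = []
    # Index known names case-insensitively to catch capitalization slips.
    lowered = {name.lower(): name for name in KNOWN_PROPERTIES}
    for prop, value in profile.items():
        if prop not in KNOWN_PROPERTIES:
            suggestion = lowered.get(prop.lower())
            if suggestion:
                warnings.append(
                    f"unknown property {prop!r}; did you mean {suggestion!r}?"
                )
            else:
                warnings.append(f"unknown property {prop!r}")
            continue
        # Check the value only where a controlled vocabulary is defined.
        allowed = ALLOWED_VALUES.get(prop)
        if allowed is not None and value not in allowed:
            warnings.append(f"value {value!r} not in vocabulary for {prop!r}")
    return warnings
```

Calling validate_profile({"ScreenWidth": "640", "colorCapable": "Maybe"}) would flag both entries. Note that this only checks conformance to the vocabulary; it cannot tell whether a conforming value is true of the real world.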
My position arises from my focus on human interaction with metadata. Validation is great. But any system that forbids human beings from entering the data they want to enter because it isn't "valid" is a system humans will be unwilling to use. I don't have any objection to validation being performed; my concern is that when validation fails, rejecting the data is not a valid response. I see schemas as "useful representations of common cases to which certain behaviors might apply".

> 3.2.6
>
> I am made nervous by the discussion of inferencing search. This
> formulation takes a tractable database problem and turns it into an
> NP-hard or even undecidable computational nightmare.

I'm not an expert on Description Logics, but I thought the point of DLs was that they ensured the problems were not NP-hard (although Jeremy Carroll, our OWL WG member here at HP, has expressed some concern recently about whether this is true in the current version of OWL). Have I misunderstood something here?

I came up with what I think is a helpful analogy here: we can consider description logics and inferencing as a form of information compression i.e. we can reduce a set of triples to a compressed form (the ontology) and then we can retrieve the uncompressed version (the ontology and the entailments) via an algorithm. We perform the compression for two reasons: firstly to simplify the task of description, and secondly to reduce the amount of memory required for representation. Of course, when we need that information, this means we are faced with a trade-off between processing cost and memory cost during decompression i.e. do we do it once and keep it in memory or do we do it on demand. This is another reason for wanting to keep our A-Box and T-Box concerns separate: if they are separate, then we can perform this compression, but if some of the T-Box information is in the instance data (for example datatype information), we can't compress it in the same way.
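The compression analogy above can be made concrete with a small Python sketch. It assumes a toy T-Box of single-parent, acyclic subclass links and a toy A-Box of asserted types; the class names, instance names, and function names are all hypothetical. One function expands the entailed types at query time, the other expands them once and stores the result, which is the processing-versus-memory trade-off described in the text.

```python
# Hypothetical T-Box: direct subclass links (child -> parent).
# Assumed to be acyclic with at most one parent per class.
SUBCLASS_OF = {"Thesis": "Document", "Document": "Resource"}

# Hypothetical A-Box: instance -> asserted class.
INSTANCE_OF = {"item1": "Thesis", "item2": "Document"}


def superclasses(cls):
    """Walk the subclass chain upward, yielding cls and every ancestor."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def types_on_demand(instance):
    """Expand at query time: no extra storage, repeated work per query."""
    return set(superclasses(INSTANCE_OF[instance]))


def materialize():
    """Expand once in advance: more memory, cheap lookups afterwards."""
    return {inst: set(superclasses(cls)) for inst, cls in INSTANCE_OF.items()}
```

Both routes give the same answer for an instance such as "item1"; they differ only in when the work is done. A real reasoner handles far richer constructs than a subclass chain, so this is only meant to show the shape of the choice.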
I've been running into a lot of trouble over the meaning of "ontology", so I'm not even going to try to deal with it but just ask you to define exactly what you mean by it. In this age of infinite storage (or so we'd prefer to think) there doesn't seem to be much reason to compress (discard information). Rather, there's a question of how much expansion we want to do "in advance" and how much "on demand". Purely from a simplicity perspective, I lean towards "in advance". Much as with distributed resources (discussed later), I think we have a huge number of unsolved questions, even if we avoid all on-demand inferencing and settle for using the information available at query time. Again, I'm not fanatical on this; I recognize that occasionally inference will be useful. Just want to emphasize the alternative. Indeed, my sense of the word "search" is that it focuses on the already-present information.

However I must admit I'm made nervous by the use of the term inference here also, because in the machine learning community inference means something quite different. OilEd uses the term classification which seems more natural.

The ML and logic communities have distinct and equally valid notions of inference. I don't care too much about what term we pick.

Again mentioning inferencing search may be biased by my views: based on my work on CC/PP, I think if you just compare RDF/XML and XML as data formats, XML is much better because it is more mature and there are few if any reasons to use RDF. In my view, this is why RDF has not had the widespread adoption the W3C would like. This problem can be solved by reformulating RDF/XML so that it is compatible with XML tools, so that users don't have to choose XML or RDF: they can choose a subset of XML that is fully processable by XML and RDF tools.

From my perspective, the reason to use RDF is that it is a convenient model in which to represent semantic networks.
Doing this in plain XML, which is really targeted at hierarchical data representations, is much more of a hassle.

Therefore in SIMILE, if we are to find compelling reasons to use SW techniques, I think we need to look further up the SW stack. Furthermore it seems to me that RDFS is largely redundant as OWL can do everything RDFS can, so it seems this means looking at OWL (or, while it is under development, DAML+OIL). However others here may not agree with that viewpoint.

Again, from a human perspective, there are certain issues (e.g., synonymous properties) that (i) are conceivably solvable without complex inference and (ii) seem very likely to arise when we let everyone coin their own ontologies.

To return to your point though, would you be happier if we had a section called "search" where "inferencing search" was one of several subsections? If so what would the other subsections be?

> Equivalence does feel like a "safe" specialized form of inference
> worth giving attention to.

Yes I'd agree with you here. In CC/PP we found equivalence was essential when we started to use multiple versions of vocabularies. Again this takes me back to RDFS. If RDFS offered equivalence, then there would be a set of tasks you could perform with just RDFS, so there might be a reason for modularising OWL to have an RDFS subset. Without equivalence, RDFS is very limited.

> 3.4
>
> Distributed resources adds a whole different scope. It opens up a host
> of nasty problems of course. We could avoid them by limiting our
> dealing with distributed metadata to devising a simple
> block-transfer protocol, getting all the metadata to a single
> location, and dealing with it there. The metadata might not be fully
> up to date, but it avoids a lot of trouble. Since even in a centralized
> scenario everything is hard, perhaps we defer distributed search?

Yes, I'd agree here although Mick notes that for the Genesis folks this may be their particular domain of interest.
So this may be something we need to discuss further at the PI meeting.

> 4.6
>
> It is definitely important and interesting, but web site ingest feels
> less within simile scope than all the other use cases---perhaps
> because of its lack of metadata issues. Seems more like a straight
> preservation task?

Again this is something we need to discuss at the PI meeting, because website ingest maps onto several of MacKenzie's use cases, but I'd agree that it lacks metadata issues.

> 4.7
>
> While human-opinion metadata is mentioned, the bulk of this item
> talks about "usage based" metadata like query history. These are
> quite different, e.g., usage based is collected without the user noticing,
> while opinion-metadata needs an interface that lets the user record such.

Thanks, your distinction between the two was helpful to me here.

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 7 April 2003 01:47:26 UTC