- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Tue, 15 Jun 2010 08:15:53 +1000
- To: "Stuart A. Yeates" <syeates@gmail.com>
- Cc: nathan@webr3.org, Linked Data community <public-lod@w3.org>
On 15 June 2010 07:09, Stuart A. Yeates <syeates@gmail.com> wrote:
> On Thu, May 13, 2010 at 6:43 PM, Nathan <nathan@webr3.org> wrote:
>> Thus, do we currently have, or can we find a single, simple way to express
>> that document X contains further information for subject Y that primarily
>> uses the predicate Z.
>
> I'm not certain, but I'm pretty sure that this should be: "document X
> contains further information for subject Y that focuses on the
> predicate Z"
>
> "primarily uses" is dangerous because many data representations end up
> primarily using the very common predicates from the rdf: rdfs: and dc
> namespaces.
>
> In information retrieval terms, what would be more useful is a tf-idf
> approach (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf ).

TF-IDF is useful for ranking a large number of documents where you
don't have any access to semantic information so you are relying
completely on reference counts. In this case, there may not be a large
number of mixed documents, as we may just be trying to split up a big
document about a single topic into smaller documents, and the purpose
of each of the smaller documents could easily be designated by the
predicates that it contains.

I would not view the words "primarily uses" as making it impossible
for the document to use other predicates, or for other documents to
use the same predicate. It is just like saying "the most common verb
in this document is Z", without bias to other documents using the same
description if necessary.

Even in the case where we are trying to describe a large number of
mixed documents that never originated from the same source, you could
still use the same approach of focusing on the type of information
that is contained in each document. It just wouldn't be as optimised
as a split from a big document to targeted smaller documents.

Cheers,

Peter
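
To make the two readings in this exchange concrete, here is a minimal sketch contrasting "primarily uses" (the most frequent predicate in a document) with a tf-idf style "focus" predicate. The document names, the predicate lists, and the helper functions are all hypothetical and only illustrate the idea; the weighting is the plain textbook form (relative frequency times log of inverse document frequency), not something proposed in the thread.

```python
import math
from collections import Counter

# Hypothetical documents, each reduced to the list of predicates it uses.
docs = {
    "doc_names": ["rdf:type", "rdfs:label", "foaf:name", "foaf:name", "foaf:name"],
    "doc_friends": ["rdf:type", "rdfs:label", "foaf:knows", "foaf:knows"],
    "doc_papers": ["rdf:type", "rdf:type", "rdf:type", "rdfs:label",
                   "dc:creator", "dc:creator"],
}


def most_common_predicate(predicates):
    """'Primarily uses': the predicate with the highest raw count."""
    return Counter(predicates).most_common(1)[0][0]


def tfidf_scores(documents):
    """Score each predicate per document as tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each predicate appears.
    doc_freq = Counter()
    for predicates in documents.values():
        doc_freq.update(set(predicates))
    scores = {}
    for name, predicates in documents.items():
        counts = Counter(predicates)
        total = len(predicates)
        scores[name] = {
            pred: (count / total) * math.log(n_docs / doc_freq[pred])
            for pred, count in counts.items()
        }
    return scores


def focus_predicate(documents):
    """'Focuses on': the predicate with the highest tf-idf score."""
    return {name: max(s, key=s.get) for name, s in tfidf_scores(documents).items()}


if __name__ == "__main__":
    for name, predicates in docs.items():
        print(name, most_common_predicate(predicates), focus_predicate(docs)[name])
```

In this toy data the two measures agree for documents that were split along predicate lines (doc_names, doc_friends), which matches the case described above. They differ for doc_papers, where the ubiquitous rdf:type has the highest raw count but a tf-idf weight of zero because it appears in every document.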
Received on Monday, 14 June 2010 22:16:25 UTC