Re: Delegation and splitting the description of a subject over multiple documents

On 15 June 2010 07:09, Stuart A. Yeates <syeates@gmail.com> wrote:
> On Thu, May 13, 2010 at 6:43 PM, Nathan <nathan@webr3.org> wrote:
>> Thus, do we currently have, or can we find a single, simple way to express
>> that document X contains further information for subject Y that primarily
>> uses the predicate Z.
>
> I'm not certain, but I'm pretty sure that this should be: "document X
> contains further information for subject Y that focuses on the
> predicate Z"
>
> "primarily uses" is dangerous because many data representations end up
> primarily using the very common predicates from the rdf: rdfs: and dc
> namespaces.
>
> In information retrieval terms, what would be more useful is a tf-idf
> approach (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf ).

TF-IDF is useful for ranking a large number of documents when you have
no access to semantic information and so must rely entirely on term
counts. In this case there may not be a large number of mixed
documents: we may just be splitting one big document about a single
topic into smaller documents, and the purpose of each smaller document
could easily be indicated by the predicates it contains.

I would not read the words "primarily uses" as preventing the document
from using other predicates, or other documents from using the same
predicate. It is just like saying "the most common verb in this
document is Z", which does not stop other documents from carrying the
same description if necessary.

Even in the case where we are trying to describe a large number of
mixed documents that did not originate from the same source, you could
still use the same approach of focusing on the type of information
that is contained in each document. It just wouldn't be as optimised
as splitting a big document into targeted smaller documents.

Cheers,

Peter

Received on Monday, 14 June 2010 22:16:25 UTC