re: data smushing from David Megginson on 2000-12-30 (www-rdf-interest@w3.org from December 2000)

From: David Megginson <david@megginson.com>
Date: Sat, 30 Dec 2000 07:21:33 -0500 (EST)
To: xml-dev@lists.xml.org, www-rdf-interest@w3.org
Message-ID: <14925.54093.932391.700154@megginson.com>

Seth Russell writes:

 > So true, the Semantic Web doesn't work without "data smushing"!  I
 > think we should even apply "data smushing" to nodes with URIs,
 > cause there gonna be people misapplying URIs.  My question is: has
 > anybody come up with some good algorithms for "data smushing" ?  (I
 > love that term, I've used it 3 times now.)  Maybe we should come up
 > with a schema for expressing smushing rules in RDF ... any hint of
 > that being done yet?

There are two separate problems here:

1. combining data from two different sources; and

2. pruning redundant entities.

The two problems are independent.  Different sources may well use the
same URI to identify the same entity, so combining them requires no
pruning; likewise, a single source with a large database might end up
with many duplicate versions of the same entity shadowing each other.

Outside the research lab, #2 is extremely difficult.  For #1, however,
all we have to do is extend the (oversimplified version of the) RDF
logical model to include one more member:

  {predicate, subject, object, source}

where source is a URI representing the source of the information
(probably, but not necessarily, the URL of an RDF document; it could
also be a URI representing a news wire, for example).  Now, query
operations, searches, etc. can take into account where the information
came from, and can distinguish, say, two "name" properties provided by
the same source from two "name" properties provided by two different
sources.
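
A minimal sketch of that idea in Python, assuming a flat in-memory list
of statements.  The names `Statement` and `values_by_source` and every
example URI below are invented for illustration; they are not part of
the RDF spec or of any library.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    """An RDF statement extended with the URI of its source (hypothetical type)."""
    predicate: str
    subject: str
    object: str
    source: str


def values_by_source(statements, subject, predicate):
    """Group the objects of matching statements by the source that asserted them."""
    grouped = defaultdict(list)
    for st in statements:
        if st.subject == subject and st.predicate == predicate:
            grouped[st.source].append(st.object)
    return dict(grouped)


# Example data; all URIs and names are made up.
NAME = "http://example.org/terms#name"
PERSON = "http://example.org/people/1"
store = [
    Statement(NAME, PERSON, "Alice Smith", "http://example.org/a.rdf"),
    Statement(NAME, PERSON, "A. Smith", "http://example.org/a.rdf"),
    Statement(NAME, PERSON, "Alice B. Smith", "http://example.org/b.rdf"),
]

# Two names come from one source, one name from another.
names = values_by_source(store, PERSON, NAME)
```

With the source carried on every statement, a query can treat the two
names from a.rdf as alternatives offered by one source, and the name
from b.rdf as a separate claim to be reconciled or ignored.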

<rant>

As I've mentioned many times before, the published RDF logical model
needs to be extended anyway because it does not distinguish specific
subjects from open-ended subject patterns (rdf:aboutEachPrefix), it
does not distinguish literal objects from resource objects, and it
does not allow for xml:lang (which the RDF spec states is significant
in RDF processing).  A logical model that takes all of this into
account would look something like

  {predicate, subject, subjectType, object, objectType, lang}

or, with the source information

  {predicate, subject, subjectType, object, objectType, lang, source}

You could argue that subjectType is an internal trait of the subject,
and that objectType and lang are internal traits of the object, but
then the grammar needs to be elaborated properly:

  statement: predicate, subject, object

  predicate: URI

  subject: URI, subjectType

  subjectType: ("uri" | "pattern")

  object: (URI | LITERAL), objectType, lang

  objectType: ("literal" | "resource")

  lang: LITERAL
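
One way to read that grammar is as nested record types.  The sketch
below uses Python dataclasses; all class and field names are
assumptions made for illustration.  It also assumes that the object's
`value` holds either a URI or literal text depending on `object_type`,
and that `lang` may be absent, neither of which the grammar states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubjectType(Enum):
    URI = "uri"          # a specific subject
    PATTERN = "pattern"  # an open-ended pattern, e.g. rdf:aboutEachPrefix


class ObjectType(Enum):
    LITERAL = "literal"
    RESOURCE = "resource"


@dataclass(frozen=True)
class Subject:
    uri: str
    subject_type: SubjectType = SubjectType.URI


@dataclass(frozen=True)
class Object:
    value: str                  # assumption: URI or literal text
    object_type: ObjectType
    lang: Optional[str] = None  # assumption: xml:lang may be missing


@dataclass(frozen=True)
class Statement:
    predicate: str
    subject: Subject
    object: Object


# Example data; the URIs are made up.
subj = Subject("http://example.org/doc")
title = "http://example.org/terms#title"
en = Statement(title, subj, Object("Chat", ObjectType.LITERAL, "en"))
fr = Statement(title, subj, Object("Chat", ObjectType.LITERAL, "fr"))

# Same predicate, subject, and literal text, but the statements differ by lang.
same = en == fr
```

Under the bare {predicate, subject, object} model those two statements
would be indistinguishable; here they compare as different because
lang is part of the object.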

It's still not all that bad, but the

  {predicate, subject, object}

thing was always bogus.

</rant>

All the best,

David

-- 
David Megginson                 david@megginson.com
           http://www.megginson.com/

Received on Saturday, 30 December 2000 11:55:04 UTC