Re: Consolidating RDF XML data with different ids

On 1/22/06, David Pratt <fairwinds@eastlink.ca> wrote:

> Let's say I have a schema for automobiles where there are a number of
> properties that describe each car including model, manufacturer etc. I
> also have a schema for car parts.

Ok, I'll have a quick crack going down into a bit of detail (probably
errors here but they usually get caught quickly on this list ;-)

Seems to me there are two strategies that stand out: direct and
indirect mapping between the different data schemes. I don't think
this depends too much on the particular modelling language; an RDBMS
schema or UML might do the same job (though not really XML
Schema, as that tells you about the syntax, not the things the
language describes).

Anyhow, let's say two companies have data about a particular model of
car. Corgi's id for a Lamborghini Marzal is "clm" and Matchbox's
"mlm". So Corgi's vocabulary might be translatable to something like
this:

[
a rdfs:Class ;
c:id "clm" ;
c:name "Lamborghini Marzal"
] .

and Matchbox's:

[
a rdfs:Class ;
m:id "mlm" ;
m:manufacturer "Lamborghini" ;
m:model "Marzal"
] .

(Apologies if my n3/Turtle is out, but hopefully it's clear what's
intended - not sure about bnodes for classes either, assume an
arbitrary URI if necessary)

Now say a particular car known to Corgi is a Lamborghini Marzal, for
simplicity's sake let's give it a URI:

<http://example.org/this-car> c:id "clm" .

A direct mapping here would be to say that the two classes of car are
somehow the same. Within this there are still a few options. There's
owl:sameAs, which is generally used to say two individuals are the
same, but we're talking classes here (it could be used this way in OWL
Full, but it doesn't really capture the right idea). If one class were
more general than the other (e.g. Matchbox's was only by manufacturer)
then one might say:

[
a rdfs:Class ;
c:id "clm" ;
c:name "Lamborghini Marzal"
]
rdfs:subClassOf
[
a rdfs:Class ;
m:id "mlm" ;
m:manufacturer "Lamborghini"
] .

(It would be likely that both c:id and m:id were inverse functional
properties, i.e. that they did disambiguate the class identification).
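
To make the inverse functional property point a bit more concrete, here is a
rough Python sketch (standard library only). It assumes triples are plain
tuples and that c:id has been declared inverse functional; the node names
_:b1 and _:b2 and the helper smush are made up for illustration. Any two
subjects that share the same c:id value are treated as the same node:

```python
def smush(triples, ifp):
    """Merge subjects that share a value for the inverse functional property `ifp`."""
    canonical = {}   # ifp value -> first subject seen with that value
    rename = {}      # subject -> the subject it is merged into
    for s, p, o in triples:
        if p == ifp:
            rename[s] = canonical.setdefault(o, s)
    # Rewrite every triple so merged nodes use the canonical name.
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples}


# Two descriptions of (hypothetically) the same class, under different bnodes.
triples = [
    ("_:b1", "c:id", "clm"),
    ("_:b1", "c:name", "Lamborghini Marzal"),
    ("_:b2", "c:id", "clm"),
    ("_:b2", "rdf:type", "rdfs:Class"),
]
merged = smush(triples, "c:id")
```

After the merge everything hangs off one node, so the duplicate c:id triple
collapses. A real store would do this over URIs and bnodes rather than
strings, and would have to decide what to do when one subject carries two
different id values.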

This would allow the inference that if a particular car was a
Lamborghini Marzal according to Corgi, it would also be a Lamborghini
according to Matchbox. But in this particular case a car that is a
Lamborghini Marzal according to Corgi will also be one according to
Matchbox. So one is a subClassOf the other and vice versa. A pair of
rdfs:subClassOf statements would say this, though there's shorthand in
the form of owl:equivalentClass.

So that's what I'll call direct mapping. Indirect mapping would be
through other terms/vocabularies, and one general solution would be to
use terms which subsume those that need mapping. So -

:LM a rdfs:Class .

[
a rdfs:Class ;
c:id "clm"
] rdfs:subClassOf :LM .

[
a rdfs:Class ;
m:id "mlm" ;
m:manufacturer "Lamborghini" ;
m:model "Marzal"
] rdfs:subClassOf :LM .

In some respects this may be an easier approach than direct mapping -
there's a bit less commitment involved in adding another subclass when
another car manufacturer comes along. But there is a cost to the
reduced commitment, in that

<http://example.org/this-car> c:id "clm" .

would give us

<http://example.org/this-car> rdf:type :LM .

but wouldn't give

<http://example.org/this-car> m:id "mlm" .

Of course it might still be possible to make the subClass relationship
symmetrical again, i.e. make :LM owl:equivalentClass to the other
classes.
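
The difference between the one-way and the symmetrical version can be
sketched in a few lines of Python (standard library only). This is only an
illustration under some assumptions: the company classes get the made-up
names c:ClassCLM and m:ClassMLM instead of bnodes, each is taken to be a
subclass of the shared :LM class (the direction the inference above needs),
and "this car has c:id clm" is read as membership of Corgi's class. The
function applies the RDFS rules rdfs9 and rdfs11 until nothing new appears:

```python
def infer_types(triples):
    """Apply rdfs9 (type propagation) and rdfs11 (subClassOf transitivity) to a fixpoint."""
    facts = set(triples)
    # owl:equivalentClass is treated as a pair of rdfs:subClassOf statements.
    for s, p, o in triples:
        if p == "owl:equivalentClass":
            facts |= {(s, "rdfs:subClassOf", o), (o, "rdfs:subClassOf", s)}
    while True:
        new = set()
        for s, p, o in facts:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in facts:
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))          # rdfs9
                if p2 == "rdfs:subClassOf" and o2 == s:
                    new.add((s2, "rdfs:subClassOf", o))   # rdfs11
        if new <= facts:
            return facts
        facts |= new


car = "http://example.org/this-car"
base = [(car, "rdf:type", "c:ClassCLM")]

one_way = base + [
    ("c:ClassCLM", "rdfs:subClassOf", ":LM"),
    ("m:ClassMLM", "rdfs:subClassOf", ":LM"),
]
symmetric = base + [
    ("c:ClassCLM", "owl:equivalentClass", ":LM"),
    ("m:ClassMLM", "owl:equivalentClass", ":LM"),
]

one_way_facts = infer_types(one_way)
symmetric_facts = infer_types(symmetric)
```

With the one-way statements the car ends up typed as :LM but not as
Matchbox's class; with the equivalences it picks up both.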

On top of the simple equivalence mapping there could be constraints
etc. that may or may not be expressible in RDF/OWL.

I think it's probably true that mapping between entities is usually
easier than mapping between relationships; there seems to be more that
can vary, e.g.

<http://example.org/this-car> c:color <http://corgi.com/colors#Green> .

<http://example.org/this-car> m:color "#00FF00" .

<http://example.org/this-car> x:colour "Green" .

etc.

> Given this scenario, what would be the best approach for consolidating
> the information as much as possible.

I'm not sure there is enough information here to determine "best", in
fact it might not be possible to tell until you've done a full
implementation of all the promising approaches and evaluated. Even
then there may not be a clear best ;-)

> I would appreciate comments on how
> one might accomplish this in a way that may not produce much unnecessary
> duplication in the data store.

That could make a significant difference, although I would imagine it
would depend a lot on the particular store implementation. My guess is
that to optimise on that aspect you'd need complete RDF/OWL reasoning plus
any extra rules to infer syntax and other 'hidden' mappings, like:

_:x m:color "#00FF00" .
=>
_:x x:colour "Green" .
=>
_:x x:color "green" .

Once you'd got all that, it should be possible in principle to keep
the graph lean (not sure what there is implementation-wise, Reto's got
a leanifier at http://gmuer.ch/2005/11/24/making-graphs-lean).
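
For what it's worth, the chain of 'hidden' mapping rules above could be
sketched as a tiny forward-chaining loop in Python (standard library only).
The lookup table HEX_TO_NAME and the rule functions are hypothetical and
only there to illustrate the idea; real vocabularies would need their own
tables and rules:

```python
HEX_TO_NAME = {"#00FF00": "Green"}   # assumed lookup table, not from any real vocabulary


def rule_hex_to_name(triple):
    """m:color hex value  =>  x:colour name."""
    s, p, o = triple
    if p == "m:color" and o in HEX_TO_NAME:
        yield (s, "x:colour", HEX_TO_NAME[o])


def rule_lowercase(triple):
    """x:colour name  =>  x:color lowercase name."""
    s, p, o = triple
    if p == "x:colour":
        yield (s, "x:color", o.lower())


def forward_chain(triples, rules):
    """Apply every rule to every fact until no new triples are derived."""
    facts = set(triples)
    agenda = list(facts)
    while agenda:
        t = agenda.pop()
        for rule in rules:
            for derived in rule(t):
                if derived not in facts:
                    facts.add(derived)
                    agenda.append(derived)
    return facts


facts = forward_chain(
    [("_:x", "m:color", "#00FF00")],
    [rule_hex_to_name, rule_lowercase],
)
```

Starting from the single m:color triple, both other forms are derived.
Whether the derived triples are then stored or recomputed on demand is a
store-level choice.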

Cheers,
Danny.





--

http://dannyayers.com

Received on Sunday, 22 January 2006 12:04:33 UTC