advice on duplicate literals from Johnson, Matthew C. (LNG-HBE) on 2008-04-29 (semantic-web@w3.org from April 2008)

From: Johnson, Matthew C. (LNG-HBE) <Matthew.C.Johnson@lexisnexis.com>
Date: Tue, 29 Apr 2008 08:57:14 -0400
To: <semantic-web@w3.org>
Message-ID: <0FE5E87C5F0AE84B8C667FDC5224F6DA01FDC073@LNGDAYEXCP01VC.legal.regn.net>

I'd like to see if anyone can offer any advice on how to clean up
duplicate literal values from a store of triples.  I am working on a
process to extract relationships from a set of DTDs where the subjects
include items like elements and attributes and the predicates are
properties such as "attributeOf", "hasAttribute", etc.  One of the
datatype properties is simply a title/name.  My process parses the DTD
syntax into an XML form and then uses XSLT to generate RDF (actually OWL)
from there.  Each element/attribute generates a new set of triples as it
is found.  Since an attribute (for example xml:id) exists on many
elements, I am getting duplicate literal values for such attributes.

 

myinst:a a myns:Element .
myinst:a dc:title "a" .

myinst:id a myns:Attribute .
myinst:id dc:title "id" .
myinst:id myns:attributeOf myinst:a .

myinst:b a myns:Element .
myinst:b dc:title "b" .

myinst:id a myns:Attribute .
myinst:id dc:title "id" .
myinst:id myns:attributeOf myinst:b .

As you can see, the triples for the "id" attribute will be generated many
times.  The same is true for elements that appear in more than one DTD,
but the problem is less noticeable there.  I could probably rework the
XSLT to produce the triples for a given attribute only once per run, but
since I am processing multiple DTDs on separate passes, the duplicates
are going to resurface anyway.
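
One way to keep duplicates from resurfacing across separate passes is to
remember which statements have already been written.  The following is
only a sketch under stated assumptions: each pass is assumed to yield its
triples as plain (subject, predicate, object) tuples, and the names
append_new_triples, seen, pass_one and pass_two are hypothetical, not part
of any existing tool.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def append_new_triples(seen: Set[Triple], batch: Iterable[Triple]) -> List[Triple]:
    """Return the triples in `batch` that are not yet in `seen`.

    `seen` is updated in place, so later passes skip anything
    that an earlier pass already produced.
    """
    fresh: List[Triple] = []
    for triple in batch:
        if triple not in seen:
            seen.add(triple)
            fresh.append(triple)
    return fresh


# Hypothetical output of two passes, mirroring the example triples.
pass_one = [
    ("myinst:a", "rdf:type", "myns:Element"),
    ("myinst:a", "dc:title", '"a"'),
    ("myinst:id", "rdf:type", "myns:Attribute"),
    ("myinst:id", "dc:title", '"id"'),
    ("myinst:id", "myns:attributeOf", "myinst:a"),
]
pass_two = [
    ("myinst:b", "rdf:type", "myns:Element"),
    ("myinst:b", "dc:title", '"b"'),
    ("myinst:id", "rdf:type", "myns:Attribute"),
    ("myinst:id", "dc:title", '"id"'),
    ("myinst:id", "myns:attributeOf", "myinst:b"),
]

seen: Set[Triple] = set()
first = append_new_triples(seen, pass_one)
second = append_new_triples(seen, pass_two)  # the two repeated "id" triples are skipped
```

To carry this across separate runs, the seen set would have to be saved
and reloaded between passes; how to do that is left open here.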

 

As of now, I am not using any sort of triple store other than the
filesystem.  In order to remove the duplicate literal values (e.g.
"id"), I have started to reprocess the results of the XSLT through CWM
(cwm --rdf mystuff.rdf), and this appears to collapse the duplicate
literals into a single value, presumably because it builds the model from
all inputs and then regenerates a new serialization that, as a side
effect, contains each literal value only once.  I suppose the same is
happening with the merge of the URI values.
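
The collapse described above is what one would expect from any tool that
holds the statements in a set, since RDF defines a graph as a set of
triples.  The sketch below illustrates only that idea; it is not how CWM
is implemented and not a real Turtle parser.  It assumes one statement
per line in the form "subject predicate object .", and the function names
parse_statements and serialize are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def parse_statements(text: str) -> Set[Triple]:
    """Read one 's p o .' statement per line into a set of triples."""
    triples: Set[Triple] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("."):
            line = line[:-1].rstrip()  # drop the statement terminator
        subject, predicate, obj = line.split(None, 2)
        triples.add((subject, predicate, obj))  # a repeated triple is a no-op
    return triples


def serialize(triples: Set[Triple]) -> str:
    """Write each triple exactly once, in sorted order."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in sorted(triples))


source = """
myinst:a a myns:Element .
myinst:a dc:title "a".
myinst:id a myns:Attribute .
myinst:id dc:title "id" .
myinst:id myns:attributeOf myinst:a .
myinst:b a myns:Element .
myinst:b dc:title "b" .
myinst:id a myns:Attribute .
myinst:id dc:title "id" .
myinst:id myns:attributeOf myinst:b .
"""

graph = parse_statements(source)
output = serialize(graph)
print(output)
```

The ten input statements reduce to eight distinct triples, because the
type and title statements for myinst:id each occur twice.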

 

My question is whether this is a common scenario and whether there is a
better method for handling it.  I am only using CWM because that is what
I already had.  My concern is that the collapsing of these values by CWM
(or whatever tool) doesn't seem to be an advertised "feature", and I'm
not sure whether the implementation could change in the future.  Again,
I'm not picking on CWM; it's just an example of a tool.

 

Does anyone have thoughts on whether it is an implied/expected feature
of RDF tools [that perform merges] that they also collapse these literal
values?

 

Thanks in advance,

 

Matt

Received on Tuesday, 29 April 2008 12:58:43 UTC