- From: Johnson, Matthew C. (LNG-HBE) <Matthew.C.Johnson@lexisnexis.com>
- Date: Tue, 29 Apr 2008 08:57:14 -0400
- To: <semantic-web@w3.org>
- Message-ID: <0FE5E87C5F0AE84B8C667FDC5224F6DA01FDC073@LNGDAYEXCP01VC.legal.regn.net>
I'd like to see if anyone can offer advice on how to clean up duplicate literal values in a store of triples. I am working on a process to extract relationships from a set of DTDs, where the subjects include items like elements and attributes and the predicates are properties such as "attributeOf", "hasAttribute", etc. One of the datatype properties is simply a title/name. My process parses the DTD syntax into an XML form and then uses XSLT to generate RDF (actually OWL) from there. Each element/attribute generates a new set of triples as it is found. Since an attribute (for example xml:id) exists on many elements, I am getting duplicate literal values for such attributes.

    myinst:a a myns:Element .
    myinst:a dc:title "a" .
    myinst:id a myns:Attribute .
    myinst:id dc:title "id" .
    myinst:id myns:attributeOf myinst:a .

    myinst:b a myns:Element .
    myinst:b dc:title "b" .
    myinst:id a myns:Attribute .
    myinst:id dc:title "id" .
    myinst:id myns:attributeOf myinst:b .

As you can see, the triples for the "id" attribute will be generated many times. The same is true for elements (since I am planning on processing multiple DTDs), but the problem is less noticeable there. The problem is compounded a little because I am also processing multiple DTDs. I could probably rework the XSLT to produce the triples for a given attribute only once, but since I'm processing multiple DTDs (on separate passes), the issue is going to resurface anyway.

As of now, I am not using any sort of triple store other than the filesystem. To remove the duplicate literal values (e.g. "id"), I have started to reprocess the results of the XSLT through CWM (cwm -rdf mystuff.rdf), and this appears to collapse the literals into a single unique value, presumably because it builds the model from all inputs and then regenerates a new serialization that, as a side effect, only has a single literal value. I suppose the same is happening with the merge of the URI values.
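The collapsing described above is consistent with the RDF abstract syntax, which defines a graph as a set of triples, so a triple stated twice is still one triple once loaded. The following is only a rough sketch of that idea in plain Python, not of how cwm works internally: triples are modelled as tuples of strings, and the `merge` helper and the `pass_a`/`pass_b` names are made up for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical stand-in for a parsed triple: (subject, predicate, object).
Triple = Tuple[str, str, str]


def merge(*sources: Iterable[Triple]) -> Set[Triple]:
    """Load every source into one set; identical triples are stored once."""
    graph: Set[Triple] = set()
    for source in sources:
        graph.update(source)
    return graph


# Output of two separate XSLT passes, as in the example triples above.
pass_a = [
    ("myinst:a", "rdf:type", "myns:Element"),
    ("myinst:a", "dc:title", '"a"'),
    ("myinst:id", "rdf:type", "myns:Attribute"),
    ("myinst:id", "dc:title", '"id"'),
    ("myinst:id", "myns:attributeOf", "myinst:a"),
]
pass_b = [
    ("myinst:b", "rdf:type", "myns:Element"),
    ("myinst:b", "dc:title", '"b"'),
    ("myinst:id", "rdf:type", "myns:Attribute"),
    ("myinst:id", "dc:title", '"id"'),
    ("myinst:id", "myns:attributeOf", "myinst:b"),
]

merged = merge(pass_a, pass_b)
# Titles attached to myinst:id after the merge.
titles = [t for t in merged if t[0] == "myinst:id" and t[1] == "dc:title"]

print(len(pass_a) + len(pass_b), "statements in,", len(merged), "triples out")
```

Under this model the repeated type and title statements for myinst:id disappear, while the two distinct attributeOf triples are both kept.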
My question is whether this is a common scenario and whether there is a better method for handling it. Also, I'm only using cwm because that is what I already had. My concern is that the fact that CWM (or whatever tool) collapses these values doesn't seem to be an advertised "feature", and I'm not sure whether the implementation could change in the future. Again, I'm not picking on CWM...it's just an example of a tool. Does anyone have thoughts on whether it is an implied/expected feature of RDF tools [that perform merges] that they also collapse these literal values?

Thanks in advance,
Matt
Received on Tuesday, 29 April 2008 12:58:43 UTC