- From: Hugh Glaser <hg@ecs.soton.ac.uk>
- Date: Wed, 19 Jan 2011 11:42:11 +0000
- To: Steve Harris <steve.harris@garlik.com>
- CC: Ivan Mikhailov <imikhailov@openlinksw.com>, William Waites <ww@styx.org>, Harry Halpin <hhalpin@ibiblio.org>, Semantic Web <semantic-web@w3.org>
Hi Harry,

Still wondering what you want to do with it.

If all you want to do is find the sameAs equivalence classes inferred by lots of sameAs predicates, then that is what we do with the CRS system that supports http://sameas.org/ and all our other CRS stores. I think sameas.org currently has roughly 40M URIs in 10M equivalence classes from 70M triples. Many are only of size 2, but others are several hundred.

We could bring up a CRS for you (we do for others), or you could install it yourself.

If you want something more complicated, then I guess use something else.

Best
Hugh

Performance:
I don't think we have ever done any performance testing on the CRS - fast enough is good enough. But I have just done some quick measurements:

From a php script:
For http://data.semanticweb.org/person/harry-halpin, which returns 29 URIs, it takes about 0.25 ms.
For http://dbpedia.org/resource/Hal_Halpin, which only returns a 2-bundle, it takes around 0.1 ms.
For http://dblp.l3s.de/d2r/resource/authors/Harry_Halpin, which is a singleton, it takes around 0.05 ms.

From the host command line,
wget "http://sameas.org/n3?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FSoton"
takes about 22 ms, returning 83 URIs, but of course direct calls are much faster, since they avoid the HTTP, shell start-up and file-writing overheads.

I guess Ian did a pretty good job in optimising for high query rates - 10K+ queries/sec for the common cases seems quite good to me. (It runs on our 3 or 4 year old Dell server.)

By the way, as I recall, it was the awkwardness of dealing with the pair-wise nature of sameAs that was a strong influence on us building the CRS in the first place.

On 19 Jan 2011, at 09:03, Steve Harris wrote:

> FWIW, when building foaf.qdos.com, which does IFP closure and sameAs resolution over FOAF data, we did it live, in SPARQL queries. The algorithm is pretty simple, and like Ivan says it's often not practical to do it at import time.
>
> In the case of FOAF it's especially tricky as there's a lot of errors, and things connected which shouldn't be. We have a blacklist of known-bad IFPs, and some heuristics about when/when not to join things which are allegedly sameAs.
>
> Suppose you want to find the sameAs closure for <a>, <b> and <c>, you just do something like:
>
> SELECT DISTINCT ?o
> WHERE {
>   { ?s owl:sameAs ?o } UNION { ?o owl:sameAs ?s }
>   FILTER(?s = <a> || ?s = <b> || ?s = <c>)
>   FILTER(?o != <a> && ?o != <b> && ?o != <c>)
> }
>
> add the ?o's onto your equivalence set, and feed them back into the same query.
>
> Worst case is that it will require a number of queries equal to the diameter of the equivalence graph.
>
> If you just want a vanilla owl:sameAs then there's no real advantage, but if working with real world, messy data you might want some logic in there to ignore certain graphs, or URIs when resolving.
>
> If your SPARQL engine supports SPARQL 1.1 property paths efficiently, you can probably do it in a single query.
>
> - Steve
>
> On 2011-01-19, at 06:25, Ivan Mikhailov wrote:
>
>> William,
>>
>> Virtuoso Universal Server was extended recently, and now demonstrates up
>> to 170 times better speed for "inconvenient" inference cases, the
>> related patch is on its way to (coming soon) Virtuoso Open Source.
>> OTOH, this change has been made so late because most of queries reported
>> before were successfully tuned.
>>
>> Materialization of some transitive closure for sameAs could be nice, but
>> it is not realistic on big (esp. "compound") datasets like
>> lod.openlinksw.com or services.data.gov : too many regular updates.
>>
>> Best Regards,
>>
>> Ivan Mikhailov
>> OpenLink Software
>> http://virtuoso.openlinksw.com
>>
>> On Tue, 2011-01-18 at 22:30 +0100, William Waites wrote:
>>> * [2011-01-19 01:48:27 +0600] Ivan Mikhailov <imikhailov@openlinksw.com> écrit:
>>>
>>> ] Virtuoso deals with owl:sameAs in a scalable way, so you can try.
>>> ] Of course, a single chain 50 million connections long would cause
>>> ] problems, but more traditional cases should work fine. Google for
>>> ] "virtuoso owl:same-as input:inference" may be the fastest way to
>>> ] get more hints.
>>>
>>> Maybe I'm doing something wrong but in my experience
>>> Virtuoso's owl:sameAs handling is not great. For example
>>> in Bibliographica we have,
>>>
>>> foo a bibo:Book ;
>>>     dc:contributor [
>>>         foaf:name "Bob";
>>>         owl:sameAs <http://some/author>
>>>     ].
>>>
>>> When http://some/author is dereferenced it will first
>>> look for a graph with that name in the store. If it doesn't find
>>> one, it goes and asks the store for all triples that have that as
>>> a subject with sameAs processing turned on (would be nicer to
>>> have a bnode closure, actually). If there are many books that
>>> have contributor sameAs that (where many is maybe 50) the
>>> query takes too long and times out.
>>>
>>> At this stage I would not recommend using Virtuoso's sameAs
>>> processing and am going to materialise these graphs...
>>>
>>> As far as strategies for dealing with sameAs are concerned,
>>> I've been meaning to do some experiments regrouping them into
>>> a congruence closure or bundle as it's sometimes called, then
>>> doing things like migrating all properties from the leaves
>>> to the root. Some preprocessing that worked like that would
>>> bake a database structure that was much easier to work with
>>> instead of trying to solve things by implementing the formal
>>> definition directly (and recursively!).
>>>
>>> Cheers,
>>> -w
>>
>
> --
> Steve Harris, CTO, Garlik Limited
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203 http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

--
Hugh Glaser, Intelligence, Agents, Multimedia
School of Electronics and Computer Science,
University of Southampton,
Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 78 9422 3822, Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/
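
Steve's quoted procedure (run the query, add the new ?o values to the equivalence set, feed the set back in, stop when nothing new appears) can be sketched roughly as below. This is only an illustration in Python over an in-memory list of pairs, not the foaf.qdos.com code: the names one_round and sameas_closure are made up here, and one_round merely stands in for a real SPARQL request. Blacklists or other heuristics for messy data would go inside one_round.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # one (subject, object) owl:sameAs statement


def one_round(pairs: List[Pair], known: Set[str]) -> Set[str]:
    """Stand-in for one run of the quoted query (hypothetical helper).

    Returns every URI linked by sameAs, in either direction, to a
    member of `known`, excluding URIs that are already in `known`.
    """
    found: Set[str] = set()
    for s, o in pairs:
        if s in known and o not in known:
            found.add(o)
        if o in known and s not in known:
            found.add(s)
    return found


def sameas_closure(pairs: List[Pair], seeds: Iterable[str]) -> Tuple[Set[str], int]:
    """Repeat the query until it returns nothing new.

    Returns the equivalence set and the number of query rounds used,
    counting the final round that comes back empty.
    """
    known = set(seeds)
    rounds = 0
    while True:
        new = one_round(pairs, known)
        rounds += 1
        if not new:
            return known, rounds
        known |= new


if __name__ == "__main__":
    links = [("a", "b"), ("c", "b"), ("c", "d"), ("x", "y")]
    print(sameas_closure(links, {"a"}))
```

Each round extends the set by one hop, which is why the number of rounds is bounded by the diameter of the equivalence graph, as Steve notes.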
Received on Wednesday, 19 January 2011 11:43:27 UTC
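
For readers wondering what "finding the equivalence classes" from pair-wise sameAs statements involves, here is a minimal sketch using a union-find (disjoint-set) structure. It is an assumption-laden illustration only: the thread does not describe how the CRS is implemented, and the function name equivalence_classes is invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def equivalence_classes(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group URIs into bundles from pair-wise sameAs statements.

    Uses union-find with path compression; this is a generic technique,
    not a description of any system mentioned in the thread.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two bundles

    classes: Dict[str, Set[str]] = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    return list(classes.values())


if __name__ == "__main__":
    links = [("a", "b"), ("b", "c"), ("x", "y")]
    for bundle in equivalence_classes(links):
        print(sorted(bundle))
```

Once the bundles are built, looking up everything equivalent to a URI is a single lookup rather than a chain of pair-wise hops, which is the kind of preprocessing William describes wanting.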