Re: Reasoning over millions of triples - any working reasoners?

[Cor, this takes me back a couple of decades to when I was functional and parallel; now I am non-functional and convergent.]

In our case (and I suspect Harry's), it is not simply a question of the speed of DB querying (or assertion).
Other functions are required as well: for example, provenance for each assertion (which I think is what his research is about, analysing the communities of sameAs statements); rollback when an assertion is found to be wrong and other assertions have already been made on the basis of the incorrect information (this is not classical DB rollback); and deprecation of URIs.
We do actually use a DB for this, but it took some work to decide on schemas flexible and powerful enough to satisfy research requirements that might only appear later.
An RDF store actually provides this flexibility better, but when we started many years ago we moved from an RDF store to a DB, because at that stage we weren't confident of the query rate and scaling.
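
To make the rollback point a little more concrete: the essential requirement is that each assertion carries its provenance and the assertions it was derived from, so that retracting one wrong assertion can cascade to everything built on it. A rough sketch, in Python purely for illustration (the record layout and names here are invented for this message, not our actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Assertion:
        triple: tuple                 # (subject, predicate, object)
        source: str                   # provenance: who or what asserted it
        derived_from: list = field(default_factory=list)  # ids of supporting assertions
        retracted: bool = False

    def retract(store, assertion_id):
        # Non-classical "rollback": withdraw the assertion itself, then
        # recursively withdraw every assertion derived from it.
        # store is a dict mapping assertion ids to Assertion objects.
        store[assertion_id].retracted = True
        for aid, a in store.items():
            if assertion_id in a.derived_from and not a.retracted:
                retract(store, aid)

In a relational schema the same idea becomes a provenance column plus a derivation table that is walked at retraction time.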

I think Harry's message was simply asking if he could avoid having to do exactly what these messages are suggesting :-)
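
For what it is worth, the core of what the replies below suggest (collapsing the sameAs chains into equivalence classes) is essentially a union-find computation. A toy sketch, in Python purely for illustration (the input handling is an assumption on my part, and this is obviously neither WebPIE nor our own code):

    def find(parent, x):
        # Walk to the representative URI, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(parent, a, b):
        # Merge the two equivalence classes containing a and b.
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra

    def sameas_classes(pairs):
        # pairs: an iterable of (subject, object) URI pairs taken from
        # owl:sameAs triples, however they were parsed out of the n-triples.
        parent = {}
        for s, o in pairs:
            parent.setdefault(s, s)
            parent.setdefault(o, o)
            union(parent, s, o)
        # Group every URI under its representative to get the classes.
        classes = {}
        for uri in parent:
            classes.setdefault(find(parent, uri), set()).add(uri)
        return classes

At 50M statically-known triples that is well within reach of a single machine; as Sampo says below, the interesting engineering only starts when the data is huge and changing.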

Cheers

On 22 Jan 2011, at 01:45, Enrico Franconi wrote:

> I fully agree here: this is a classical DB/data-structure task which has several known solutions in the classical literature. The novel challenge here is to find a balance/tradeoff between the effectiveness of the index (its compression and access time) and its updatability. This is no "reasoning".
> cheers
> --e.
> 
> On 21 Jan 2011, at 23:02, Sampo Syreeni wrote:
> 
>> On 2011-01-18, Harry Halpin wrote:
>> 
>>> I've got a big bunch of owl:sameAs statements (about 50 million in n-triples) and I want to do some reasoning over them, i.e. look for chains of sameAs. Does anyone know of any reasoners that handle that amount of data?
>> 
>> I for one don't. But there is a whole bunch of literature in the database field on how to reduce such chains into minimal form efficiently. And if you just happen to have some 50M static triples, the problem ought to be pretty much trivial; the real problem only surfaces when you have tens of terabytes of data changing at some tens of megatriples per diem.
>> 
>> Personally, I'd go with compressed, virtual indices into the naming tree, coarse digital/splay trees as an index to that, distribute the whole thing, and then employ tree merging as the primary distribution primitive. That would conveniently bring in at least hundreds or low thousands of processors, even over a commodity network, with some efficiency.
>> 
>>> I believe there is an EU project on this (Larkc), but I can't get WebPIE working over this data-set for some reason; I'm working it through with them right now, but I'd like to know if there are any other large-scale reasoners.
>> 
>> Mashups like these aren't a general reasoning task, per se. They're a very common, special-purpose task that deserves its own code.
>> 
>>> Otherwise, I'll just have to write some giant hash-table thing myself in Perl, but I'd prefer to try to dogfood it :)
>> 
>> So I think it would actually be pretty nice if you wrote it up de novo. Just don't use Perl or hashes. Rather, use pure standard C with MPI as an option for full distribution of the algorithm. ;)
>> -- 
>> Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
>> +358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
>> 
> 
> 

-- 
Hugh Glaser,  
              Intelligence, Agents, Multimedia
              School of Electronics and Computer Science,
              University of Southampton,
              Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 78 9422 3822, Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/

Received on Saturday, 22 January 2011 11:36:32 UTC