Re: Knowledge graph toolkit from Pedro Szekely on 2020-05-04 (semantic-web@w3.org from May 2020)

From: Pedro Szekely <szekely@usc.edu>
Date: Sun, 3 May 2020 18:13:26 -0700
To: Adrian Gschwend <ml-ktk@netlabs.org>
Cc: semantic-web@w3.org
Message-Id: <76700024-D10A-4D99-ABF4-9ED05339321C@usc.edu>

This is a great discussion. The fact that the following query times out in Wikidata is a problem that makes folks skeptical that triple stores are for real:

#Count scholarly articles
SELECT (COUNT(?article) AS ?article_count)
WHERE 
{
  ?article wdt:P31 wd:Q13442814.
}
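
To reproduce this outside the web UI, here is a minimal Python sketch, not a definitive client. It assumes the public endpoint https://query.wikidata.org/sparql and spells out the wd:/wdt: prefixes that the query service normally predefines. The helper names (build_request, count_articles), the User-Agent string, and the 60 s client-side timeout are illustrative assumptions, not part of the query above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Wikidata Query Service SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with the prefixes written out explicitly.
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (COUNT(?article) AS ?article_count)
WHERE { ?article wdt:P31 wd:Q13442814. }
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for JSON-formatted SPARQL results."""
    url = endpoint + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical client identifier, for illustration only.
        "User-Agent": "kg-count-example/0.1 (illustrative)",
    }
    return Request(url, headers=headers)


def count_articles(timeout_s=60):
    """Run the count; raises an error if the service does not answer in time."""
    with urlopen(build_request(QUERY), timeout=timeout_s) as resp:
        data = json.load(resp)
    return int(data["results"]["bindings"][0]["article_count"]["value"])
```

Calling count_articles() performs the actual network request; build_request() alone only constructs the URL and headers, so it can be inspected offline.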

P

Pedro Szekely
Principal Scientist / USC Information Sciences Institute
Research Director / Center on Knowledge Graphs, USC/ISI
Research Associate Professor / USC Viterbi Computer Science Department
pedro szekely  <http://usc-isi-i2.github.io/szekely/>| kg center <http://usc-isi-i2.github.io/home/> | 562.889.3149




> On May 3, 2020, at 2:51 AM, Adrian Gschwend <ml-ktk@netlabs.org> wrote:
> 
> On 03.05.20 08:43, Amirouche Boubekki wrote:
> 
> [...]
>> offer. What is required is indeed a relational database like RDF
>> describes. But more than that, a modern AI system has to tackle
>> heterogeneous data types that do not blend nicely into the RDF
>> framework. I forgot to mention geometric data. I forgot to mention
>> strong ACID guarantees.
> I would say there is no other data model out there which can unify
> heterogeneous data types better than RDF. What does in your opinion "not
> blend nicely into the RDF framework"?
> 
>> It has to do with RDF in that people spread the idea that the RDF
>> framework is a go-to solution for semantic work. Except that it does
>> not provide a solution for:
>> 
>> - full text search
> 
> Nonsense. There is no standard API, but pretty much every triplestore I
> know provides that, see
> 
> https://github.com/w3c/sparql-12/issues/40
> 
> Just because it's not in current SPARQL spec does not mean it's not
> there at all. Also we do work on SPARQL 1.2, that's the beauty of open
> standards.
> 
>> - geometric search
> 
> https://www.ogc.org/standards/geosparql/
> 
> It's not a really well-written spec, but it has been there since 2011 and
> various stores implement it, for example Jena:
> 
> https://jena.apache.org/documentation/geosparql/
> 
>> - keyword suggestion (approximate string matching)
> 
> See all the Lucene-based full-text search implementations above.
> 
>> - historisation
> 
> There are a whole bunch of papers about versioning RDF from a research
> POV, I know that at least Stardog implements that in their product.
> 
> My colleague just recently wrote a versioned RDF store for distributed
> IoT devices so that's surely a solvable problem.
> 
> While I always thought I absolutely needed versioning, I noticed that in
> reality this is far less the case, because I often model the data as
> versioned in RDF directly, so there is no need for it at the store level.
> 
>> - ACID guarantees
> 
> Again, solvable. Stardog does this for example.
> (https://stardog.docs.apiary.io/#reference/managing-transactions)
> 
> OSS stacks have implementations as well, also there are discussions
> around transactions in the SPARQL 1.2 CWG:
> https://github.com/w3c/sparql-12/issues/83
> 
>> And probably others that I forget.
> 
> you seem to have decided that RDF is not for you, this is totally fine.
> 
> But YMMV, I think RDF is *the* stack to build KGs on and I have not been
> disappointed so far. If we miss something, we try to add it to the stack.
> 
>> Two things:
>> 
>> 1) For the record: money is not Science. Profitable does not
>> necessarily mean a Good Thing.
> 
> No disagreement here but how is that related to the scaling remark?
> 
>> 2) There is no publicly available project using publicly available
>> software that scales beyond 1TB.
> 
> What you want to say is "I am not aware of a publicly available project
> using publicly available software that scales beyond 1TB". Also, sorry to
> disappoint you:
> 
> https://de.slideshare.net/jervenbolleman/sparqluniprotorg-in-production-poster
> 
> That was 2017. UniProt has grown again since then; the latest number I
> have in mind is well above 50 billion triples.
> 
> For larger-scale Open Source RDF implementations you might want to consider:
> 
> https://cm-well.github.io/CM-Well/index.html
> 
> See for example the high-level architecture here:
> 
> https://cm-well.github.io/CM-Well/Introduction/Intro.CM-WellHigh-LevelArchitecture.html
> 
> If you think this is too complicated, please remember that UniProt runs
> on a single machine using Virtuoso.
> 
> There are a few other large-scale stores like Apache Rya, but I have not
> tried those yet.
> 
>> Indeed, when one asks me my advice about a _basic_ toolkit to do KG, I
>> recommend FDB, because it can handle all the cases previously
>> mentioned. I also do not forget to mention that it is a long
>> journey, especially if you want to be valid with regard to the RDF
>> standard.
> 
> That is a tooling question, and tooling got a lot better over the past
> years. But there is still work to do for sure, and we are working on that.
> 
>> As far as I am concerned RDF offers good guiding principles, but it
>> requires decades of study (much like compiler work) to grasp,
>> which is a bummer. It ought to be simpler, much simpler, and that is
>> what I am doing in my projects: taking the best of RDF and leaving
>> aside what is not necessary.
> 
> I disagree here, and I speak from experience. I do a lot of RDF teaching
> and once people understand the basics, they can be extremely productive
> with RDF.
> 
>> exists.  But I will not forsake advancement and innovation for the
>> purpose of backward compatibility with something that is so gigantic,
>> especially when something easier is possible.
> 
> Again that is fine if your use-cases are limited. We leverage the power
> of the RDF stack so "something easier" means "something less powerful".
> 
> regards
> 
> Adrian
>
Received on Monday, 4 May 2020 01:13:43 UTC