- From: Pedro Szekely <szekely@usc.edu>
- Date: Sun, 3 May 2020 18:13:26 -0700
- To: Adrian Gschwend <ml-ktk@netlabs.org>
- Cc: semantic-web@w3.org
- Message-Id: <76700024-D10A-4D99-ABF4-9ED05339321C@usc.edu>
This is a great discussion. The fact that the following query times out in Wikidata is a problem that makes folks skeptical that triple stores are for real:

#Count scholarly articles
SELECT (COUNT(?article) AS ?article_count)
WHERE {
  ?article wdt:P31 wd:Q13442814.
}

P

Pedro Szekely
Principal Scientist / USC Information Sciences Institute
Research Director / Center on Knowledge Graphs, USC/ISI
Research Associate Professor / USC Viterbi Computer Science Department
pedro szekely <http://usc-isi-i2.github.io/szekely/> | kg center <http://usc-isi-i2.github.io/home/> | 562.889.3149

> On May 3, 2020, at 2:51 AM, Adrian Gschwend <ml-ktk@netlabs.org> wrote:
>
> On 03.05.20 08:43, Amirouche Boubekki wrote:
>
> [...]
>> offer. What is required is indeed a relational database like RDF
>> describes. But more than that, a modern AI system has to tackle
>> heterogeneous data types that do not blend nicely into the RDF
>> framework. I forgot to mention geometric data. I forgot to mention
>> strong ACID guarantees.
> I would say there is no other data model out there which can unify
> heterogeneous data types better than RDF. What does in your opinion "not
> blend nicely into the RDF framework"?
>
>> It has to do with RDF with the fact that people spread the idea that
>> RDF framework is a go to solution to do semantic work. Except, it does
>> not provide a solution for:
>>
>> - full text search
>
> nonsense, there is no standard API but pretty much every triplestore I
> know provides that, see
>
> https://github.com/w3c/sparql-12/issues/40
>
> Just because it's not in current SPARQL spec does not mean it's not
> there at all. Also we do work on SPARQL 1.2, that's the beauty of open
> standards.
>
>> - geometric search
>
> https://www.ogc.org/standards/geosparql/
>
> It's not a really well-written spec but it's there since 2011 and
> various stores implement that, for example Jena:
>
> https://jena.apache.org/documentation/geosparql/
>
>> - keyword suggestion (approximate string matching)
>
> see all lucene based fulltext-search implementations above
>
>> - historisation
>
> There are a whole bunch of papers about versioning RDF from a research
> POV, I know that at least Stardog implements that in their product.
>
> My colleague just recently wrote a versioned RDF store for distributed
> IoT devices so that's surely a solvable problem.
>
> While I always thought I absolutely need versioning I noticed that in
> reality this is far less the case, because I often model the data
> versioned in RDF directly so no need to get that on store level.
>
>> - ACID guarantees
>
> Again, solvable. Stardog does this for example.
> (https://stardog.docs.apiary.io/#reference/managing-transactions)
>
> OSS stacks have implementations as well, also there are discussions
> around transactions in the SPARQL 1.2 CWG:
> https://github.com/w3c/sparql-12/issues/83
>
>> And probably others that I forget.
>
> you seem to have decided that RDF is not for you, this is totally fine.
>
> But YMMV, I think RDF is *the* stack to build KGs on and I have not been
> disappointed so far. If we miss something, we try to add it to the stack.
>
>> Two things:
>>
>> 1) For the record: money is not Science. Profitable does not
>> necessarily mean a Good Thing.
>
> No disagreement here but how is that related to the scaling remark?
>
>> 2) There is not publicly available project using publicly available
>> software that scale beyond 1TB.
>
> What you want to say is "I am not aware of a publicly available project
> using publicly available software that scale beyond 1TB". Also, sorry to
> disappoint you:
>
> https://de.slideshare.net/jervenbolleman/sparqluniprotorg-in-production-poster
>
> That was 2017, Uniprot again grew since then, latest number I have in
> mind is well above 50 billion triples.
>
> For larger-scale Open Source RDF implementations you might want to consider:
>
> https://cm-well.github.io/CM-Well/index.html
>
> See for example the high-level architecture here:
>
> https://cm-well.github.io/CM-Well/Introduction/Intro.CM-WellHigh-LevelArchitecture.html
>
> If you think this is too complicated please remember that Uniprot runs
> on a single machine using Virtuoso.
>
> There are a few other large-scale stores like Apache Rya but I did not
> try those yet.
>
>> Indeed, when one asks me my advice about a _basic_ toolkit to do KG, I
>> recommend FDB, because it can handle all the cases previously
>> mentioned. And also I do not forget to mention that it is a long
>> journey, especially if you want to be valid in the regard of RDF
>> standard.
>
> That is a tooling question and that got a lot better the past years. But
> still work to do for sure and we work on that.
>
>> As far as I am concerned RDF offers good guiding principles, but it
>> requires decades long of study (much like compiler work) to grasp
>> which is a bummer.
>> It ought to be simpler, much simpler and that is
>> what I am doing in my projects: taking the best of RDF and leaving
>> aside what is not necessary.
>
> I disagree here and I talk from experience. I do a lot of RDF teaching
> and once people understand the basics, they can be extremely productive
> with RDF.
>
>> exists. But I will not forsake advancement and innovation for the
>> purpose of backward compatibility with something that is so gigantic,
>> especially when something easier is possible.
>
> Again that is fine if your use-cases are limited. We leverage the power
> of the RDF stack so "something easier" means "something less powerful".
>
> regards
>
> Adrian
>
Received on Monday, 4 May 2020 01:13:43 UTC