- From: Paul Houle <ontology2@gmail.com>
- Date: Wed, 30 Apr 2014 13:04:33 -0400
- To: Melvin Carvalho <melvincarvalho@gmail.com>
- Cc: Luca Matteis <lmatteis@gmail.com>, Linked Data community <public-lod@w3.org>
I think also that RDF is not as inflexible as people think it is. For me the 'atom' of RDF is not the triple, but the node. In the node you get the global namespace of IRIs, the richness of the xsd data types, plus the ability to define your own data types. There's a pretty clear direction towards better XML integration as well, since you could add XML support similar to what is in relational databases like Microsoft SQL Server just by adding a few functions to SPARQL.

In the Cyc system, which was oriented towards very complex ontologies, logic and such, it was reported that 95% of the data was triples, so the system had special optimizations for triples. Generic databases like Freebase and DBpedia have an even higher fraction of instance information, so the case is even stronger for them. For the 5% or so of data that involves an arity greater than 3, there are blank nodes or things that are similar, such as the compound value types in Freebase, which are like blank nodes in some ways but have global names. It's a slight oversimplification to say that a triple is three nodes with set semantics, but if you want information that is tabular (practical for many purposes) there is the SPARQL result set format.

The best case for 'RDF sux' comes from serialization and deserialization overhead. I run $500 jobs in Amazon EMR where I think at least $300 of the cost comes from serialization/deserialization, and it's clear in this case that a compact and quickly parsable format can make the difference between something that works in a business sense and something that doesn't. If latency matters (i.e. you are GitHub and not Atlassian) and you are doing a hard-core SOA implementation http://martinfowler.com/articles/microservices.html then it may eventually dawn on you that your customers will feel the difference between binary serialization and something like JSON or Turtle. For one thing there is the parsing overhead (you can do a lot of FLOPS in the time it takes to parse a String to a float), and there's the issue that that kind of format replicates schema information in each record. There are a lot of challenges, though, in implementing a complex of services that use efficient serialization; one of them is that you can go crazy keeping the serialization system up to date with the data model and dealing with the consequences in your code. After I'd been thinking about this a lot I ran into Marc Hadfield at a conference, and he talked about http://vital.ai/ which has an efficient serialization system that uses OWL models. It generates Java code much like that of an ORM or OXM system, so he's got a good bus to feed RDF data to all kinds of subsystems.

The two linked data ideas that are most problematic, I think, are "public SPARQL endpoint" and "dereferencing", and these are for reasons fundamental to distributed systems. I'm building a triple store construction kit for AWS and have been writing about it:

https://github.com/paulhoule/infovore/wiki/Producing-a-:BaseKB-Triple-Store-Distribution

I built a distribution that looked pretty good, and the first query I wanted to run was

  select ?s count(?s) as ?cnt { ?s ?p ?o .} group by ?s order by desc(?cnt) limit 10;

It ran for a few minutes and then filled the disk with a temporary file. It turned out I had very little extra space on that machine, so it was easy to fill the disk. That query above is a time bomb, but it's also a query you need to run to characterize the database.
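(That's the shorthand form; for anyone following along with a different store, the strict SPARQL 1.1 spelling of the same query reads:

  SELECT ?s (COUNT(?s) AS ?cnt)   # count the facts per subject
  WHERE  { ?s ?p ?o }
  GROUP BY ?s
  ORDER BY DESC(?cnt)             # most-described subjects first
  LIMIT 10

Either way it forces a scan and aggregation over every triple in the store, which is exactly why it is a time bomb.)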
I can run my own database and take the risk of trashing it, and even give you an AMI that gives you the opportunity to do the same, but a public SPARQL store can't answer that query for you. The problem isn't with SPARQL per se; you'd have the same problem with SQL or MongoDB. The problem is that "practical" APIs like this http://developer.marvel.com/ have so many limitations connected with commercial considerations that it's hard to do anything interesting with them. However you define profit, a SPARQL endpoint certainly must cover its operating cost to persist. The model where you run your own in the cloud for maybe $0.25-$1.00 an hour could work, because the entrance cost is low but the user is paying for the hardware in a scalable way. I can produce a database for $5 or so, and it costs $2 a month to store it where you can get it.

As for dereferencing, that query above pulls up the 10 nodes which are the subjects of the most facts. The first one has 677802 facts. If I GET some URI by HTTP over the public network, it's a dicey proposition what will happen if I pull down that many facts. If the server doesn't throttle the request, it may be loaded heavily. Something could time out somewhere. Should the data arrive against all odds, the client could OOM. A small number of topics are affected by this, but they are heavily linked and will turn up again and again when people (or machines) do whatever it is that interests them.

Practically, though, I think of dereferencing as a special case of a "graph fragment server" (there's a sketch of such a request at the end of this message). For some of the work I do, the most attractive algorithms use a key-value store where the key is a subject and the value is an RDF graph. For a site like this http://ookaboo.com/o/pictures/ I think you'd want to present me a Linked Data URI from DBpedia, Freebase or anywhere and have me tell you my opinion about it, which in this case is all about what pictures illustrate the topic. Then you might want to go to a lot of other authorities and ask what they think about it. A least-cost dereferencing implementation based on cloud technology would be great, but the next step is "ask ?x about ?y".

Anyway, if you like high quality data files that are easy to use, a small contribution to https://www.gittip.com/paulhoule/ will pay my operating costs and fund the expansion of a data processing operation which is an order of magnitude more efficient than the typical Master Data Management operation.
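To make the graph fragment idea concrete: what a dereferencing GET amounts to, in SPARQL terms, is a by-subject CONSTRUCT, and a graph fragment server could bound the damage with an explicit cap rather than letting the 677802-fact subjects blow up the client. A minimal sketch, with a placeholder URI standing in for whatever topic was asked about:

  # fetch the fragment of the graph that describes one subject,
  # capped so a heavily-described topic can't flood the client
  CONSTRUCT { <http://example.org/topic> ?p ?o }
  WHERE     { <http://example.org/topic> ?p ?o }
  LIMIT 10000

Dereferencing over the public web is effectively the same request, minus any control over the cap or the endpoint.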
On Wed, Apr 30, 2014 at 6:37 AM, Melvin Carvalho <melvincarvalho@gmail.com> wrote:
>
> On 28 April 2014 17:23, Luca Matteis <lmatteis@gmail.com> wrote:
>>
>> The current Linked Data principles rely on specific standards and
>> protocols such as HTTP, URIs and RDF/SPARQL. Because I think it's
>> healthy to look at things from a different perspective, I was
>> wondering whether the same idea of a global interlinked database (LOD
>> cloud) was portrayed using other principles, perhaps based on
>> different protocols and mechanisms.
>
> If you look at the design principles behind Linked Data (timbl's or even
> Brian Carpenter's) you'll find something called the TOII -- the Test of
> Independent Invention.
>
> What that means is that if there were another system that had the same
> properties as the web, i.e. Universality, Tolerance, Modularity etc.,
> using URIs, it would be guaranteed to be interoperable with Linked Data.
> Linked Data is and isn't special. It isn't special in that it could be
> independently invented by an equally powerful system. It is special in
> that as a first mover (just as the web) it has the advantage of a wide
> network effect.
>
> Have a look at timbl's presentation on the TOII or at the Design Issues
> axioms and principles.
>
> http://www.w3.org/Talks/1998/0415-Evolvability/slide12-1.htm
>
>> Thanks,
>> Luca

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype    ontology2@gmail.com
Received on Wednesday, 30 April 2014 17:05:01 UTC