Re: Alternative Linked Data principles

From: Paul Houle <ontology2@gmail.com>
Date: Wed, 30 Apr 2014 13:04:33 -0400
Message-ID: <CAE__kdRGujrN0JfZgKi81hxf2G5KaHJpXT8DuGPWac33H33TJw@mail.gmail.com>
To: Melvin Carvalho <melvincarvalho@gmail.com>
Cc: Luca Matteis <lmatteis@gmail.com>, Linked Data community <public-lod@w3.org>
I also think RDF is not as inflexible as people think it is.

For me the 'atom' of RDF is not the triple but the node. In the
node you get the global namespace of IRIs, you get the richness of
the xsd datatypes, plus the ability to define your own datatypes.
There's a pretty clear direction towards better XML integration as
well, since you could add XML support similar to what is in
relational databases like Microsoft SQL Server just by adding a few
functions to SPARQL.
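As a sketch of what I mean by the node being the atom (plain Python, hypothetical class names, not any real RDF library): a node is either an IRI in the global namespace, or a literal paired with a datatype IRI, which can come from the xsd namespace or be one you define yourself.

```python
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema#"

@dataclass(frozen=True)
class IRINode:
    """A node named in the global namespace of IRIs."""
    iri: str

@dataclass(frozen=True)
class LiteralNode:
    """A literal value paired with a datatype IRI (xsd or user-defined)."""
    lexical: str
    datatype: str = XSD + "string"

# A built-in xsd datatype and a custom datatype live side by side:
height = LiteralNode("1.75", XSD + "double")
isbn = LiteralNode("978-0262533058", "http://example.org/dt#isbn13")
subject = IRINode("http://dbpedia.org/resource/Tim_Berners-Lee")
```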

In the Cyc system, which was oriented towards very complex
ontologies and logic, it was reported that 95% of the data was
triples, so the system had special optimizations for triples. Generic
databases like Freebase and DBpedia have an even higher fraction of
instance information, so the case is even stronger for them.

For the 5% of data that involves an arity greater than 3, there are
blank nodes or similar constructs, such as the compound value types
in Freebase, which are like blank nodes in some ways but have global
names.
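To make the blank-node encoding concrete, here is a sketch (plain Python, illustrative names only) of an arity-3 fact hung off an intermediate node, the way a blank node or a Freebase compound value type would do it:

```python
# Encoding an arity-3 fact ("employment": person, employer, start date)
# as ordinary triples via an intermediate node. A blank node has only a
# local label; a Freebase CVT would have a global name instead.
EX = "http://example.org/"

cvt = "_:employment1"
triples = [
    (EX + "alice", EX + "employment", cvt),
    (cvt, EX + "employer", EX + "acme"),
    (cvt, EX + "start_date", "2010-04-01"),
]

# All the pieces of the n-ary relationship hang off the intermediate node:
facts = [(p, o) for s, p, o in triples if s == cvt]
```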

It's a slight oversimplification to say that a triple is three nodes
with set semantics, but if you want information that is tabular
(practical for many purposes) there is the SPARQL result set format.
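For instance, the standard SPARQL 1.1 JSON results format flattens straight into rows; this sketch (stdlib only, made-up example data) shows the shape of the format and the tabular view you get from it:

```python
import json

# A SPARQL SELECT result in the SPARQL 1.1 Query Results JSON Format,
# flattened into plain tuples -- the tabular view mentioned above.
results_json = """
{
  "head": {"vars": ["s", "cnt"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://example.org/a"},
     "cnt": {"type": "literal", "value": "42",
             "datatype": "http://www.w3.org/2001/XMLSchema#integer"}}
  ]}
}
"""

doc = json.loads(results_json)
vars_ = doc["head"]["vars"]
rows = [
    tuple(b.get(v, {}).get("value") for v in vars_)
    for b in doc["results"]["bindings"]
]
# rows is now [("http://example.org/a", "42")]
```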

The best case for 'RDF sux' comes from serialization and
deserialization overhead. I run $500 jobs in Amazon EMR where
I think at least $300 of the cost comes from
serialization/deserialization, and it's clear in this case that a
compact and quickly parsable format can make the difference between
something that works in a business sense and something that doesn't.
If latency matters (i.e. you are GitHub and not Atlassian) and you are
doing a hard-core SOA implementation

http://martinfowler.com/articles/microservices.html

then it may eventually dawn on you that your customers will feel the
difference between binary serialization and something like JSON or
Turtle. For one thing there is the parsing overhead (you can do a lot
of FLOPS in the time it takes to parse a String to a float), and
there's the issue that those kinds of formats replicate schema
information in each record.
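Both costs are easy to see in miniature. This sketch (stdlib only, a made-up record layout) compares a JSON encoding, which repeats the key names in every record, against a fixed binary layout that stores only the values:

```python
import json
import struct

# Each JSON record repeats its schema (the key names); the binary
# layout below stores only the values. Record shape is hypothetical.
records = [{"subject_id": i, "score": i * 0.5} for i in range(1000)]

json_bytes = json.dumps(records).encode("utf-8")

# Fixed layout: one unsigned 32-bit int + one 64-bit float per record.
binary_bytes = b"".join(
    struct.pack("<Id", r["subject_id"], r["score"]) for r in records
)

# The schema-free binary form is several times smaller, and reading a
# value back with struct.unpack skips parsing a string into a float.
assert len(binary_bytes) < len(json_bytes)
```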

There are a lot of challenges, though, in implementing a complex of
services that use efficient serialization; one of them is that you
can go crazy keeping the serialization system in sync with the data
model and dealing with the consequences in your code.

After I'd been thinking about this a lot, I ran into Marc Hadfield at
a conference, and he talked about

http://vital.ai/

which has an efficient serialization system that uses OWL models. It
generates Java code much like that of an ORM or OXM system. Thus he's
got a good bus for feeding RDF data to all kinds of subsystems.

The two linked data ideas that I think are most problematic are
"public SPARQL endpoint" and "dereferencing", and these are for
reasons fundamental to distributed systems.

I'm building a triple store construction kit for AWS and have been writing about it:

https://github.com/paulhoule/infovore/wiki/Producing-a-:BaseKB-Triple-Store-Distribution

I built a distribution that looked pretty good and the first query I
wanted to run was

select ?s (count(?s) as ?cnt)
  { ?s ?p ?o .}
  group by ?s
  order by desc(?cnt)
  limit 10

It ran for a few minutes and then filled the disk with a temporary
file. It turned out I had very little free space on that machine, so
it was easy to fill the disk.

That query above is a time bomb, but it's also a query you need to
run to characterize the database. I can run my own database and take
the risk of trashing it, or even give you an AMI that gives you the
opportunity to do the same, but a public SPARQL store can't answer it
for you.
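The reason it's a time bomb is visible in a toy version of the aggregation (plain Python, hypothetical function name): before the first row can come back, the engine has to keep a count for every distinct subject, so working space scales with the database, not with the LIMIT.

```python
from collections import Counter

# Why the query is a time bomb: answering it means materializing a
# counter for every distinct subject before any row can be returned,
# so space is O(distinct subjects) no matter how small the LIMIT is.
def top_subjects(triples, limit=10):
    counts = Counter(s for s, p, o in triples)
    return counts.most_common(limit)

triples = [("s1", "p", "o1"), ("s1", "p", "o2"), ("s2", "p", "o1")]
# top_subjects(triples) -> [("s1", 2), ("s2", 1)]
```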

The problem isn't with SPARQL per se; you'd have the same problem
with SQL or mongodb. The problem is that "practical" APIs like this

http://developer.marvel.com/

have so many limitations connected with commercial considerations that
it's hard to do anything interesting with them.

However you define profit, a SPARQL endpoint certainly must cover its
operating cost to persist. The model where you run your own in a
cloud for maybe $0.25-$1.00 an hour could work, because the entrance
cost is low but the user is paying for the hardware in a scalable way.
I can produce a database for $5 or so, and it costs $2 a month to
store it where you can get it.

As for dereferencing, that query above pulls up the 10 nodes which
are the subjects of the most facts. The first one has 677,802 facts.

If I GET some URI by HTTP over the public network, it's a dicey
proposition what will happen if I pull down that many facts. If
the server doesn't throttle the request, the server may be loaded
heavily. Something could time out somewhere. Should the data arrive
against all odds, the client could OOM.
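One defensive client-side pattern, sketched here with made-up names, is to stream the response (N-Triples is line-oriented) and bail out at a cap, so an unexpectedly huge resource can't OOM the client:

```python
def take_triples(lines, max_facts=10_000):
    """Lazily parse N-Triples-style lines, stopping at a cap so an
    unexpectedly huge resource cannot exhaust client memory.
    (A toy parser: real N-Triples needs proper term tokenization.)"""
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        s, p, o = line.rstrip(" .").split(" ", 2)
        out.append((s, p, o))
        if len(out) >= max_facts:
            break  # bail out instead of buffering 677,802 facts
    return out
```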

A small number of topics are affected by this, but they are heavily
linked and will turn up again and again when people (or machines) do
whatever it is that interests them.

Practically, though, I think of dereferencing as a special case of a
"graph fragment server". For some of the work I do, the most
attractive algorithms use a key-value store where the key is a subject
and the value is an RDF graph. For a site like this

http://ookaboo.com/o/pictures/

I think you'd want to be able to present it a Linked Data URI from
DBpedia, Freebase or anywhere and have it tell you its opinion about
it, which in this case is all about what pictures illustrate the
topic. Then you might want to go to a lot of other authorities and
ask what they think about it.
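The key-value layout is simple enough to sketch (plain Python, hypothetical class and names): the key is a subject IRI and the value is the graph fragment about that subject, so dereferencing becomes a single lookup.

```python
from collections import defaultdict

# Sketch of a "graph fragment server" store:
# key = subject IRI, value = the RDF graph fragment about that subject.
class FragmentStore:
    def __init__(self):
        self._frags = defaultdict(list)

    def add(self, s, p, o):
        self._frags[s].append((s, p, o))

    def fragment(self, subject):
        """Dereferencing as a special case: one key lookup, one graph back."""
        return list(self._frags.get(subject, []))

store = FragmentStore()
store.add("http://example.org/topic", "http://example.org/depiction",
          "http://example.org/pic1.jpg")
```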

A least-cost dereferencing implementation based on cloud technology
would be great, but the next step is "ask ?x about ?y".

Anyway, if you like high quality data files that are easy to use, a
small contribution you make to this

https://www.gittip.com/paulhoule/

will pay my operating costs and fund the expansion of a data
processing operation which is an order of magnitude more efficient
than the typical Master Data Management operation.


On Wed, Apr 30, 2014 at 6:37 AM, Melvin Carvalho
<melvincarvalho@gmail.com> wrote:
>
>
>
> On 28 April 2014 17:23, Luca Matteis <lmatteis@gmail.com> wrote:
>>
>> The current Linked Data principles rely on specific standards and
>> protocols such as HTTP, URIs and RDF/SPARQL. Because I think it's
>> healthy to look at things from a different perspective, I was
>> wondering whether the same idea of a global interlinked database (LOD
>> cloud) was portrayed using other principles, perhaps based on
>> different protocols and mechanisms.
>
>
> If you look at the design principles behind Linked Data (timbl's or even
> Brian Carpenter's) you'll find something called the TOII -- Test of
> Independent Invention.
>
> What that means is if there were another system that had the same properties
> as the web, i.e. Universality, Tolerance, Modularity etc. using URIs, it would
> be guaranteed to be interoperable with Linked Data.  Linked data is and isn't
> special.  It isn't special in that it could be independently invented by an
> equally powerful system.  It is special in that as a first mover (just as
> the web) it has the advantage of a wide network effect.
>
> Have a look at timbl's presentation on the TOII or at design issues axioms
> and principles.
>
> http://www.w3.org/Talks/1998/0415-Evolvability/slide12-1.htm
>
>>
>>
>> Thanks,
>> Luca
>>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
Received on Wednesday, 30 April 2014 17:05:01 UTC