Re: RDF and its discontents

Hello Paul,

On Fri, 2010-07-02 at 11:07 -0400, Paul Houle wrote:
> Here are some of my thoughts

> (2) Yet,  the "big graph" and triple paradigms run into big problems
> when we try to build real systems.  There are two paradigms I work
> in:  (i) storing 'facts' in a database,  and (ii) processing 'facts'
> through pipelines that effectively do one or more "full scans" of
> data;  type (ii) processes can be highly scalable,  however,  when
> they can be parallelized.
> 
> 
> Now,  if hardware cost was no object,  I suppose I could keep triples
> in a huge distributed main-memory database.  Right now,  I can't
> afford that.  (If I get richer and if hardware gets cheaper,  I'll
> probably want to handle more data,  putting me back where I
> started...)

Huge and distributed means latency issues. My dream is a chassis of
blades interconnected by the shortest possible paths: a 1-inch radio
link from each CPU to its neighbour CPU, instead of the route
CPU-chipset-bus-card-card-bus-chipset-CPU.

Thus the number of facts per box is a critical issue and, as a
consequence, the number of bytes per fact is of paramount importance.
I'd recommend putting data into RDBMS tables and mapping them to RDF
wherever possible. I describe it as cutting dense rectangular pieces
out of an arbitrarily shaped cloud of RDF facts. The rest of the cloud
is poorly structured and may cause problems for gathering statistics and
query optimization, but it can weigh much less. So I totally agree with
your

> Practically,  I deal with this by building hybrid systems that combine
> both relational and RDF ideas.
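
A minimal sketch of the "dense rectangular piece" idea, assuming a
hypothetical "city" table, a made-up base IRI, and illustrative row
values (none of these come from the message above, and this is not how
any particular product implements its mapping). Each non-NULL cell of
the table becomes one triple; the data itself stays in compact rows.

```python
import sqlite3

# Hypothetical base IRI; chosen only for illustration.
BASE = "http://example.org/"


def rows_to_triples(conn, table, key, columns):
    """Yield one (subject, predicate, object) triple per non-NULL cell."""
    sql = f"SELECT {key}, {', '.join(columns)} FROM {table} ORDER BY {key}"
    for row in conn.execute(sql):
        subject = f"{BASE}{table}/{row[0]}"
        for column, value in zip(columns, row[1:]):
            if value is not None:  # a NULL cell simply yields no fact
                yield (subject, f"{BASE}{table}#{column}", value)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)"
)
# Illustrative rows only; the population figure is a placeholder.
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [(1, "Berlin", 3400000), (2, "Tokyo", None)],
)

triples = list(rows_to_triples(conn, "city", "id", ["name", "population"]))
for triple in triples:
    print(triple)
```

The table name and key are interpolated into the SQL text, so this
sketch is only suitable for trusted, hard-coded identifiers.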

BSBM can be used as the core of a benchmark for tools of this sort. Say,
BSBM + DBpedia data.

> (3) I've spoken to entrepreneurs and potential customers of semantic
> technology and found that,  right now,  people want things that are
> beyond the state of the art.  Often when I consult with people,  I
> come to the conclusion that they haven't found the boundaries of what
> they could accomplish through plain old "bag of words" and that it's
> not so clear they'll do better with NLP/semantic tech.  Commonly,
> these people have fundamental flaws in their business model (thinking
> that they can pay some Turk $0.20 cents to do $2000 worth of marketing
> work.)  The most common theme in semantic "product" companies is
> that they build complex systems out of components that just barely
> work.

I'd add that even when things work, they lack appropriate debugging
tools. It is said that a good programmer can write 100 lines of
production-quality code per day in his "favorite" domain and his
favorite IDE, whatever the language, so a rich, high-level language
means more functionality per working hour. However, when RDF+SPARQL
comes on the scene, the number of lines per day falls. The reasons
differ, from lack of docs/tutorials to internal errors, but the result
is the same: even if adding the new technology cuts costs, it cuts them
in an unpredictable way.

> (4) I'm one of the people who got interested in semantic tech because
> of DBPedia,  but yet,  I've also largely given up on DBPedia.  One day
> I realized that I could,  with Freebase,  do things in 20 minutes that
> would take 2 weeks of data cleanup with DBPedia.  DBPedia 3.5/3.5.1
> seems to be a large step backwards,  with major key integrity problems
> that are completely invisible to 'open world' and OWL-paradigm
> systems.  I've wound up writing my own framework for extracting
> 'facts' from wikipedia because DBPedia isn't interested in extracting
> the things I want.  Every time I try to do something with DBpedia,  I
> make shocking discoveries (for instance,  "New York City",  "Berlin",
> "Tokyo",  "Washington , D.C." and "Manchester, N.H." are not of
> rdf:type "City")  The fact that I see so little complaining about this
> on the mailing list seems to indicate that not a lot of people are
> trying to do real work with it.

That's why we're preparing infrastructure for "DBpedia Live". Continuous
updates mean quick fixes for bugs. It will be possible to prepare a
long list of "inconsistency check" queries like "list all cities that
have more than one mayor at the same moment in time" or "list all cities
whose geo coordinates are outside the countries they belong to".
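
As a rough illustration of the first check, here is a sketch in plain
Python rather than SPARQL. The fact layout (city, mayor, first year,
last year) and all sample names are assumptions made for this example;
real DBpedia data is shaped differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical facts: (city, mayor, first_year, last_year), inclusive.
mayor_terms = [
    ("CityA", "Alice", 2001, 2005),
    ("CityA", "Bob", 2004, 2008),   # overlaps Alice's term
    ("CityB", "Carol", 2000, 2004),
    ("CityB", "Dave", 2005, 2009),  # starts after Carol's term ends
]


def cities_with_concurrent_mayors(terms):
    """Return the sorted cities where two different mayors' terms overlap."""
    by_city = defaultdict(list)
    for city, mayor, start, end in terms:
        by_city[city].append((mayor, start, end))

    bad = set()
    for city, entries in by_city.items():
        for (m1, s1, e1), (m2, s2, e2) in combinations(entries, 2):
            # Two inclusive intervals overlap when each starts
            # no later than the other ends.
            if m1 != m2 and s1 <= e2 and s2 <= e1:
                bad.add(city)
    return sorted(bad)


conflicts = cities_with_concurrent_mayors(mayor_terms)
print(conflicts)
```

A list of such checks could be rerun after every update, so that a
regression in the extracted data shows up as a non-empty result.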

> (5) It might make me a heretic,  but I've found that the closed world
> assumption can,  properly used,  (i) do miracles,  and (ii) directly
> confront many of the practical problems that show up in RDF systems.

We're two heretics :) With "closed" SQL, a typo in a column name is just
a typo, fixed immediately after the compilation error. In "open" SPARQL,
a typo in a predicate is a typical cause of a long debugging loop. As a
consequence, debugging tools should be more sophisticated (but they are
not).
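
The contrast can be shown with a small sketch. It uses SQLite for the
"closed" side and a plain list of triples as a stand-in for an "open"
store; the table, the predicate names and the match() helper are all
hypothetical, and no real SPARQL engine is involved.

```python
import sqlite3

# "Closed" side: the schema is known, so a mistyped column is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, mayor TEXT)")
conn.execute("INSERT INTO city VALUES ('CityA', 'Alice')")

try:
    conn.execute("SELECT mayr FROM city")  # typo: "mayr" for "mayor"
    sql_error = None
except sqlite3.OperationalError as exc:
    sql_error = str(exc)  # the statement fails before returning any row

# "Open" side: any predicate is legal, so the same typo just matches nothing.
triples = [("CityA", "mayor", "Alice")]


def match(triples, predicate):
    """Return (subject, object) pairs for triples with the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


typo_result = match(triples, "mayr")  # no error is raised here
good_result = match(triples, "mayor")

print("SQL error:", sql_error)
print("typo result:", typo_result)
print("good result:", good_result)
```

The empty result from the mistyped predicate is indistinguishable from
"no such data exists", which is what makes the debugging loop long.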

> so I need something that has some of the nature of an RDMS schema.

That's why we pay so much attention to our RDF Views (moreover, our
"quad store" is just a "one-to-one" RDF view of a four-column table
with columns G,S,P,O, one row per quad). We can make a "jail" for a
query and signal a compile-time error if it refers to data not
listed in the metadata of the view.
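
A toy version of such a "jail", purely as an assumption-laden sketch:
the predicate set, the compile_patterns() function and the error class
are invented for this example and do not reflect how RDF Views are
actually implemented.

```python
class UnknownPredicateError(ValueError):
    """Raised when a query mentions a predicate the view does not declare."""


# Hypothetical view metadata: the only predicates this view exposes.
VIEW_PREDICATES = frozenset({"name", "mayor"})


def compile_patterns(patterns, allowed=VIEW_PREDICATES):
    """Check (subject, predicate, object) patterns before any data is read.

    Raises UnknownPredicateError if a predicate is not declared in the
    view; otherwise returns the patterns unchanged as a list.
    """
    patterns = list(patterns)
    unknown = sorted({p for _, p, _ in patterns if p not in allowed})
    if unknown:
        raise UnknownPredicateError(
            f"predicates not declared in view: {unknown}"
        )
    return patterns


ok = compile_patterns([("?city", "mayor", "?person")])

try:
    compile_patterns([("?city", "mayr", "?person")])  # typo is caught early
    jail_error = None
except UnknownPredicateError as exc:
    jail_error = str(exc)

print(ok)
print(jail_error)
```

With a declared vocabulary, the mistyped predicate fails at "compile
time" instead of silently producing an empty result.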



The oldest etiquette rule is "Mammoth should be eaten in parts, not as a
whole". Couldn't that be a slogan for semweb activity? Step by step, no
promises of silver bullets, with attention to benchmarks, legacy
interfaces/protocols, existing data sources etc.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

Received on Monday, 5 July 2010 15:50:55 UTC