RDF and its discontents

Here are some of my thoughts:

(1) The global namespace in RDF, plus the concept that "most knowledge
can be efficiently represented with triples," are brilliant; in the
long term we're going to see these two concepts diffuse into non-RDF
systems because they are so powerful.  I also appreciate the way RDF
handles multiple human languages -- although imperfect, it's a big
improvement over what I've had to do to implement multilingual
"digital libraries" on relational systems.

(2) Yet the "big graph" and triple paradigms run into big problems
when we try to build real systems.  There are two paradigms I work in:
(i) storing 'facts' in a database, and (ii) processing 'facts' through
pipelines that effectively do one or more "full scans" of the data;
type (ii) processes can be highly scalable, but only when they can be
parallelized.

Now, if hardware cost were no object, I suppose I could keep triples
in a huge distributed main-memory database.  Right now, I can't afford
that.  (If I get richer and hardware gets cheaper, I'll probably want
to handle more data, putting me back where I started...)

Today I can get 100x performance increases by physically partitioning
data in ways that reflect how I'm going to use it.  Relational
databases are highly mature at this, but RDF systems barely recognize
that there's an issue.  Named graphs are a step in the right
direction, but to make something that's really useful we'd need both
(a) the ability to do graph algebra, and (b) the ability to
automatically partition 'facts' into graphs.  That 'automatic' could
be something similar to RDBMS practice ("put this kind of predicate in
that graph," "put triples with this sort of subject in that graph") or
it could be something really 'intelligent' that can infer likely use
patterns by reasoning over the schema and/or by adaptive profiling of
actual use (as Salesforce.com does to build a pretty awesome OLTP
system on top of what is, at its core, a triple store).
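
To make that concrete, here is the kind of rule-driven partitioner I
have in mind, sketched in Python.  All the names and URIs are made up
for illustration; no RDF store I know of exposes an API like this:

    # Hypothetical rule table: route each triple to a named graph based
    # on its predicate, or failing that, on the rdf:type of its subject.
    PREDICATE_RULES = {
        "http://example.org/schema/population": "urn:graph:demographics",
        "http://example.org/schema/abstract":   "urn:graph:text",
    }
    SUBJECT_TYPE_RULES = {
        "http://example.org/schema/City": "urn:graph:places",
    }

    def graph_for(triple, types_of):
        """Pick a named graph for (s, p, o).  'types_of' maps a subject
        to the collection of its rdf:type values."""
        s, p, o = triple
        if p in PREDICATE_RULES:
            return PREDICATE_RULES[p]
        for t in types_of(s):
            if t in SUBJECT_TYPE_RULES:
                return SUBJECT_TYPE_RULES[t]
        return "urn:graph:default"

The point is that the rules are declarative, so the 'intelligent'
version could generate the same table by reasoning over the schema or
by profiling actual queries.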

Practically, I deal with this by building hybrid systems that combine
relational and RDF ideas.  If you're really trying to get things done
in this space, however, it's amazing how precarious the tools are.
For instance, I looked at a large number of data stores and wound up
choosing MySQL based on two fairly accidental facts: (i) I couldn't
get VARCHAR() or TEXT() fields in other RDBMS products to handle the
full length of Freebase text fields in an indexable way, and (ii)
MongoDB crashes and corrupts data.
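
To illustrate what I mean by 'indexable': MySQL lets you index a
fixed-length prefix of a TEXT column, which keeps long values under
the engine's key-length limits.  A rough sketch in Python -- the table
layout and connection details are invented, and any DB-API driver
would do:

    import MySQLdb  # assuming the MySQLdb driver

    conn = MySQLdb.connect(host="localhost", user="me",
                           passwd="...", db="facts")
    cur = conn.cursor()
    # The value(255) clause indexes only the first 255 characters of
    # the TEXT column; the full value can be arbitrarily long.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS freebase_text (
            subject  VARCHAR(255) NOT NULL,
            property VARCHAR(255) NOT NULL,
            value    LONGTEXT     NOT NULL,
            KEY subj_idx (subject),
            KEY value_prefix_idx (value(255))
        ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    """)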

As for the linear pipelines, the big issue I have is that I want to
process "facts" as complete chunks; everything needed for one
particular bit of processing has to get routed to the right pipeline.
If it takes four triples involving a bnode to represent a 'fact', all
four need to go to the same physical node.  As in the database case,
partitioning of data becomes a critical issue, but here it matters
even more, because the partition a particular triple falls into may be
determined by the graph that surrounds it.  That points to a
representation where we either (a) develop some mechanism for
efficiently representing subgraphs of related triples, or (b) just
give up on the whole triple thing and use something like a relational
or JSON model to represent facts.
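
Here is option (a) in miniature: a single-pass grouper, in Python,
that keeps every triple touching a given bnode in the same chunk and
then hashes the chunk key to pick a node.  Pure illustration -- it
assumes the triple that introduces a bnode (as an object) arrives
before the triples where that bnode is the subject:

    from zlib import crc32

    def is_bnode(term):
        return term.startswith("_:")

    def fact_chunks(triples):
        """Group triples so everything hanging off a bnode stays in
        the chunk of the subject that introduced it."""
        chunks = {}  # chunk key -> list of triples
        owner = {}   # bnode -> chunk key that owns it
        for s, p, o in triples:
            key = owner.get(s) or owner.get(o) or s
            chunks.setdefault(key, []).append((s, p, o))
            for term in (s, o):
                if is_bnode(term):
                    owner.setdefault(term, key)
        return chunks

    def partition_of(key, n_partitions):
        # The whole chunk goes to one node; a 'fact' is never split.
        return crc32(key.encode("utf8")) % n_partitions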

(3) I've spoken to entrepreneurs and potential customers of semantic
technology and found that, right now, people want things that are
beyond the state of the art.  Often when I consult with people, I come
to the conclusion that they haven't yet found the boundaries of what
they could accomplish with plain old "bag of words," and that it's not
at all clear they'll do better with NLP/semantic tech.  Commonly,
these people have fundamental flaws in their business model (thinking
they can pay some Turk $0.20 to do $2,000 worth of marketing work).
The most common theme in semantic "product" companies is that they
build complex systems out of components that just barely work.

I'll single out Zemanta for this, although it's true of many other
companies.  Let's estimate that Zemanta's service has 5 components and
each of them is 85% accurate; put those together, and you've got a
system that's just an embarrassment.  There are multiple routes to
solving this problem (either a "widening of the scope" or a "narrowing
of the scope" could help a lot), but the fact is that a lot of
companies are aiming for that "sour spot," which has the paradoxical
dual effect that (i) some imitate them, and (ii) others write off the
whole semantic space.  Success in semantic technology is going to come
from companies that find fortuitous matches between "what's possible"
and "what can be sold."
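
(To put a number on that: if the five stages are independent, the
whole chain is right only about 0.85^5 ~= 44% of the time -- worse
than a coin flip.)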

Another spectre that haunts the space is legacy "information services"
companies.  I've talked with many people who think they're going to
make big money selling a crappy product to undiscriminating customers
with deep pockets (U.S. Government, Finance, Pharma, ...).  I think
the actual breakthroughs in semantic tech are going to come from the
disruptive direction: people who find ways to make things drastically
cheaper than the old way while living within the limitations of
today's semantic tech.

(4) I'm one of the people who got interested in semantic tech because
of DBpedia, and yet I've largely given up on it.  One day I realized
that, with Freebase, I could do things in 20 minutes that would take 2
weeks of data cleanup with DBpedia.  DBpedia 3.5/3.5.1 seems to be a
large step backwards, with major key-integrity problems that are
completely invisible to 'open world' and OWL-paradigm systems.  I've
wound up writing my own framework for extracting 'facts' from
Wikipedia because the DBpedia project isn't interested in extracting
the things I want.  Every time I try to do something with DBpedia, I
make shocking discoveries (for instance, "New York City," "Berlin,"
"Tokyo," "Washington, D.C." and "Manchester, N.H." are not of rdf:type
"City").  The fact that I see so little complaining about this on the
mailing list seems to indicate that not a lot of people are trying to
do real work with it.
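
In case anyone wants to verify that, it's a one-liner against the
public endpoint.  This sketch assumes Python with the SPARQLWrapper
library, and the http://dbpedia.org/ontology/ namespace DBpedia was
using at the time:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        ASK {
            <http://dbpedia.org/resource/New_York_City>
                a <http://dbpedia.org/ontology/City> .
        }
    """)
    sparql.setReturnFormat(JSON)
    # Prints False when the typing is missing.
    print(sparql.query().convert()["boolean"])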

(5) It might make me a heretic, but I've found that the closed world
assumption, properly used, can (i) work miracles, and (ii) directly
confront many of the practical problems that show up in RDF systems.
OWL has greatly changed my thinking about schemas...  I'm less
concerned, however, with the official semantics of OWL than with the
general prospect of "reasoning about schemas."  I think the
"inference-based" model of OWL is awesome, but pretty frequently I
find I need forms of reasoning that aren't quite supported by OWL...
At the same time, data partitioning and data validation are really
important to me, so I need something that has some of the nature of an
RDBMS schema.  Of course, I can get some of this by "applying my own
hermeneutics" to OWL and adding some features.
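
To give a flavor of what I mean: under OWL's open-world semantics, a
missing required value is never an error (the reasoner just assumes it
exists somewhere you haven't looked), while a closed-world validator
can flag it outright.  A minimal sketch in Python, with a made-up
constraint table:

    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

    # Closed-world rule: every subject of a given rdf:type must have
    # exactly one value for each listed property.
    REQUIRED = {
        "http://example.org/schema/City": [
            "http://example.org/schema/population",
        ],
    }

    def validate(triples):
        by_subject = {}
        for s, p, o in triples:
            by_subject.setdefault(s, {}).setdefault(p, []).append(o)
        errors = []
        for s, props in by_subject.items():
            for t in props.get(RDF_TYPE, []):
                for req in REQUIRED.get(t, []):
                    n = len(props.get(req, []))
                    if n != 1:
                        # Absent or duplicated: a violation here, but
                        # merely 'unknown' to an open-world reasoner.
                        errors.append((s, req, n))
        return errors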
