Re: RDF and its discontents

Paul, please keep these thoughts coming. I have a couple of followups,  
inline below.

On Jul 2, 2010, at 10:07 AM, Paul Houle wrote:

> Here are some of my thoughts:
>
> (1) The global namespace in RDF plus the concept that "most  
> knowledge can be efficiently represented with triples" are  
> brilliant;  in the long term we're going to see these two concepts  
> diffuse into non-RDF systems because they are so powerful.

FWIW, the reduction-to-triples idea has been around and known in the  
logic community since about 1880, so it is indeed powerful. It has its  
own issues, though (as some of your later comments suggest).

>  I appreciate the way multiple languages are implemented in RDF --  
> although imperfect,  it's a big improvement over what I've had to do  
> to implement multi-lingual "digital libraries" on relational systems.
>
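
For concreteness, here is a minimal sketch of the multilingual
mechanism being praised here: language-tagged literals, shown with
Python and rdflib (the URIs are invented for the example):

    # One property, many languages: language-tagged literals need no
    # extra columns or join tables, unlike a typical relational design.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Munich, RDFS.label, Literal("Munich", lang="en")))
    g.add((EX.Munich, RDFS.label, Literal("München", lang="de")))
    g.add((EX.Munich, RDFS.label, Literal("Monaco di Baviera", lang="it")))

    # Pick out the label in a requested language.
    labels = {lit.language: str(lit) for lit in g.objects(EX.Munich, RDFS.label)}
    print(labels.get("de"))   # -> München
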
> (2) Yet, the "big graph" and triple paradigms run into big problems
> when we try to build real systems.  There are two paradigms I work
> in: (i) storing 'facts' in a database, and (ii) processing 'facts'
> through pipelines that effectively do one or more "full scans" of
> data; type (ii) processes, however, can be highly scalable when
> they can be parallelized.
>
> Now, if hardware cost were no object, I suppose I could keep
> triples in a huge distributed main-memory database.  Right now,  I  
> can't afford that.  (If I get richer and if hardware gets cheaper,   
> I'll probably want to handle more data,  putting me back where I  
> started...)

Well, hardware will get cheaper. Especially fast memory. Care to  
extrapolate, say, five years to guess which will win, data bloat or  
RAM capacity?

> Today I can get 100x performance increases by physically  
> partitioning data in ways that reflect the way I'm going to use it.   
> Relational databases are highly mature at this,  but RDF systems  
> barely recognize that there's an issue.  Named graphs are a step  
> forward in this direction,  but to make something that's really  
> useful we'd need both (a) the ability to do graph algebra,  and (b)  
> the ability to automatically partition 'facts' into graphs.  That  
> 'automatic' could be something similar to RDBMS practice ("put this  
> kind of predicate in that graph",  "put triples with this sort of  
> subject in that graph") or it could be something really  
> 'intelligent',  that can infer likely use patterns by reasoning over  
> the schema and/or by adaptive profiling of actual use (as  
> Salesforce.com does to build a pretty awesome OLTP system on top of
> what's a triple store at the core).
>
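
For what it's worth, the "put this kind of predicate in that graph"
policy is easy to prototype; a minimal sketch with rdflib's
ConjunctiveGraph (the routing table and URIs are invented):

    # Route each triple into a named graph chosen by its predicate --
    # the "put this kind of predicate in that graph" policy.
    from rdflib import ConjunctiveGraph, URIRef

    store = ConjunctiveGraph()

    ROUTES = {
        URIRef("http://example.org/label"):      URIRef("http://example.org/g/labels"),
        URIRef("http://example.org/population"): URIRef("http://example.org/g/stats"),
    }
    DEFAULT_GRAPH = URIRef("http://example.org/g/misc")

    def add_fact(s, p, o):
        target = ROUTES.get(p, DEFAULT_GRAPH)
        store.get_context(target).add((s, p, o))

    # Queries that only need labels can now scan one small graph
    # instead of the whole store.
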
> Practically,  I deal with this by building hybrid systems that  
> combine both relational and RDF ideas.  If you're really trying to  
> get things done in this space,  however,  it's amazing how  
> precarious the tools are.  For instance,  I looked at a large number  
> of data stores and wound up choosing MySQL based on two fairly  
> accidental facts:  (i) I couldn't get VARCHAR() or TEXT() fields in  
> other RDBMSs to handle the full length of Freebase text fields
> in an indexable way, and (ii) MongoDB crashes and corrupts data.

All great observations :-)

>
> As for the linear pipelines,  the big issue I have is that I want to  
> process "facts" as complete chunks;  everything needed for one  
> particular bit of processing needs to get routed to the right  
> pipeline.  If it takes four triples involving a bnode to represent a  
> 'fact',  these all need to go to the same physical node.

This shows the problems with the triple model. Suppose we allowed
arbitrary-length tuples à la JSON, so each 'fact' is a single tuple.
Would this make things easier? BTW, you might find the idea of an RDF  
molecule useful.
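
To make the contrast concrete, here is the same invented 'fact' both
ways: as bnode-glued triples that must travel together, and as one
self-contained JSON-style record (all names are made up):

    # An n-ary 'fact' in triples: four statements hanging off one
    # blank node, all of which must land on the same physical node.
    from rdflib import Graph, BNode, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    event = BNode()
    g.add((EX.Paul, EX.employment, event))
    g.add((event, EX.employer, EX.Acme))
    g.add((event, EX.role, Literal("engineer")))
    g.add((event, EX.since, Literal("2008")))

    # The same fact as a single arbitrary-length tuple, JSON-style:
    # trivially routable as one unit through a pipeline.
    fact = {"person": "Paul", "employer": "Acme",
            "role": "engineer", "since": "2008"}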

> As in the database case, partitioning of data becomes a critical
> issue, but here it matters even more that the partition a
> particular triple falls into might be determined by the graph that
> surrounds it, which kind of points to a representation where we
> (a) develop some mechanism for efficiently representing subgraphs
> of related triples, or (b) just give up on the whole triple thing
> and use something like a relational or JSON model to represent facts.

Ah, I see you are thinking along the same lines. This is the more
traditional model in any case, actually.
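
One crude way to read option (a): ship each subject together with the
bnode-connected subgraph around it. The sketch below is only the
intuition behind the "RDF molecule" idea, not the formal decomposition
from the literature:

    # Collect the subgraph reachable from a subject by following blank
    # nodes, so that bnode-glued facts stay together when partitioning.
    from rdflib import BNode

    def molecule(g, subject):
        seen, todo, triples = set(), [subject], []
        while todo:
            node = todo.pop()
            if node in seen:
                continue
            seen.add(node)
            for s, p, o in g.triples((node, None, None)):
                triples.append((s, p, o))
                if isinstance(o, BNode):
                    todo.append(o)   # follow bnodes only, not named nodes
        return triples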

>
> (3) I've spoken to entrepreneurs and potential customers of semantic  
> technology and found that,  right now,  people want things that are  
> beyond the state of the art.  Often when I consult with people,  I  
> come to the conclusion that they haven't found the boundaries of  
> what they could accomplish through plain old "bag of words" and that  
> it's not so clear they'll do better with NLP/semantic tech.   
> Commonly, these people have fundamental flaws in their business
> model (thinking that they can pay some Turk $0.20 to do $2000
> worth of marketing work).  The most common theme in semantic
> "product" companies is that they build complex systems out of
> components that just barely work.
>
> I'll single out Zemanta for this,  although this is true of many  
> other companies.  Let's just estimate that Zemanta's service has 5  
> components and each of these is 85% accurate;  put those together,   
> and you've got a system that's just an embarrassment.  There are  
> multiple routes to solving this problem (either a "widening of the
> scope" or a "narrowing of the scope" could help a lot), but the fact
> is that a lot of companies are aiming for that "sour spot", which has
> the paradoxical dual effects that (i) some others imitate them,
> and (ii) others write off the whole semantic space.  Success in
> semantic technology is going to come from companies that find
> fortuitous matches between "what's possible" and "what can be sold".

Brilliant!
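
The arithmetic behind that estimate is worth spelling out, assuming
the five components fail independently (the simplest model):

    # Five stages at 85% accuracy each, errors compounding
    # independently: under half the inputs survive end to end.
    print(0.85 ** 5)   # 0.4437... -- roughly 44% overall accuracy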

> Another spectre that haunts the space is legacy "information  
> services" companies.  I've talked with many people who think they're  
> going to make big money selling a crappy product to undiscriminating  
> customers with deep pockets (U.S. Government,  Finance,   
> Pharma, ...)  I think the actual breakthroughs in semantic tech are  
> going to come from the disruptive direction:  people who find ways  
> to make things that are drastically cheaper than the old way,  but  
> that can accept the limitations of today's semantic tech.
>
> (4) I'm one of the people who got interested in semantic tech  
> because of DBpedia, and yet I've also largely given up on
> DBpedia.  One day I realized that I could, with Freebase, do
> things in 20 minutes that would take 2 weeks of data cleanup with
> DBpedia.  DBpedia 3.5/3.5.1 seems to be a large step backwards,
> with major key integrity problems that are completely invisible to
> 'open world' and OWL-paradigm systems.  I've wound up writing my own
> framework for extracting 'facts' from Wikipedia because DBpedia
> isn't interested in extracting the things I want.  Every time I try
> to do something with DBpedia, I make shocking discoveries (for
> instance, "New York City", "Berlin", "Tokyo", "Washington,
> D.C." and "Manchester, N.H." are not of rdf:type "City").  The fact
> that I see so little complaining about this on the mailing list
> seems to indicate that not a lot of people are trying to do real
> work with it.
>
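
This sort of complaint is easy to check against the public endpoint;
a minimal sketch with Python's SPARQLWrapper (what comes back will of
course vary with the DBpedia release loaded there):

    # List the rdf:type assertions DBpedia actually carries for Berlin.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?type WHERE { <http://dbpedia.org/resource/Berlin> a ?type . }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["type"]["value"])
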
> (5) It might make me a heretic,  but I've found that the closed  
> world assumption can,  properly used,  (i) do miracles,  and (ii)  
> directly confront many of the practical problems that show up in RDF  
> systems.

Indeed. What we need, it's been clear for some time, is a globally
open world with many smaller closed worlds inside it. But this needs
a whole scheme/mechanism for saying what the boundaries of these
smaller closed worlds are, and what it is that they enclose, which has
not been done and isn't likely to get done given the very conservative
climate that we seem to be in right now.
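
For concreteness, one crude way to get a small closed world today is
to scope negation-as-failure to a single graph that is declared, out
of band, to be complete. A minimal sketch with invented URIs:

    # Treat one graph as complete for a predicate: inside it, absence
    # means absence, so we can flag subjects missing a required value.
    from rdflib import Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")

    def missing_required(g, cls, required_pred):
        return [s for s in g.subjects(RDF.type, cls)
                if (s, required_pred, None) not in g]

    # e.g. missing_required(city_graph, EX.City, EX.population)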

>  OWL has greatly changed my thinking about schemas...  I'm less  
> concerned,  however,   about the official semantics of OWL,  than I  
> am about the general prospect of "reasoning about schemas."  I think  
> the "inference-based" model of OWL is awesome,  but pretty  
> frequently I find I need forms of reasoning that aren't quite  
> supported by OWL...

Can you give me any examples? I am trying to collect real-world but  
currently unsupported inference patterns, with the long-term goal of  
reengineering the semantics to make it fit what people think it ought  
to mean. So this is gold, for me.
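
Not one of Paul's examples (he doesn't give any here), but for readers
the standard illustration of the "inference-based" model: under
RDFS/OWL semantics a domain axiom never rejects data, it adds a
conclusion. A sketch using rdflib plus the owlrl reasoner package:

    # A domain axiom: an RDBMS-style schema would *reject* a mayor
    # that isn't a Person; RDFS/OWL instead *infers* that it is one.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF
    import owlrl

    EX = Namespace("http://example.org/")
    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
        ex:mayorOf rdfs:domain ex:Person .
        ex:Rex ex:mayorOf ex:Dogtown .
    """, format="turtle")

    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
    print((EX.Rex, RDF.type, EX.Person) in g)   # True: inferred, not rejected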

>  Alternately, data partitioning and data validation are really
> important for me, so I need something that has some of the nature
> of an RDBMS schema.  Of course, I can get some of this by "applying
> my own hermeneutics" to OWL and adding some features.

Again, details would be wonderful.
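
To hazard a guess at the general shape meant here (an invented sketch,
not Paul's actual framework): an RDBMS-style schema checked
closed-world over the data, with something like NOT NULL and a
per-predicate type constraint:

    # An RDBMS-flavored, closed-world schema check over RDF instances.
    from rdflib import Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")

    # "Table definition": each required predicate and the kind of
    # value it must hold.
    CITY_SCHEMA = {EX.label: Literal, EX.population: Literal}

    def validate(g, cls, schema):
        errors = []
        for s in g.subjects(RDF.type, cls):
            for pred, kind in schema.items():
                values = list(g.objects(s, pred))
                if not values:
                    errors.append((s, pred, "missing"))     # NOT NULL
                elif not all(isinstance(v, kind) for v in values):
                    errors.append((s, pred, "wrong kind"))  # type check
        return errors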

Pat Hayes


------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Friday, 2 July 2010 16:11:05 UTC