- From: Paul Houle <ontology2@gmail.com>
- Date: Fri, 2 Jul 2010 11:07:41 -0400
- To: Linked Data community <public-lod@w3.org>
- Message-ID: <AANLkTinVqT1tX3w2ofoLuy1-NhJxfEMTUKEVAsEDYxDT@mail.gmail.com>
Here are some of my thoughts:

(1) The global namespace in RDF plus the concept that "most knowledge can be efficiently represented with triples" are brilliant; in the long term we're going to see these two concepts diffuse into non-RDF systems because they are so powerful. I appreciate the way multiple languages are handled in RDF -- although imperfect, it's a big improvement over what I've had to do to implement multi-lingual "digital libraries" on relational systems.

(2) Yet the "big graph" and triple paradigms run into big problems when we try to build real systems. There are two paradigms I work in: (i) storing 'facts' in a database, and (ii) processing 'facts' through pipelines that effectively do one or more "full scans" of the data; type (ii) processes can be highly scalable, provided they can be parallelized.

Now, if hardware cost were no object, I suppose I could keep triples in a huge distributed main-memory database. Right now, I can't afford that. (If I get richer and hardware gets cheaper, I'll probably want to handle more data, putting me back where I started...) Today I can get 100x performance increases by physically partitioning data in ways that reflect how I'm going to use it. Relational databases are highly mature at this, but RDF systems barely recognize that there's an issue. Named graphs are a step forward in this direction, but to make something that's really useful we'd need both (a) the ability to do graph algebra, and (b) the ability to automatically partition 'facts' into graphs. That 'automatic' could be something similar to RDBMS practice ("put this kind of predicate in that graph", "put triples with this sort of subject in that graph") or it could be something really 'intelligent' that can infer likely use patterns by reasoning over the schema and/or by adaptive profiling of actual use (as Salesforce.com does to build a pretty awesome OLTP system on top of what's a triple store at the core). The first sketch at the end of this point shows the rule-based version I mean.

Practically, I deal with this by building hybrid systems that combine both relational and RDF ideas. If you're really trying to get things done in this space, however, it's amazing how precarious the tools are. For instance, I looked at a large number of data stores and wound up choosing MySQL based on two fairly accidental facts: (i) I couldn't get VARCHAR() or TEXT() fields in other RDBMS systems to handle the full length of Freebase text fields in an indexable way, and (ii) MongoDB crashes and corrupts data.

As for the linear pipelines, the big issue I have is that I want to process "facts" as complete chunks; everything needed for one particular bit of processing needs to get routed to the right pipeline. If it takes four triples involving a bnode to represent a 'fact', these all need to go to the same physical node. As in the database case, partitioning of data becomes a critical issue, but here it matters even more because the partition a particular triple falls into might be determined by the graph that surrounds it. That kind of points to a representation where we (a) develop some mechanism for efficiently representing subgraphs of related triples, or (b) just give up on the whole triple thing and use something like a relational or JSON model to represent facts. The second sketch at the end of this point shows the kind of grouping I mean.
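To make the "automatic partitioning" idea concrete, here is a first, minimal sketch in Python of the rule-based version. Everything in it is invented for this note -- the rule tables, the graph names, and the route_triple()/to_quad() helpers -- and no triple store I know of offers this out of the box; it just illustrates the shape of the feature I'm asking for.

# Hypothetical rule tables; a real deployment would load these from
# configuration, the way an RDBMS loads its partitioning scheme.
PREDICATE_RULES = {
    "http://xmlns.com/foaf/0.1/name": "urn:graph:labels",
    "http://dbpedia.org/ontology/populationTotal": "urn:graph:statistics",
}
SUBJECT_PREFIX_RULES = {
    "http://dbpedia.org/resource/": "urn:graph:dbpedia",
}
DEFAULT_GRAPH = "urn:graph:misc"

def route_triple(subject, predicate, obj):
    """Pick the named graph a triple should be stored in."""
    if predicate in PREDICATE_RULES:
        return PREDICATE_RULES[predicate]
    for prefix, graph in SUBJECT_PREFIX_RULES.items():
        if subject.startswith(prefix):
            return graph
    return DEFAULT_GRAPH

def to_quad(triple):
    """Turn a triple into a quad whose fourth element is its partition."""
    s, p, o = triple
    return (s, p, o, route_triple(s, p, o))

# Example (made-up data): population triples land in urn:graph:statistics.
print(to_quad(("http://dbpedia.org/resource/Berlin",
               "http://dbpedia.org/ontology/populationTotal",
               '"3431675"')))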
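The second sketch illustrates routing a multi-triple 'fact' to a pipeline worker as one chunk. Again, the names (group_facts(), worker_for()) and the data are made up, and it only handles tree-shaped facts where each bnode hangs off a single parent, so treat it as a sketch of the idea rather than a real implementation.

import zlib
from collections import defaultdict

def is_bnode(term):
    """Blank nodes in N-Triples-style serializations start with '_:'."""
    return term.startswith("_:")

def group_facts(triples):
    """Group triples so that bnode-connected triples stay in one chunk."""
    # Map each bnode to a subject that points at it, so a chain of bnodes
    # can be chased back to a non-bnode root (tree-shaped 'facts' only).
    parent = {}
    for s, p, o in triples:
        if is_bnode(o):
            parent[o] = s

    def root(term):
        seen = set()
        while is_bnode(term) and term in parent and term not in seen:
            seen.add(term)
            term = parent[term]
        return term

    groups = defaultdict(list)
    for s, p, o in triples:
        groups[root(s)].append((s, p, o))
    return groups

def worker_for(chunk_key, n_workers=16):
    """Hash the chunk key so the whole fact lands on one pipeline worker."""
    return zlib.crc32(chunk_key.encode("utf-8")) % n_workers

# Example: a 'fact' that needs a bnode to express; all three triples end up
# keyed by the non-bnode root and therefore on the same worker.
triples = [
    ("http://example.org/NYC", "http://example.org/population", "_:b1"),
    ("_:b1", "http://example.org/value", '"8175133"'),
    ("_:b1", "http://example.org/asOf", '"2010"'),
]
for key, chunk in group_facts(triples).items():
    print(worker_for(key), key, len(chunk))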
(3) I've spoken to entrepreneurs and potential customers of semantic technology and found that, right now, people want things that are beyond the state of the art.

Often when I consult with people, I come to the conclusion that they haven't found the boundaries of what they could accomplish through plain old "bag of words" and that it's not so clear they'll do better with NLP/semantic tech. Commonly, these people have fundamental flaws in their business model (thinking that they can pay some Turk $0.20 to do $2,000 worth of marketing work).

The most common theme in semantic "product" companies is that they build complex systems out of components that just barely work. I'll single out Zemanta for this, although the same is true of many other companies. Let's just estimate that Zemanta's service has 5 components and each of these is 85% accurate; chain them together and, if the errors compound, you get roughly 0.85^5, or about 44%, end-to-end accuracy -- a system that's just an embarrassment. There are multiple routes to solving this problem (either a "widening of the scope" or a "narrowing of the scope" could help a lot), but the fact is that a lot of companies are aiming for that "sour spot", which has the paradoxical dual effects that (i) some others imitate them, and (ii) others write off the whole semantic space. Success in semantic technology is going to come from companies that find fortuitous matches between "what's possible" and "what can be sold."

Another spectre that haunts the space is legacy "information services" companies. I've talked with many people who think they're going to make big money selling a crappy product to undiscriminating customers with deep pockets (U.S. Government, Finance, Pharma, ...). I think the actual breakthroughs in semantic tech are going to come from the disruptive direction: people who find ways to make things that are drastically cheaper than the old way, but that can accept the limitations of today's semantic tech.

(4) I'm one of the people who got interested in semantic tech because of DBpedia, and yet I've largely given up on DBpedia. One day I realized that I could, with Freebase, do things in 20 minutes that would take 2 weeks of data cleanup with DBpedia. DBpedia 3.5/3.5.1 seems to be a large step backwards, with major key integrity problems that are completely invisible to 'open world' and OWL-paradigm systems. I've wound up writing my own framework for extracting 'facts' from Wikipedia because DBpedia isn't interested in extracting the things I want. Every time I try to do something with DBpedia, I make shocking discoveries (for instance, "New York City", "Berlin", "Tokyo", "Washington, D.C." and "Manchester, N.H." are not of rdf:type "City"; there's a sketch at the end of this note showing how to check that sort of thing for yourself). The fact that I see so little complaining about this on the mailing list seems to indicate that not a lot of people are trying to do real work with it.

(5) It might make me a heretic, but I've found that the closed world assumption can, properly used, (i) do miracles, and (ii) directly confront many of the practical problems that show up in RDF systems. OWL has greatly changed my thinking about schemas... I'm less concerned, however, about the official semantics of OWL than I am about the general prospect of "reasoning about schemas." I think the "inference-based" model of OWL is awesome, but pretty frequently I find I need forms of reasoning that aren't quite supported by OWL... Alternately, data partitioning and data validation are really important for me, so I need something that has some of the nature of an RDBMS schema. Of course, I can get some of this by "applying my own hermeneutics" to OWL and adding some features.
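As mentioned under point (4), here is a sketch of how to check that sort of thing yourself against the public DBpedia SPARQL endpoint. The has_city_type() helper is mine, the endpoint URL may move, and what it reports depends entirely on which DBpedia release is being served (and on network access), so take it as an illustration rather than a claim about what the endpoint returns today.

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"
CITIES = ["New_York_City", "Berlin", "Tokyo", "Washington,_D.C."]

def has_city_type(resource_name):
    """ASK whether the resource is asserted to be a dbpedia-owl:City."""
    query = (
        "ASK { <http://dbpedia.org/resource/%s> "
        "a <http://dbpedia.org/ontology/City> }" % resource_name
    )
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
        return json.load(response)["boolean"]

for name in CITIES:
    print(name, has_city_type(name))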
Received on Friday, 2 July 2010 15:08:14 UTC