- From: Michael F Uschold <uschold@gmail.com>
- Date: Wed, 14 May 2008 09:06:36 -0700
- To: "Tim Berners-Lee" <timbl@w3.org>
- Cc: "Sören Auer" <auer@informatik.uni-leipzig.de>, "Chris Bizer" <chris@bizer.de>, "Frank van Harmelen" <Frank.van.Harmelen@cs.vu.nl>, "Kingsley Idehen" <kidehen@openlinksw.com>, "SW-forum Web" <semantic-web@w3.org>, "Fabian M. Suchanek" <f.m.suchanek@gmail.com>, "Tim Berners-Lee" <timbl@csail.mit.edu>, "Jim Hendler" <hendler@cs.rpi.edu>, "Mark Greaves" <markg@vulcan.com>, georgi.kobilarov@gmx.de, "Jens Lehmann" <lehmann@informatik.uni-leipzig.de>, "Richard Cyganiak" <richard@cyganiak.de>, "Frederick Giasson" <fred@fgiasson.com>, "Michael Bergman" <mike@mkbergman.com>, "Conor Shankey" <cshankey@reinvent.com>, "Kira Oujonkova" <koujonkova@reinvent.com>, "Aldo Gangemi" <aldo.gangemi@istc.cnr.it>
- Message-ID: <406b38b50805140906qe77ac49lf463c56010647181@mail.gmail.com>
Tim,

I'm happy for this thread to be forwarded. I already posted a summary so far, a message or two ago. Here it is again (updated), for your convenience and for this new sub-thread.

For me, the punch line is that:

- To some extent, URI proliferation is inevitable in an open web. It is probably the only way to allow people to independently publish web data.
- We will have to rely on a combination of manual effort and sophisticated automated methods to detect clashes and attempt to resolve them.
- Large-scale duplication of URIs, as in the case of YAGO and DBpedia, is really bad news. That it happened at all is due to this all being very new.
- *THE TIME IS NOW* to put some infrastructure or guidelines in place to avoid this. It is already starting, but much more needs to be done. A good example of this is the UMBEL project.

Overall, there was mainly agreement in the positions stated; no competing 'camps' emerged.

*<Begin Original Summary>*

In my original post, I claimed that proliferation of URIs causes two specific problems:

- *Problem 1)* it is hard to find when two things should be the same, and
- *Problem 2)* even if you can find the links, prolific use of owl:sameAs will create computational problems.

Below is a summary of the responses so far. What is interesting is that no one else agreed that problem 2 is real. I address that point after this summary.

*ChrisB:*

- Problem 1 is not really so bad, for there is much matching technology out there that can be used, albeit with some limits on precision.
- Problem 2 is not a problem either, because no one is going to load everything into a single store.

*FrankvH:*

- Problem 1 is very real, but is only recently becoming a problem with the recent surge of semantic web data coming online. He disagrees with ChrisB's optimism.
Also, there are two issues that need to be handled differently:
  - matching at the schema/class level
  - matching instances

Frank refers to some good work going on in addressing these issues, not by matching after the fact, but by eliminating the proliferation at source.
  [1] http://sindice.com/
  [2] http://www.sindice.com/pdf/sindice-ijmso2008.pdf
  [3] http://www.okkam.org/
  [4] http://www.okkam.org/IRSW2008

- Problem 2: no comment

*ChrisBizer* says:

- My optimism was more about instance level identity links, [rather than the class level]. Within the LOD effort we repeatedly run into situations where it is really easy to generate owl:sameAs links based on some simple domain-dependent rules.

*KingsleyI* explains:

- The URL problems are being addressed; he refers to the UMBEL project. Wikipedia, OpenCyc, WordNet and YAGO identifiers are being rationalized. http://www.umbel.org/announcement.xhtml

*FredG* notes:

- There are edge cases where it is not immediately clear, even to a human, what deserves a unique URI. He also notes that technical documentation from the UMBEL project is due sometime before or during late May or June.

*<End of Original Summary - next are a few updates>*

*JimHendler* summarized things this way: "So what you are really saying is scaling is a technology/research challenge now that there's much more out there. We need to go beyond just triple stores and get some fast inferencing at Web scales. Makes sense to me."

*MikeUschold* noted that the computational issue of owl:sameAs proliferation is a major problem, even if no one is going to load all the semantic web data into a single store. For today's triple stores that do limited inference, owl:sameAs "has a significant run time" according both to common sense and to the developers of OpenLink's Virtuoso triple store <http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro>. They say it can easily double query times <http://www.openlinksw.com/weblog/oerling/?id=1347>.

*MikeUschold* also noted that companies building and delivering software products that use public data will have to bring it in house to control it. This is a response to Chris Bizer's comment that "no one is going to load ALL the data into a single store, so why worry?" For example, I can't imagine that Powerset relies on the data sitting on the DBpedia servers. More likely, they loaded the triples into their own system. Proliferation of URIs on a large scale will cause performance issues.

*Sören Auer* notes that even with such proliferation, people will be able to build useful applications. He also notes that: "Once, certain information sources are established (and for that page rank inspired data rank algorithms could be developed) - people will automatically tend to reuse established identifiers and this will counteract the proliferation."

Michael

Change management is the other elephant in the room.

On Wed, May 14, 2008 at 5:14 AM, Tim Berners-Lee <timbl@w3.org> wrote:

> This is a great discussion ... would each contributor be OK with the
> discussion to date being sent to something public so it is on the record?
> Like semantic-web@w3.org ?
> And would someone like to compile a "Story so far".
>
> (travelling and unable to really say all I want to say at this point but
> the web of multiple overlapping communities of scale-free distribution is
> key to understanding why this works for finite effort. See "total cost of
> ontologies" in http://www.w3.org/DesignIssues/Fractal.html and in
> http://www.w3.org/2005/Talks/1110-iswc-tbl/ slide 11
> So multiple URIs for the same thing is life, a constant tradeoff, but life
> is, on balance good.
> But we need different systems for people, as for books as for proteins
> because socially the situations are so different. And web science is about
> designing tech in the context of the social system you have or you plan.)
>
> Tim
>
> On 2008-05-14, at 03:29, Sören Auer wrote:
>
> Michael F Uschold wrote:
>>
>>> [..]
>>>
>>
Received on Wednesday, 14 May 2008 17:10:07 UTC