Re: Managing Co-reference (Was: A Semantic Elephant?)

Tim,

I'm happy for this thread to be forwarded.

I already posted a summary of the discussion so far, a message or two ago.
Here it is again (updated), for your convenience and for this new sub-thread.
For me, the
punch line is that:

   - to some extent, URI proliferation is inevitable in an open web.  It is
   probably the only way to allow people to independently publish web data.
   - We will have to rely on a combination of manual effort and
   sophisticated automated methods to detect clashes, and attempt to resolve
   them.
   - Large-scale duplication of URIs, as in the case of YAGO and DBpedia, is
   really bad news. That it happened at all is due to this all being very new.
   - *THE TIME IS NOW:* to put some infrastructure or guidelines in place to
   avoid this. It is already starting but much more needs to be done. A good
   example of this is the UMBEL project.

Overall, the positions stated were mostly in agreement; no competing
'camps' emerged.
*<Begin Original Summary>*

In my original post, I claimed that proliferation of URIs causes two
specific problems:

   - *Problem 1)* it is hard to find when two things should be the same and
   - *Problem 2)* even if you can find the links, prolific use of owl:sameAs
   will create computational problems.

Below is a summary of the responses so far.  What is interesting is that
no-one else agreed that problem 2 is real. I address that point after this
summary.

*ChrisB:*

   - Problem 1 is not really so bad, for there is much matching technology
   out there that can be used, albeit with some limits on
   precision.
   - Problem 2 is not a problem either, because no one is going to load
   everything into a single store.

*FrankvH:*

   - Problem 1 is very real, but has only recently become a problem with
   the surge of semantic web data coming online. He disagrees with
   ChrisB's optimism. Also, there are two issues that need to be handled
   differently:
      - matching at the schema/class level
      - matching instances

Frank refers to some good work going on in addressing these issues, not by
matching after the fact, but by eliminating the proliferation at source.

[1] http://sindice.com/
[2] http://www.sindice.com/pdf/sindice-ijmso2008.pdf
[3] http://www.okkam.org/
[4] http://www.okkam.org/IRSW2008

   - Problem 2: no comment


*ChrisBizer* says:

   - My optimism was more about instance level identity links, [rather than
   the class level]. Within the LOD effort we repeatedly run into situations
   where it is really easy to generate owl:sameAs links based on some simple
   domain-dependent rules.
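
As an illustration of what such a simple domain-dependent rule might look
like, here is a minimal sketch. It assumes, purely for the example, that two
datasets describe books and that a shared ISBN is enough to treat two URIs as
the same thing. The rule, the function names and the example.org URIs are all
hypothetical; nothing here is taken from the LOD tooling itself.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_isbn(raw):
    """Strip hyphens and spaces so differently formatted ISBNs compare equal."""
    return raw.replace("-", "").replace(" ", "").upper()


def same_as_links(records):
    """Build owl:sameAs triples for URIs that share a normalized ISBN.

    `records` is an iterable of (uri, isbn) pairs drawn from any number of
    datasets. The ISBN rule is an assumption made for this sketch.
    """
    by_key = defaultdict(list)
    for uri, isbn in records:
        by_key[normalize_isbn(isbn)].append(uri)

    links = []
    for uris in by_key.values():
        uris = sorted(set(uris))
        first = uris[0]
        # Link every other URI in the group to the first one.
        for other in uris[1:]:
            links.append((first, OWL_SAME_AS, other))
    return links


records = [
    ("http://example.org/a/book1", "0-306-40615-2"),
    ("http://example.org/b/item9", "0306406152"),
    ("http://example.org/b/item10", "9780306406157"),
]
links = same_as_links(records)
```

A rule like this is cheap to write for instances with a strong key. The
class-level matching that Frank raises would not reduce to a lookup of this
kind.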

*KingsleyI* explains:

   - The URL problems are being addressed; he refers to the UMBEL project.
   Wikipedia, OpenCyc, WordNet and YAGO identifiers are being rationalized.
   http://www.umbel.org/announcement.xhtml

*FredG* notes:

   - There are edge cases where it is not immediately clear, even to a
   human, what deserves a unique URI. He also notes that technical
   documentation from the UMBEL project is due by late May or June.

*<End of Original Summary - next are a few updates>*

*JimHendler* summarized things this way: "So what you are really saying is
scaling is a technology/research challenge now that there's much more out
there. We need to go beyond just triple stores and get some fast inferencing
at Web scales. Makes sense to me."

*MikeUschold* noted that the computational issue of owl:sameAs proliferation
is a major problem, even if no one is going to load all the semantic web data
into a single store. For today's triple stores that do limited inference,
owl:sameAs "has a significant run time" according both to common sense and
to the developers of OpenLink's Virtuoso triple store
<http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro>.
They say it can easily double query times
<http://www.openlinksw.com/weblog/oerling/?id=1347>.


*MikeUschold* also noted that companies building and delivering software
products that use public data will have to bring it in house to control
it. This is a response to Chris Bizer's comment that
"no one is going to load ALL the data into a single store, so why worry?"
For example, I can't imagine that Powerset relies on the data sitting on the
DBpedia servers. More likely, they loaded the triples into their own
system.  Proliferation of URIs on a large scale will cause performance
issues.
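
To make the cost concrete: a store that honours owl:sameAs has to treat every
URI in an equivalence class as interchangeable. One way to avoid paying for
that at query time is to collapse each class to a single representative URI
when the data is loaded in house. The sketch below does this with a
union-find structure. It is only an illustration under my own assumptions
(triples as plain tuples, the lexicographically smallest URI as the
representative, example.org URIs); it is not how Virtuoso or any other
particular store handles owl:sameAs.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def canonicalize(triples):
    """Rewrite triples so each owl:sameAs class uses one representative URI.

    `triples` is an iterable of (subject, predicate, object) tuples. The
    owl:sameAs triples themselves are consumed and not returned.
    """
    parent = {}

    def find(x):
        # Locate the root of x's class, then compress the path to it.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller URI as root so the result is deterministic.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    rest = []
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            union(s, o)
        else:
            rest.append((s, p, o))

    # Literals and unlinked URIs map to themselves.
    return {(find(s), p, find(o)) for s, p, o in rest}


triples = [
    ("http://example.org/dbpedia/Berlin", OWL_SAME_AS, "http://example.org/yago/Berlin"),
    ("http://example.org/yago/Berlin", OWL_SAME_AS, "http://example.org/umbel/Berlin"),
    ("http://example.org/yago/Berlin", "http://example.org/population", "3400000"),
    ("http://example.org/umbel/Berlin", "http://example.org/country", "http://example.org/Germany"),
]
merged = canonicalize(triples)
```

The trade-off is that the original identifiers are lost unless the mapping is
kept alongside the rewritten triples, and every new owl:sameAs link can force
a re-merge. The more duplicate URIs there are, the more of this work there is
to do.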

*Sören Auer* notes that even with such proliferation, people will be able to
build useful applications. He also notes that: "Once, certain information
sources are established (and for that page rank inspired data rank
algorithms could be developed) - people will automatically tend to reuse
established identifiers and this will counteract the proliferation."


Michael

** Change management is the other elephant in the room.


On Wed, May 14, 2008 at 5:14 AM, Tim Berners-Lee <timbl@w3.org> wrote:

> This is a great discussion ... would each contributor be OK with the
> discussion to date being sent to something public so it is on the record?
>  Like semantic-web@w3.org ?
> And would someone like to compile a "Story so far".
>
> (travelling and unable to really say all I want to say at this point but
> the web of multiple overlapping communities of scale-free distribution is
>  key to understanding why this works for finite effort. See "total cost of
> ontologies" in http://www.w3.org/DesignIssues/Fractal.html and in
> http://www.w3.org/2005/Talks/1110-iswc-tbl/ slide 11
> So multiple URIs for the same thing is life, a constant tradeoff, but life
> is, on balance good.
> But we need different systems for people, as for books as for proteins
> because socially the situations are so different. And web science is about
> designing tech in the context of the social system you have or you plan.)
>
> Tim
>
>
>
> On 2008-05-14, at 03:29, Sören Auer wrote:
>
>  Michael F Uschold wrote:
>>
>>> [..]
>>>
>>

Received on Wednesday, 14 May 2008 17:10:07 UTC