Re: Poisonous models (was the bad word) from Renaud Delbru on 2010-07-18 (public-lod@w3.org from July 2010)

From: Renaud Delbru <renaud.delbru@deri.org>
Date: Sun, 18 Jul 2010 17:49:08 +0100
To: Hugh Glaser <hg@ecs.soton.ac.uk>
CC: Daniël Bos <corani@gmail.com>, Linked Data community <public-lod@w3.org>
Message-ID: <4C433084.6030209@deri.org>

Hi Hugh,

To answer your question, Sindice will accept the document, perform
reasoning and index it as it is. However, Sindice is fairly robust to
this kind of "poisonous" data. Sindice performs a particular kind
of reasoning that we call "context-dependent" reasoning [1], in which
inference is performed in the "context of the document". The inferences
hold only in the context of this document and have no global impact,
i.e., they do not alter the inference on other documents.
Sindice therefore avoids propagating undesirable assertions. In fact, we
do not restrict the freedom of expression of data publishers, unlike
other approaches such as SAOR [2], where certain statements are
considered invalid and ignored. Data publishers are allowed to reuse and
extend ontologies or existing entities in any manner, but the
consequences of their modifications will be confined to their own
context, and will not alter the intended semantics of the other RDF
models published on the Web.
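
As a rough illustration of inference confined to the context of a
document, here is a minimal sketch. It is not Sindice's implementation:
the document URLs, the ex: entities and the single rule set (symmetry
and transitivity of owl:sameAs) are assumptions chosen for the example.

```python
SAME_AS = "owl:sameAs"

def infer_in_context(triples):
    """Close one document's triples under owl:sameAs symmetry and transitivity.

    Only the triples of this single document are used as input, so nothing
    inferred here can leak into another document's context.
    """
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        same = [(s, o) for s, p, o in closure if p == SAME_AS]
        new = set()
        for s, o in same:
            new.add((o, SAME_AS, s))          # symmetry
        for a, b in same:
            for c, d in same:
                if b == c and a != d:
                    new.add((a, SAME_AS, d))  # transitivity
        if not new <= closure:
            closure |= new
            changed = True
    return closure

def build_index(documents):
    """Reason over each document in isolation: url -> inferred triples."""
    return {url: infer_in_context(triples) for url, triples in documents.items()}

# Hypothetical documents, not real data.
documents = {
    "http://example.org/poison.rdf": {
        ("ex:a", SAME_AS, "ex:b"),
        ("ex:b", SAME_AS, "ex:c"),
    },
    "http://example.org/clean.rdf": {
        ("ex:c", "rdf:type", "ex:Person"),
    },
}

index = build_index(documents)
```

The poisonous document ends up with ex:a owl:sameAs ex:c in its own
context, while the clean document keeps exactly the triple it
published, even though both mention ex:c.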

However, if somebody requests all documents stating <?s, owl:sameAs,
dbpedia:Darby_Riordan>, Sindice will return the document
http://data.totl.net/dave.rdf. But such a problem can be tackled with
appropriate ranking methodologies (based on link analysis methods such
as [3]). Poisonous documents published on the web are unlikely to have
any incoming links (or only from other poisonous documents, but this can
be detected), and therefore will be ranked very low and will never
appear in the top-k search results.
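
To make the ranking argument concrete, here is a small sketch using a
plain PageRank-style iteration over a document link graph. This is a
generic stand-in, not the method of [3], and the graph, the doc: names
and the damping value are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank over a graph given as {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {}
        for node in nodes:
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if node in targets
            )
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

# Hypothetical link graph: the poisonous document links out to a
# well-connected document, but no document links to it.
links = {
    "doc:a": ["doc:b", "doc:c"],
    "doc:b": ["doc:c"],
    "doc:c": ["doc:a"],
    "doc:poison": ["doc:c"],
}

rank = pagerank(links)
top_k = sorted(rank, key=rank.get, reverse=True)[:3]
```

With no incoming links, doc:poison only ever receives the baseline
share of rank, so it ends up last and falls outside the top 3 here.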

[1] http://renaud.delbru.fr/doc/pub/SSWS2008-context.pdf
[2] http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-21.pdf
[3] http://renaud.delbru.fr/doc/pub/eswc2010-ding.pdf

Regards,
-- 
Renaud Delbru

On 18/07/10 16:58, Hugh Glaser wrote:
> Sure, Nathan may be.
> But Richard and Toby moved into the poisoning world.
> You can only use the techniques you describe if you have concepts of where things can/can't come from.
> And as Toby says, if Google (or Sindice) took this...
> What does happen if Sindice accepts this document?
>
> Hugh
>
> On 18 Jul 2010, at 05:54, "Daniël Bos" <corani@gmail.com> wrote:
>
>
> I think Nathan isn't talking about poisoning models (which could be prevented using reification, or using quads, which include the source of the statement, and then only trusting selected statements), but about the problem of giving spammers a tool to collect email and postal addresses from the web much more easily, by simply parsing pages instead of scraping and somehow detecting the information.
>
> Though I can see the danger in that, I personally don't think it is that much of an issue, since email addresses have always been easy to scrape, and postal addresses are in most cases easy to collect from e.g. business directories. Semantic markup makes it easier, but those wanting to collect this kind of data could and would do that anyway.
>
> --
> With kind regards,
> Daniël Bos
>
> On Jul 18, 2010 12:55 AM, "Hugh Glaser" <hg@ecs.soton.ac.uk> wrote:
>
> You better hope your system can cope with this.
> http://data.totl.net/dave.rdf
>
> Hugh
>
> On 17 Jul 2010, at 11:35, "Nathan" <nathan@webr3.org> wrote:
>
>    
>> So, after seeing this question on s...
>>

Received on Sunday, 18 July 2010 16:49:49 UTC