Re: Poisonous models (was the bad word)

Hi Hugh,

Comments below.

On 19/07/10 08:22, Hugh Glaser wrote:
>> to answer to your question, Sindice will accept the document, perform
>> reasoning and index it as it is. However, Sindice is somehow robust to
>> this kind of "poisonous" data. Sindice is performing a particular kind
>> of reasoning that we call "context-dependent" reasoning [1], in which
>> inference is performed in the "context of the document". The inference
>> will only be true in the context of this document, and will not have a
>> global impact, i.e., will not alter the inference on other documents.
>> Therefore, Sindice avoids undesirable assertions. In fact, we do not
>> restrict the freedom of expression of data publishers as in other
>> approaches like SAOR [2] where certain statements are considered invalid
>> and ignored.  Data publishers are allowed to reuse and extend ontologies
>> or existing entities in any manner, but the consequences of their
>> modifications will be confined in their own context, and will not alter
>> the intended semantics of the other RDF models published on the Web.
>>      
> Cool.
> Sounds really good that the inference part of Sindice is robust to this.
> Although I guess if I use Sindice to find relevant documents for
> dbpedia:Darby_Riordan and load them into my store, I am likely to end up
> with a pretty poisoned store.
>    
As you say, you are looking for relevant documents about
dbpedia:Darby_Riordan. In this case, with an appropriate ranking, it is
unlikely that poisonous or spam documents will appear in the top-k results.
>> However, if somebody requests all documents stating <?s, owl:sameAs,
>> dbpedia:Darby_Riordan>, Sindice will return you the document
>> http://data.totl.net/dave.rdf. But such problem can be tackled with
>> appropriate ranking methodologies (based on link analysis methods such
>> as [3]).
>> Poisonous documents published on the web are likely to not have
>> any incoming links (or only from other poisonous documents, but this can
>> be detected), and therefore will be ranked very low and will never
>> appear in the top-k search results.
>>      
> Not sure of this.
> Poisonous documents may well have many links to them (saying they are
> poisonous?).
>    
Good point, but that would mean people have agreed on a certain
vocabulary for pointing out poisonous documents. In that case, this
information (the meaning of the link) can be integrated into the ranking
function. If a document has many incoming links of a type such as
isPoisonous, then we can rank it lower.
Finding the right ranking function is then a separate (and
interesting) problem, but it is feasible.
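To make this concrete, here is a minimal sketch of a ranking score that
takes the link type into account. It is only an illustration, not what
Sindice does: the weights, the function name rank_score and the sample
document names are all assumptions, and isPoisonous stands for whatever
vocabulary people would actually agree on.

```python
from collections import Counter

# Hypothetical weights per link type (illustrative values only).
# Any link type not listed here counts as a normal, positive reference.
LINK_WEIGHTS = {"isPoisonous": -1.0}
DEFAULT_WEIGHT = 1.0


def rank_score(incoming_links):
    """Score a document from its incoming links.

    incoming_links is a list of (source_document, link_type) pairs.
    """
    counts = Counter(link_type for _source, link_type in incoming_links)
    return sum(LINK_WEIGHTS.get(t, DEFAULT_WEIGHT) * n for t, n in counts.items())


# Two documents: one with ordinary references, one flagged by two sources.
cited = [("a.rdf", "rdfs:seeAlso"), ("b.rdf", "owl:sameAs")]
flagged = [("a.rdf", "isPoisonous"), ("b.rdf", "isPoisonous"), ("c.rdf", "owl:sameAs")]
```

With these weights, the flagged document ends up below the cited one even
though it has more incoming links in total.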
> This seems to me to be comparable to the citation problem, where a paper
> gets very high citations because everyone cites it as being wrong.
> Of course, sentiment analysis etc may help (and may be easier in the
> semantic web), but pure reference count is dangerous.
>    
The ranking should not be based purely on reference counts; it should also
take the meaning of the links into consideration. However, relying only on
the meaning of the links is dangerous as well.
For example, if I create a link to a dbpedia:document saying it is
poisonous, why should people trust me? However, if a multitude of
links say that the document is poisonous, then we can have more
confidence that the document really is poisonous.
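One simple way to express this, as a sketch only: treat each distinct
source that asserts isPoisonous as independent evidence with the same
small trust value, and combine them. The independence assumption, the
trust value of 0.3 and the name poison_confidence are mine, chosen only
for illustration.

```python
def poison_confidence(links, trust=0.3):
    """Confidence that a document is poisonous, given its incoming links.

    links is a list of (source_document, link_type) pairs. Each distinct
    source asserting isPoisonous is assumed to be independently right
    with probability `trust` (an illustrative assumption).
    """
    sources = {source for source, link_type in links if link_type == "isPoisonous"}
    # Probability that at least one of the asserting sources is right.
    return 1.0 - (1.0 - trust) ** len(sources)


one_source = [("me.rdf", "isPoisonous")]
many_sources = [(f"doc{i}.rdf", "isPoisonous") for i in range(10)]
```

A single assertion stays at the base trust level, repeated links from the
same source do not add anything, and the confidence only grows as more
distinct sources agree.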

Regards,
-- 
Renaud Delbru

Received on Tuesday, 20 July 2010 09:57:25 UTC