Re: Inclusion of additional (non dereferencable) data? from Haijie.Peng on 2010-06-11 (public-lod@w3.org from June 2010)

From: Haijie.Peng <haijie.peng@gmail.com>
Date: Fri, 11 Jun 2010 12:46:38 +0800
To: Peter Ansell <ansell.peter@gmail.com>
CC: nathan@webr3.org, Linked Data community <public-lod@w3.org>
Message-ID: <4C11BFAE.5060400@gmail.com>
On 2010/6/11 7:29, Peter Ansell wrote:
> On 11 June 2010 01:24, Nathan<nathan@webr3.org>  wrote:
>    
>> All,
>>
>> Here's a common example of what I'm referring to, suppose we have a (foaf)
>> document http://ex.org/bobsmith which includes the following triples:
>>
>>   :me foaf:knows <http://example.org/joe_bloggs#me> .
>>
>>   <http://example.org/joe_bloggs#me>  a foaf:Person ;
>>     foaf:name "Joe Bloggs"@en .
>>
>> In Linked Data terms one could suggest that the description of Joe Bloggs
>> doesn't 'belong' in this document (although clearly it can be here).
>>
>> I can quite easily see how this trend came about, there are benefits, it's both
>> an optimisation method (saves dereferencing) and it's an inclusion of human
>> presentable information (which aids display / comprehension in 'foaf
>> viewers').
>>
>> However, there are drawbacks too, the data could easily go out of date / out
>> of sync, it's not dereferencable (the adverse effects in this example aren't
>> specifically clear, but in other use-cases they are considerable).
>>
>> Over and above these simple thoughts, I'm quite sure that there are bigger
>> architectural and best practise considerations (for a web of data), for
>> example:
>>
>>   - does this create an environment where we are encouraged not to dereference
>> linked data (or where it is common to look local first)
>>
>>   - does this point to bigger issues such as not having a single global
>> predicate for a default human presentable 'name' for all things that can be
>> 'named' (given a URI) - even though many candidates are available.
>>
>>   - should 'reading ahead' (dereferencing all linked data before presentation
>> to a user / trying to glean an understanding) be encouraged over providing a
>> limited local subset of the data which could easily be inaccurate or out of
>> date.
>>
>>   - is there a gut instinct in the community that most data will ultimately
>> end up being presented to a human somewhere along the line, and this is
>> driving us to make such design decisions.
>>
>> Any thoughts or strong feelings on the issue(s)? and is anybody aware of
>> whether this practise came about more by accident than by design?
>>      
> It is a very common ontology design pattern to avoid having to import
> entire ontologies and the semantic consequences of doing so. That
> probably isn't relevant to Linked Data though.
>    
We could split those large ontologies into smaller pieces and connect 
them using Linked Data.

> In terms of a default human presentable 'name' I would go no further
> than rdfs:label as the basic predicate,
There are cases where a thing does not need a name, and we have to take 
that into account. In any case, we cannot name everything; that is beyond 
human ability.

> and if people want to add
> special semantics to their label they should sub property rdfs:label.
> It is difficult if the predicate URI is not resolvable to
> automatically determine whether it is a sub property of rdfs:label
> though, but ideally it should be.
>
> If all the document author wants to do is to add their label to a
> resource that is not dereferenceable to that document then it may not
> have a detrimental effect, but if they start adding meaningful
> statements then the statements will only be discovered by accident. If
> we are relying on accidental discovery to form part of the basis for
> the Linked Data web then we have done something wrong.
>
> In my opinion it would be much better if people just give up on the
> idea of single URIs for each resource and make up a new URI whenever
> they personally want to add properties to part of the description for
> a resource but cannot directly add them to the dataset that is used by
> the original author. Then the new URI is resolvable directly, ala
> Linked Data first principles, and would need to be legitimately added
> by others via the community social process, whether it is the producer
> of the original URI or by others who think the properties are valuable
> and worth linking to.
>
> If there were multiple URIs for something, then there may be a case
> for having each document contain the set of URIs that it knows to be
> equivalent to a URI that actually appears in an RDF statement. For
> example:
>
> <blog:joe> <hasTopic> <blog:blogging> .
> <blog:blogging> <equivalentTo> <db1:blogging> , <db2:personalweblog> ,
>     <db3:onlinenews> .
>
> As long as blog:blogging is resolvable to something that contains the
> equivalency descriptions then it should be fine to add them into
> blog:joe as well. Adding the equivalency descriptions to the document
> resolved at blog:joe may be a good idea just in case the user doesn't
> want to crawl endlessly before making use of the information they have
> found.
>
> It isn't necessarily that data will be presented to a human in the
> end, but that the crawl strategy over Linked Data is not known a
> priori. Some crawlers may only go 3 levels deep and then stop, and the
> 3rd level may have revealed blog:blogging, but not the implication
> that it was the same as both db1:blogging and db2:personalweblog that
> were discovered as part of 2nd or 1st crawl levels. Even if a crawler
> goes to 50 levels they may still have the same difficulty.
>    
This shows that crawling the web is not a good approach; we need to choose 
other ways to find things. This is a deficiency of the current internet 
infrastructure (HTML, web server, browser, search engine). Just as we 
cannot name everything, we cannot index everything either. A feasible 
approach is to index only what we are concerned with.
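
The depth-limit problem described in the quoted text can be sketched as 
follows. This is only a minimal illustration, assuming a toy in-memory 
"web" (the TOY_WEB dictionary and the crawl function are hypothetical, and 
no real dereferencing happens). It shows that a crawler which stops after 
a fixed number of levels never sees the URIs beyond that limit.

```python
from collections import deque

# Hypothetical toy "web": each URI maps to the URIs its document links to.
# The URIs reuse the examples from the quoted message; the link structure
# is an assumption made only for this illustration.
TOY_WEB = {
    "blog:joe": ["blog:blogging"],
    "blog:blogging": ["db1:blogging"],
    "db1:blogging": ["db2:personalweblog"],
    "db2:personalweblog": [],
}


def crawl(start, max_depth, web=TOY_WEB):
    """Breadth-first crawl; return every URI found within max_depth links of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        uri, depth = queue.popleft()
        if depth == max_depth:
            continue  # limit reached: do not dereference this URI
        for target in web.get(uri, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


shallow = crawl("blog:joe", max_depth=2)
deeper = crawl("blog:joe", max_depth=3)
print(sorted(shallow))
print(sorted(deeper - shallow))
```

With a limit of 2 the crawler in this sketch stops at db1:blogging and 
never learns about db2:personalweblog; raising the limit only moves the 
cut-off, it does not remove it.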

> Some documents may contain too many resources to even do 2 levels of
> crawling, ie, crawl a URI and every URI in the resulting document.
> This is the reason that I was told for DBpedia not including the very
> valuable pagelinks dataset into the resolved DBpedia URIs, as there
> were far too many URIs in the resulting documents that made it
> difficult for even the most basic 2 level crawler to handle. In part
> because if there are 300 URIs on a wikipedia page (conservative naive
> estimate of a wikipedia article link count.... correct me if there are
> no pages that get this high), then the crawler has to perform 300 URI
> resolutions before being able to display the resulting page to a
> human, as they may find no use in the URIs without labels. In part it
> may be difficult to keep 301 RDF graphs in a typical user agent's
> memory just so that the user can interact with the application.
>
> Even if the crawler is told that some URIs are more important than
> others, it may have an absolute stopping case at 50 levels, and to go
> any further would be against the user's wishes or the user's application
> abilities such as disk space or RAM.
>    
This result is inevitable under the current internet infrastructure (as 
described above). Consider this scenario: some information or knowledge 
needs to be retrieved, but there is no need to display it. This means 
that we need to reconsider how the web of data is represented and 
organized.

> In both URI equivalency and human readable label cases I think it is a
> useful optimisation. I don't think it is valuable in the case of
> partial ontology imports, as the goal is to avoid dereferencing,
> rather than add another method for discovering the information. I
> think the original focus of RDF as a graph that just contains a set of
> nodes and links between the nodes, regardless of its provenance, is
> still valid, but it shouldn't be recommended in the case of Linked
> Data. In Linked Data each of the nodes needs to be globally
> discoverable, and the only generic way we have figured out for doing
> that so far that seems to work is by using the node name (URI) to
> discover more information. Reusing node names with novel information
> attached to the node doesn't help this discovery process, so it should
> be discouraged in the case of Linked Data, even though it is valid and
> useful RDF if the user is aware of the pattern.
>
> If you are willing to accept descriptions for things without
> dereferencing them then you have to trust every datasource you are
> using, but that is another discussion. It is partly related to the
> issue where labels would need to be kept up to date in order to be
> valuable.
>    
We could sign each datasource to solve this problem ("have to trust every 
datasource"), but that would hurt application performance.
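
As a rough sketch of what checking a signed datasource could look like: 
the example below uses an HMAC from the Python standard library as a 
stand-in for a real public-key signature, so it assumes the publisher and 
the consumer share a secret key. SHARED_KEY, sign and verify are 
hypothetical names chosen for illustration, not part of any Linked Data 
specification.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would more likely use
# public-key signatures so that consumers need no secret at all.
SHARED_KEY = b"example-key"


def sign(document: bytes, key: bytes = SHARED_KEY) -> str:
    """Return a hex HMAC-SHA256 tag computed over the serialized document."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()


def verify(document: bytes, signature: str, key: bytes = SHARED_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(document, key), signature)


doc = b'<http://example.org/joe_bloggs#me> foaf:name "Joe Bloggs"@en .'
tag = sign(doc)
print(verify(doc, tag))
print(verify(doc + b" # tampered", tag))
```

The verification has to be repeated for every document that is fetched, 
which is where the performance cost comes from.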

My two cents.

regards

   Peng

> Cheers,
>
> Peter
>
>
Received on Friday, 11 June 2010 04:47:19 UTC