Re: Inclusion of additional (non dereferencable) data? from Nathan on 2010-06-11 (public-lod@w3.org from June 2010)

From: Nathan <nathan@webr3.org>
Date: Fri, 11 Jun 2010 10:24:21 +0100
To: Peter Ansell <ansell.peter@gmail.com>
CC: Linked Data community <public-lod@w3.org>
Message-ID: <4C1200C5.6050002@webr3.org>
Peter Ansell wrote:
> On 11 June 2010 01:24, Nathan <nathan@webr3.org> wrote:
>> All,
>>
>> Here's a common example of what I'm referring to. Suppose we have a (foaf)
>> document http://ex.org/bobsmith which includes the following triples:
>>
>>  :me foaf:knows <http://example.org/joe_bloggs#me> .
>>
>>  <http://example.org/joe_bloggs#me> a foaf:Person ;
>>    foaf:name "Joe Bloggs"@en .
>>
>> In Linked Data terms one could suggest that the description of Joe Bloggs
>> doesn't 'belong' in this document (although clearly it can be here).
>>
>> I can quite easily see how this trend came about: there are benefits, it's both
>> an optimisation method (saves dereferencing) and it's an inclusion of human
>> presentable information (which aids display / comprehension in 'foaf
>> viewers').
>>
>> However, there are drawbacks too: the data could easily go out of date / out
>> of sync, and it's not dereferenceable (the adverse effects in this example aren't
>> specifically clear, but in other use-cases they are considerable).
>>
>> Over and above these simple thoughts, I'm quite sure that there are bigger
>> architectural and best practice considerations (for a web of data), for
>> example:
>>
>>  - does this create an environment where we are encouraged not to dereference
>> linked data (or where it is common to look local first)?
>>
>>  - does this point to bigger issues such as not having a single global
>> predicate for a default human presentable 'name' for all things that can be
>> 'named' (given a URI) - even though many candidates are available?
>>
>>  - should 'reading ahead' (dereferencing all linked data before presentation
>> to a user / trying to glean an understanding) be encouraged over providing a
>> limited local subset of the data which could easily be inaccurate or out of
>> date?
>>
>>  - is there a gut instinct in the community that most data will ultimately
>> end up being presented to a human somewhere along the line, and is this
>> driving us to make such design decisions?
>>
>> Any thoughts or strong feelings on the issue(s)? And is anybody aware of
>> whether this practice came about more by accident than by design?
> 
> It is a very common ontology design pattern to avoid having to import
> entire ontologies and the semantic consequences of doing so. That
> probably isn't relevant to Linked Data though.
> 
> In terms of a default human presentable 'name' I would go no further
> than rdfs:label as the basic predicate, and if people want to add
> special semantics to their label they should define a sub-property of
> rdfs:label.

Fully agree, and glad you suggested rdfs:label.
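
A minimal sketch of how a client might collect labels this way, assuming the triples are already loaded as plain (subject, predicate, object) tuples. The helper names and sample data are hypothetical, and the sub-property statement for foaf:name is supplied as local data rather than fetched:

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
JOE = "http://example.org/joe_bloggs#me"

# Sample data (assumption): one sub-property statement and one name.
TRIPLES = [
    (FOAF_NAME, RDFS_SUBPROPERTY_OF, RDFS_LABEL),
    (JOE, FOAF_NAME, "Joe Bloggs"),
]


def label_predicates(triples):
    """Return rdfs:label plus every predicate known (transitively) to be its sub-property."""
    found = {RDFS_LABEL}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == RDFS_SUBPROPERTY_OF and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def labels_for(subject, triples):
    """Collect the values of all label-like predicates attached to 'subject'."""
    predicates = label_predicates(triples)
    return sorted(o for s, p, o in triples if s == subject and p in predicates)


if __name__ == "__main__":
    print(labels_for(JOE, TRIPLES))
```

The lookup only sees sub-property statements it already holds, so a predicate whose definition has not been retrieved is simply not treated as a label.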

> If the predicate URI is not resolvable, though, it is difficult to
> automatically determine whether it is a sub-property of rdfs:label;
> ideally it should be resolvable.
> 
> If all the document author wants to do is to add their label to a
> resource that is not dereferenceable to that document then it may not
> have a detrimental effect, but if they start adding meaningful
> statements then the statements will only be discovered by accident. If
> we are relying on accidental discovery to form part of the basis for
> the Linked Data web then we have done something wrong.
> 
> In my opinion it would be much better if people just give up on the
> idea of single URIs for each resource and make up a new URI whenever
> they personally want to add properties to part of the description for
> a resource but cannot directly add them to the dataset that is used by
> the original author. Then the new URI is resolvable directly, à la
> Linked Data first principles, and would need to be legitimately added
> by others via the community social process, whether it is the producer
> of the original URI or by others who think the properties are valuable
> and worth linking to.

Single URLs or single URIs (i.e. are you thinking slash or frag here)?

> If there were multiple URIs for something, then there may be a case
> for having each document contain the set of URIs that it knows to be
> equivalent to a URI that actually appears in an RDF statement. For
> example:
> 
> <blog:joe> <hasTopic> <blog:blogging> .
> <blog:blogging> <equivalentTo> <db1:blogging> , <db2:personalweblog> ,
>     <db3:onlinenews> .
> 
> As long as blog:blogging is resolvable to something that contains the
> equivalency descriptions then it should be fine to add them into
> blog:joe as well. Adding the equivalency descriptions to the document
> resolved at blog:joe may be a good idea just in case the user doesn't
> want to crawl endlessly before making use of the information they have
> found.

will come back to this one :)
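
In the meantime, a rough sketch of how a client could group the URIs from that example, assuming <equivalentTo> is meant to be symmetric and transitive (the example doesn't say so). The function name is hypothetical and the pairs are taken from the pseudo-triples above:

```python
def equivalence_sets(pairs):
    """Group URIs into sets, treating each (a, b) pair as 'a is equivalent to b'."""
    parent = {}

    def find(uri):
        # Follow parent pointers to the representative, shortening the path as we go.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


# Pairs from the quoted example.
PAIRS = [
    ("blog:blogging", "db1:blogging"),
    ("blog:blogging", "db2:personalweblog"),
    ("blog:blogging", "db3:onlinenews"),
]

if __name__ == "__main__":
    for group in equivalence_sets(PAIRS):
        print(sorted(group))
```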

> It isn't necessarily that data will be presented to a human in the
> end, but that the crawl strategy over Linked Data is not known a
> priori. Some crawlers may only go 3 levels deep and then stop, and the
> 3rd level may have revealed blog:blogging, but not the implication
> that it was the same as both db1:blogging and db2:personalweblog that
> were discovered as part of the 1st or 2nd crawl levels. Even if a crawler
> goes to 50 levels it may still have the same difficulty.
> 
> Some documents may contain too many resources to even do 2 levels of
> crawling, ie, crawl a URI and every URI in the resulting document.
> This is the reason I was given for DBpedia not including the very
> valuable pagelinks dataset in the resolved DBpedia URIs: there
> were far too many URIs in the resulting documents, which made it
> difficult for even the most basic 2 level crawler to handle. In part
> this is because if there are 300 URIs on a Wikipedia page (a conservative,
> naive estimate of a Wikipedia article link count.... correct me if there
> are no pages that get this high), then the crawler has to perform 300 URI
> resolutions before being able to display the resulting page to a
> human, as they may find no use in the URIs without labels. In part it
> is because it may be difficult to keep 301 RDF graphs in a typical user
> agent's memory just so that the user can interact with the application.
> 
> Even if the crawler is told that some URIs are more important than
> others, it may have an absolute stopping case at 50 levels, and to go
> any further would be against the user's wishes or beyond the limits of
> the user's application, such as disk space or RAM.
> 
> In both URI equivalency and human readable label cases I think it is a
> useful optimisation. I don't think it is valuable in the case of
> partial ontology imports, as the goal is to avoid dereferencing,
> rather than add another method for discovering the information. I
> think the original focus of RDF as a graph that just contains a set of
> nodes and links between the nodes, regardless of its provenance, is
> still valid, but it shouldn't be recommended in the case of Linked
> Data. In Linked Data each of the nodes needs to be globally
> discoverable, and the only generic way we have figured out for doing
> that so far that seems to work is by using the node name (URI) to
> discover more information. Reusing node names with novel information
> attached to the node doesn't help this discovery process, so it should
> be discouraged in the case of Linked Data, even though it is valid and
> useful RDF if the user is aware of the pattern.
> 
> If you are willing to accept descriptions for things without
> dereferencing them then you have to trust every data source you are
> using, but that is another discussion. It is partly related to the
> issue where labels would need to be kept up to date in order to be
> valuable.

Perhaps a discussion worth having sooner rather than later..

I'd suggest that it's almost second nature for us to think of links
as links in the <a href="link"> sense, and with linked data to think of
every link as <a href="link" rel="predicate">omg where's the label</a>.
However, I'd put forward that this is purely because of the huge role
HTML has had in the web thus far. When presenting linked data to humans,
depending on what the linked-to resource is, the context under which we
are showing it, the presenting application's capabilities and so forth,
we'll more than likely want to present much more than just the
human readable name. In all use cases which consider this,
dereferencing is the only way to provide what's needed.

I'd also suggest that the focus on 'linked data explorers' has given us
the idea that we must find a way to show all links in a human readable
way; but if we think of future linked data clients as simply 'web
applications', where each web application has a task to do, and will
almost always only be considering specific types of information and
relations, then the amount of dereferencing that needs to be done to
present what the human is interested in, under the specific context of
the application's (current) role, is somewhat reduced and more manageable.
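
As a rough sketch of what I mean, assume an address-book style application that only cares about foaf:knows. The function name and the sample profile are made up for illustration, and prefixed names are left unexpanded for brevity:

```python
from urllib.parse import urldefrag

# Hypothetical profile data as (subject, predicate, object) tuples.
PROFILE = [
    (":me", "foaf:knows", "http://example.org/joe_bloggs#me"),
    (":me", "foaf:knows", "http://example.org/jane#me"),
    (":me", "foaf:homepage", "http://ex.org/"),
    (":me", "foaf:depiction", "http://ex.org/bob.jpg"),
]


def documents_to_fetch(triples, subject, wanted_predicates):
    """Return the documents worth dereferencing for this application's task."""
    documents = set()
    for s, p, o in triples:
        if s == subject and p in wanted_predicates:
            # Drop the fragment: the document URL is what actually gets fetched.
            documents.add(urldefrag(o).url)
    return sorted(documents)


if __name__ == "__main__":
    print(documents_to_fetch(PROFILE, ":me", {"foaf:knows"}))
```

Of the four links in the profile, only the two the task needs would be dereferenced.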

To be specific, I've put some thought into this. If you consider a
typical web 2.0 style blog page, often you have 30+ resources which are
all dereferenced and pulled in to show an HTML page (often many, many
more) - css, images, javascript, videos, adverts etc - and each of those
resources will typically consume a huge amount of bandwidth compared to
small granular rdf documents. Thus I'd suggest that displaying linked
data resources, with all related resources which are to be considered
under the context of the current query/task, is actually a non-issue,
and in fact far lighter on the agent than the current state of the web
of documents.

I'll stop here before I go off on a tangent,

Best,

Nathan
Received on Friday, 11 June 2010 09:25:32 UTC