Re: Inclusion of additional (non dereferencable) data? from Peter Ansell on 2010-06-10 (public-lod@w3.org from June 2010)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Fri, 11 Jun 2010 09:29:22 +1000
To: nathan@webr3.org
Cc: Linked Data community <public-lod@w3.org>
Message-ID: <AANLkTim6tx6ze6_KnFLZpcRVeaxl8S-S_qemyPSWMA1U@mail.gmail.com>
On 11 June 2010 01:24, Nathan <nathan@webr3.org> wrote:
> All,
>
> Here's a common example of what I'm referring to, suppose we have a (foaf)
> document http://ex.org/bobsmith which includes the following triples:
>
>  :me foaf:knows <http://example.org/joe_bloggs#me> .
>
>  <http://example.org/joe_bloggs#me> a foaf:Person ;
>    foaf:name "Joe Bloggs"@en .
>
> In Linked Data terms one could suggest that the description of Joe Bloggs
> doesn't 'belong' in this document (although clearly it can be here).
>
> I can quite easily see how this trend came about, there are benefits, it's both
> an optimisation method (saves dereferencing) and it's an inclusion of human
> presentable information (which aids display / comprehension in 'foaf
> viewers').
>
> However, there are drawbacks too, the data could easily go out of date / out
> of sync, it's not dereferencable (the adverse effects in this example aren't
> specifically clear, but in other use-cases they are considerable).
>
> Over and above these simple thoughts, I'm quite sure that there are bigger
> architectural and best practise considerations (for a web of data), for
> example:
>
>  - does this create an environment where we are encouraged not to dereference
> linked data (or where it is common to look local first)
>
>  - does this point to bigger issues such as not having a single global
> predicate for a default human presentable 'name' for all things that can be
> 'named' (given a URI) - even though many candidates are available.
>
>  - should 'reading ahead' (dereferencing all linked data before presentation
> to a user / trying to glean an understanding) be encouraged over providing a
> limited local subset of the data which could easily be inaccurate or out of
> date.
>
>  - is there a gut instinct in the community that most data will ultimately
> end up being presented to a human somewhere along the line, and this is
> driving us to make such design decisions.
>
> Any thoughts or strong feelings on the issue(s)? and is anybody aware of
> whether this practise came about more by accident than by design?

Including a local copy of a few statements is a very common ontology
design pattern, used to avoid having to import entire ontologies and
the semantic consequences of doing so. That probably isn't relevant to
Linked Data though.

In terms of a default human presentable 'name' I would go no further
than rdfs:label as the basic predicate, and if people want to add
special semantics to their label they should define a subproperty of
rdfs:label. If the predicate URI is not resolvable, though, it is
difficult to determine automatically whether it is a subproperty of
rdfs:label; ideally it should be resolvable.
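
As a rough illustration of the label lookup described above, here is a
minimal Python sketch. It is only a sketch under stated assumptions:
triples are plain tuples of strings, the function names are
hypothetical, and the rdfs:subPropertyOf statement in the sample data
stands in for whatever resolving the predicate URI would return.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"


def label_properties(triples):
    """Return rdfs:label plus every property stated, directly or
    transitively, to be a subproperty of it."""
    found = {RDFS_LABEL}
    changed = True
    while changed:  # repeat until no new subproperty is discovered
        changed = False
        for s, p, o in triples:
            if p == RDFS_SUBPROPERTY_OF and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def labels_for(resource, triples):
    """Collect the values of every label-like property on `resource`."""
    props = label_properties(triples)
    return sorted(o for s, p, o in triples if s == resource and p in props)


# Sample data: the first triple comes from the quoted FOAF example; the
# second is what resolving the predicate URI is assumed to reveal.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
JOE = "http://example.org/joe_bloggs#me"
sample = [
    (JOE, FOAF_NAME, "Joe Bloggs"),
    (FOAF_NAME, RDFS_SUBPROPERTY_OF, RDFS_LABEL),
]
print(labels_for(JOE, sample))      # ['Joe Bloggs']
print(labels_for(JOE, sample[:1]))  # [] when the predicate cannot be resolved
```

The second call shows the difficulty: without the subproperty
statement, a consumer has no automatic way to know that the predicate
carries a label.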

If all the document author wants to do is to add their label to a
resource that is not dereferenceable to that document then it may not
have a detrimental effect, but if they start adding meaningful
statements then the statements will only be discovered by accident. If
we are relying on accidental discovery to form part of the basis for
the Linked Data web then we have done something wrong.

In my opinion it would be much better if people just gave up on the
idea of a single URI for each resource and made up a new URI whenever
they personally want to add properties to the description of a
resource but cannot add them directly to the dataset used by the
original author. The new URI is then directly resolvable, à la
Linked Data first principles, and would need to be legitimately added
by others via the community social process, whether by the producer
of the original URI or by others who think the properties are valuable
and worth linking to.

If there were multiple URIs for something, then there may be a case
for having each document contain the set of URIs that it knows to be
equivalent to a URI that actually appears in an RDF statement. For
example:

<blog:joe> <hasTopic> <blog:blogging> .
<blog:blogging> <equivalentTo> <db1:blogging> ,
    <db2:personalweblog> , <db3:onlinenews> .

As long as blog:blogging is resolvable to something that contains the
equivalency descriptions then it should be fine to add them into
blog:joe as well. Adding the equivalency descriptions to the document
resolved at blog:joe may be a good idea just in case the user doesn't
want to crawl endlessly before making use of the information they have
found.
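
To make the idea of a set of known-equivalent URIs concrete, here is a
small Python sketch. The tuple representation and the function name
equivalents_of are assumptions made for illustration, and treating
equivalentTo as symmetric and transitive is also an assumption; the
URIs are the abbreviated ones from the example above.

```python
def equivalents_of(uri, statements):
    """Return every URI connected to `uri` by equivalentTo statements,
    including `uri` itself."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps chains short
            u = parent[u]
        return u

    for s, p, o in statements:
        if p == "equivalentTo":
            parent[find(s)] = find(o)  # merge the two groups

    root = find(uri)
    return {u for u in list(parent) if find(u) == root}


statements = [
    ("blog:joe", "hasTopic", "blog:blogging"),
    ("blog:blogging", "equivalentTo", "db1:blogging"),
    ("blog:blogging", "equivalentTo", "db2:personalweblog"),
    ("blog:blogging", "equivalentTo", "db3:onlinenews"),
]
print(sorted(equivalents_of("db2:personalweblog", statements)))
```

A consumer holding only the blog:joe document can answer this lookup
without any further dereferencing, provided the equivalency statements
were copied into that document.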

It isn't necessarily that data will be presented to a human in the
end, but that the crawl strategy over Linked Data is not known a
priori. Some crawlers may only go 3 levels deep and then stop, and the
3rd level may have revealed blog:blogging, but not the implication
that it is the same as both db1:blogging and db2:personalweblog, which
were discovered at the 1st or 2nd crawl level. Even if a crawler
goes to 50 levels it may still have the same difficulty.
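
The effect of a fixed crawl depth can be sketched in Python as follows.
This is a toy model, not a real crawler: the documents dictionary
stands in for HTTP dereferencing, every subject and object is treated
as a followable URI, and the names crawl, remote_only and inlined are
hypothetical.

```python
from collections import deque


def crawl(start, documents, max_levels):
    """Breadth-first crawl that stops after `max_levels` levels.

    `documents` maps a URI to the list of triples in the document it
    resolves to; level 1 is the start URI itself.
    """
    seen = {start}
    queue = deque([(start, 1)])
    collected = []
    while queue:
        uri, level = queue.popleft()
        triples = documents.get(uri, [])  # stand-in for an HTTP dereference
        collected.extend(triples)
        if level == max_levels:
            continue  # absolute stopping case: do not follow links further
        for s, p, o in triples:
            for linked in (s, o):  # toy data: every subject and object is a URI
                if linked not in seen:
                    seen.add(linked)
                    queue.append((linked, level + 1))
    return collected


TOPIC = ("blog:joe", "hasTopic", "blog:blogging")
EQUIV = ("blog:blogging", "equivalentTo", "db1:blogging")

# Equivalence stated only in the document that blog:blogging resolves to.
remote_only = {"blog:joe": [TOPIC], "blog:blogging": [EQUIV]}
# Equivalence also copied into the document that blog:joe resolves to.
inlined = {"blog:joe": [TOPIC, EQUIV], "blog:blogging": [EQUIV]}

print(EQUIV in crawl("blog:joe", remote_only, 1))  # False
print(EQUIV in crawl("blog:joe", inlined, 1))      # True
print(EQUIV in crawl("blog:joe", remote_only, 2))  # True
```

With the statement copied locally, the shallow crawl already sees the
equivalence; without it, the result depends on a crawl depth that the
publisher cannot know in advance.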

Some documents may contain too many resources to even do 2 levels of
crawling, i.e., crawl a URI and every URI in the resulting document.
This is the reason I was given for DBpedia not including the very
valuable pagelinks dataset in the resolved DBpedia URIs: there would
be far too many URIs in the resulting documents for even the most
basic 2 level crawler to handle. In part this is because if there are
300 URIs on a Wikipedia page (a conservative, naive estimate of a
Wikipedia article's link count; correct me if there are no pages that
get this high), then the crawler has to perform 300 URI resolutions
before being able to display the resulting page to a human, who may
find no use in the URIs without labels. In part it is because it may
be difficult to keep 301 RDF graphs in a typical user agent's memory
just so that the user can interact with the application.
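
The arithmetic behind those numbers can be written out as a short
Python sketch. It assumes, purely for illustration, that every document
links to the same number of URIs and that no URI is repeated, so it
gives an upper bound rather than a measurement; the function name is
hypothetical.

```python
def fetches_needed(links_per_document, levels):
    """Upper bound on dereferences for a crawl of `levels` levels, assuming
    every document links to `links_per_document` URIs and none repeat."""
    return sum(links_per_document ** k for k in range(levels))


print(fetches_needed(300, 2))  # 301: the page itself plus its 300 links
print(fetches_needed(300, 3))  # 90301
```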

Even if the crawler is told that some URIs are more important than
others, it may have an absolute stopping case at 50 levels, and to go
any further would be against the user's wishes or beyond the limits of
the user's application, such as disk space or RAM.

In both URI equivalency and human readable label cases I think it is a
useful optimisation. I don't think it is valuable in the case of
partial ontology imports, as the goal is to avoid dereferencing,
rather than add another method for discovering the information. I
think the original focus of RDF as a graph that just contains a set of
nodes and links between the nodes, regardless of its provenance, is
still valid, but it shouldn't be recommended in the case of Linked
Data. In Linked Data each of the nodes needs to be globally
discoverable, and the only generic way we have figured out for doing
that so far that seems to work is by using the node name (URI) to
discover more information. Reusing node names with novel information
attached to the node doesn't help this discovery process, so it should
be discouraged in the case of Linked Data, even though it is valid and
useful RDF if the user is aware of the pattern.

If you are willing to accept descriptions for things without
dereferencing them then you have to trust every data source you are
using, but that is another discussion. It is partly related to the
issue where labels would need to be kept up to date in order to be
valuable.

Cheers,

Peter
Received on Thursday, 10 June 2010 23:29:55 UTC