implied datasets from William Waites on 2011-05-23 (public-lod@w3.org from May 2011)

From: William Waites <ww@styx.org>
Date: Mon, 23 May 2011 15:01:40 +0200
To: public-lod@w3.org
Message-ID: <20110523130140.GL76920@styx.org>

This is the RDF version of the question I just sent to the CKAN list
[1]. It is somewhat a policy question and I believe that in RDF terms
the open world means the answer is basically, "yes you can say what
you want".

Consider the diagram here,

  http://semantic.ckan.net/group/?group=http://ckan.net/group/lld

which shows the interconnections between library datasets. You'll
notice there is a partition. This partition is not really there.

Here's why. In library world, perhaps more than elsewhere, it is
common to do things like this,

<http://example.org/issn/1234-5678> a bibo:Journal;
    blah blah blah some descriptions;
    owl:sameAs <urn:issn:1234-5678>.

This is because there are standard identifiers for lots of things that
are found in libraries, and they even have a urn namespace. So when
publishing this data it is a lot easier to link to the urn than to go
out and use something like Silk to try to find links. The links are
already implied by the identifiers we have in hand.

So two such datasets are indeed connected in the way we think of RDF
datasets as being connected. The connection does not necessarily have
semantics as strict as owl:sameAs. We would probably not choose to
actually materialise its productions here, especially since the
entities might be modelled in different, incompatible ways, so
owl:sameAs is really not the right predicate to be using. But the
datasets are at least connected with semantics along the lines of
rdfs:seeAlso. The point is, the two datasets are transitively
connected through the shared identifier.
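
To make the transitive connection concrete, here is a minimal sketch in
Python. The two datasets and their URIs are made up for illustration;
each is reduced to a mapping from a local URI to the urn: it points at.
It only pairs up entities that name the same urn: and says nothing
about which predicate ought to relate them.

```python
# Hypothetical sample data: local URI -> the urn: identifier it links to.
dataset_a = {
    "http://example.org/issn/1234-5678": "urn:issn:1234-5678",
    "http://example.org/issn/0378-5955": "urn:issn:0378-5955",
}
dataset_b = {
    "http://example.com/journal/42": "urn:issn:1234-5678",
}


def implied_links(a, b):
    """Pair local URIs from a and b that point at the same urn:."""
    # Index dataset b by urn: so each lookup from a is a dict access.
    by_urn = {}
    for local_b, urn in b.items():
        by_urn.setdefault(urn, []).append(local_b)
    return sorted(
        (local_a, local_b)
        for local_a, urn in a.items()
        for local_b in by_urn.get(urn, [])
    )


links = implied_links(dataset_a, dataset_b)
for local_a, local_b in links:
    print(local_a, "<->", local_b)
```

Neither dataset mentions the other, yet the shared urn: is enough to
recover the link between them.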

But because we have no extant dataset that contains all the ISSNs,
particularly all ISSNs where the identifier is expressed as a urn:
URI, we have nothing to put in our voiD linkset -- which is how the
relationships between these datasets are represented at a high
level. So we have an apparent partition.

What I propose to do here is invent an implied dataset, the one that
contains in principle the entire list of ISSNs. Something like,

    <urn:issn:0000-0000> a rdfs:Resource.
    <urn:issn:0000-0001> a rdfs:Resource.
    ...

but which actually should contain X a rdfs:Resource for everything in
the valid lexical space of urn:issn, which may be (countably) infinite
for all I know.
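
On the size of that space: if the ISSN standard (ISO 3297) is read as
seven digits followed by a modulus-11 check character, with weights 8
down to 2 and X standing for 10, then the space is finite, with at most
10**7 members. The sketch below generates the implied dataset lazily
under that assumption. The function names are made up for illustration.

```python
# Weights applied to the first seven digits, left to right (assumed
# from the usual description of the ISSN check character).
WEIGHTS = (8, 7, 6, 5, 4, 3, 2)


def issn_check_digit(first_seven):
    """Return the check character for a string of seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, WEIGHTS))
    remainder = total % 11
    if remainder == 0:
        return "0"
    check = 11 - remainder
    return "X" if check == 10 else str(check)


def issn_urn(n):
    """Return the urn: for the n-th seven-digit body, 0 <= n < 10**7."""
    if not 0 <= n < 10**7:
        raise ValueError("an ISSN body has exactly seven digits")
    body = f"{n:07d}"
    return f"urn:issn:{body[:4]}-{body[4:]}{issn_check_digit(body)}"


def implied_dataset(limit=10**7):
    """Lazily yield one Turtle statement per member of the space."""
    for n in range(limit):
        yield f"<{issn_urn(n)}> a rdfs:Resource ."


for line in implied_dataset(limit=3):
    print(line)
```

Because the statements are generated on demand, the dataset never has
to be stored anywhere, which is more or less what "implied" means here.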

Then for each dataset that I have that links into this space, I count
those links and make a linkset pointing at this imaginary dataset.
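
A rough sketch of that counting step, in Python. The triples are plain
tuples of strings and all the URIs passed in are placeholders. The
void: terms used are the ones voiD defines for linksets, and the output
assumes a void: prefix declaration elsewhere in the file.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def count_links(triples, predicate=OWL_SAME_AS, target_prefix="urn:issn:"):
    """Count triples using predicate whose object is in the urn: space."""
    return sum(
        1 for _, p, o in triples
        if p == predicate and o.startswith(target_prefix)
    )


def linkset_turtle(linkset, source, target, predicate, n):
    """Render a void:Linkset description as Turtle (prefixes omitted)."""
    return "\n".join([
        f"<{linkset}> a void:Linkset ;",
        f"    void:subjectsTarget <{source}> ;",
        f"    void:objectsTarget <{target}> ;",
        f"    void:linkPredicate <{predicate}> ;",
        f"    void:triples {n} .",
    ])


# Hypothetical dataset: two links into the ISSN space, one unrelated triple.
triples = [
    ("http://example.org/issn/1234-5678", OWL_SAME_AS, "urn:issn:1234-5678"),
    ("http://example.org/issn/0378-5955", OWL_SAME_AS, "urn:issn:0378-5955"),
    ("http://example.org/issn/1234-5678", OWL_SAME_AS, "http://example.com/j/1"),
]

n_links = count_links(triples)
description = linkset_turtle(
    "http://example.org/void#issn-links",   # placeholder linkset URI
    "http://example.org/void#dataset",      # placeholder source dataset
    "http://example.org/void#all-issns",    # placeholder implied dataset
    OWL_SAME_AS,
    n_links,
)
print(description)
```

The implied dataset only needs a URI to serve as void:objectsTarget.
Nothing in the linkset requires that its triples be published.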

Obviously the same strategy applies anywhere there exists some kind of
standard identifier that is not an HTTP URI.

Does this make sense?

Can we sensibly talk about and even assert the existence of a dataset
of infinite size? (whatever "existence" means).

Is this an abuse of DCat/voiD?

Are datasets of this class subsets of sameAs.org (assuming sameAs.org
to be complete in principle)?

Cheers,
-w

[1] http://lists.okfn.org/pipermail/ckan-discuss/2011-May/001269.html
-- 
William Waites                <mailto:ww@styx.org>
http://river.styx.org/ww/        <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
Received on Monday, 23 May 2011 13:02:04 UTC