Re: Comments (with use case) from Hugh Glaser on 2017-01-06 (public-dwbp-comments@w3.org from January 2017)

From: Hugh Glaser <hugh@glasers.org>
Date: Fri, 6 Jan 2017 14:05:08 +0000
To: Phil Archer <phila@w3.org>
Cc: public-dwbp-comments@w3.org, POE Comment list <public-poe-comments@w3.org>
Message-Id: <11ECC485-3A45-4038-B80D-F13C85DEBA0C@glasers.org>
Hi Phil,

> On 5 Jan 2017, at 11:22, Phil Archer <phila@w3.org> wrote:
> 
> + POE WG
> 
> Thanks again for posting this, Hugh.
Pleasure.
> 
> Now that I've actually read and thought about what you've said here, I think DWBP covers the topic as far as it will. BP4 [1] simply stresses the importance of including licence info. In the NYT example, yes, different licence info applies to different subsets, but the high level advice in the DWBP doc still obtains.
Hmm. You are right about the licence issue.
But I think there is an access issue.
That's why I'm looking at BP18.
It says things like "Another way to subset a dataset" - (this actually follows a paragraph that doesn't really tell you how, by the way - it talks about how to access, not how to actually subset).
So the advice on how to subset is rather meagre - basically it says "split it into smaller units".
The example of course helps to flesh it out in one way.
So I go back to what DBpedia does, as a very commonly accessed large dataset.
They provide http://wiki.dbpedia.org/downloads-2016-04
Which is exactly the sort of thing I need.
[Sorry - I should have said that this doesn't just relate to owl:sameAs - there are other predicates I would like to be able to get: owl:differentFrom (to power http://differentfrom.org ), and then all the SKOS and other vocabularies that have things like closeMatch, exactMatch etc.]

I guess my concern is that you wouldn't arrive at the DBpedia solution if you used this document.
All the suggestions in BP18 are about giving dynamic access to a stored dataset.
I can't see where any advice on serving static files is given.
The RDF Data Cube Vocabulary and the Example are both about retrieving subsets from a store.
So, especially for a large dataset, there is little advice on how to subset the data into files, so that it can be served efficiently.
Maybe that is because server efficiency is not in the list of aimed Benefits - although Processability might be thought to include that?
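
As a rough illustration of the static-file subsetting described above, here is a minimal sketch that sorts N-Triples lines into per-predicate buckets, so that linkage such as owl:sameAs can be served (and licensed) as its own file. It is not what DBpedia actually runs; the output file names, the DEFAULT_FILE name and the function name are assumptions made up for the example, and it assumes well-formed N-Triples with one triple per line.

```python
from collections import defaultdict

# Hypothetical mapping from link predicate IRI to the static file
# that should hold its triples. The file names are illustrative only.
LINK_PREDICATES = {
    "http://www.w3.org/2002/07/owl#sameAs": "sameas.nt",
    "http://www.w3.org/2002/07/owl#differentFrom": "differentfrom.nt",
    "http://www.w3.org/2004/02/skos/core#exactMatch": "skos-exactmatch.nt",
    "http://www.w3.org/2004/02/skos/core#closeMatch": "skos-closematch.nt",
}

# Hypothetical name for the file holding everything that is not linkage.
DEFAULT_FILE = "core.nt"


def split_by_predicate(lines):
    """Group N-Triples lines by target file name, based on the predicate."""
    buckets = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and N-Triples comments.
        if not stripped or stripped.startswith("#"):
            continue
        # Subject and predicate never contain whitespace in N-Triples,
        # so the second whitespace-separated token is the predicate.
        parts = stripped.split(None, 2)
        if len(parts) < 3:
            continue  # not a complete triple; ignore it
        predicate = parts[1].strip("<>")
        target = LINK_PREDICATES.get(predicate, DEFAULT_FILE)
        buckets[target].append(stripped)
    return dict(buckets)
```

Each bucket could then be written out as a separate dump file with its own licence and provenance metadata; the same pattern would apply to splitting by place or by time, given a different bucketing rule.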

Getting quite specific: :-)
I realise that I have a problem with the paragraph in BP18 that starts
"Consider the expected use cases for your dataset and determine what types of subsets are likely to be most useful. "
I sort of expect it to go on to discuss how to do the considering of the types of subsets.
But it doesn't.
It goes on to discuss how to do it.
So, I would split that paragraph after the first sentence.
And I would add, after the first sentence (in the new first para), some words about how it might be by place, by time, by common meaning (my bit!).

Anyway, thanks for engaging with me - very enjoyable.

> 
> *However*
> 
> I think your use case is very pertinent to the Permissions and Obligations Expression WG (hence adding the additional list). Their use case doc includes the kind of thing you're after - I hope. See http://w3c.github.io/poe/ucr/#POE.UC.06 and the following one (which is from the news industry).
Yes, the licence issue is probably a use case for the POE - although it wasn't the primary issue for me.
> 
> As ever, I imagine that this will all come down to identifiers, subsets and, where relevant, named graphs, but the point about provenance and licensing being closely linked is well made.
Thanks - I think it is most important in the Linked Data world, where people are encouraged to add links all the time, and the licence and provenance (and of course I should have mentioned trust!) just get lumped into the same dataset metadata.

Very best
Hugh
> 
> Cheers
> 
> Phil
> 
> [1] https://www.w3.org/TR/dwbp/#DataLicense
> 
> On 29/12/2016 11:32, Hugh Glaser wrote:
>> As suggested by Phil Archer (when I posted this to public-LOD), I am reposting here.
>> (I read Hadley's Facebook post to mean that only W3C members could comment now.)
>> 
>> Repost:
>> https://www.w3.org/TR/dwbp/
>> 
>> Hi.
>> I have just seen a reference to this on Facebook, posted by Hadley - many thanks.
>> 
>> I guess it is all too late (sorry!), but thought I would raise one issue, in case someone here feels they can take it up.
>> And it is sort of interesting for this list.
>> 
>> As far as I can see (really sorry if I have missed it), there is no suggestion of splitting datasets for licence purposes.
>> There is a bit on it in BP18 for different users and use cases.
>> 
>> The use case I am thinking about is the NYT (New York Times) LD release, all those years ago.
>> There was a bunch of data they had made into LD, and wanted to make it public; they also wanted to make the links that they had established to other datasets public.
>> So they gathered it all together, and put it in one dataset, with the appropriate licence, etc.
>> This would conform (if they did some more) with the Best Practices here.
>> 
>> However, this is probably not the best thing for them.
>> The basic dataset that they wanted to publish came with a bunch of licence restrictions - it is in some sense their treasure map, and they don't want to lose control of it.
>> The linkage, on the other hand, is exactly the stuff they want people to take away and do whatever they like with - after all, it is the very information that people need to find their data in the dataset; in SEO terms, it is driving traffic to their site.
>> 
>> (In my case, in very practical terms, I want to be able to harvest the owl:sameAs triples and put them in sameAs.org, safe in the knowledge that I am not violating any conditions.
>> And, I think, the NYT very much wants me to do that, so that their dataset gets found.)
>> In addition, in a related issue about splitting datasets, the provenance of the linkage is actually usually quite different from the provenance of the dataset. It may be that the linkage is the result of an intern spending the summer doing some work, whereas the rest of the dataset is in fact the result of decades of work (as was the case of the NYT).
>> 
>> DBpedia very helpfully splits out this sort of data - not for licence reasons, I think (at least at the moment, although it might be the case that there should be different licences), but for convenience, with a very large dataset.
>> 
>> An additional use case:
>> Many of the libraries of the world are making their catalogue subject data available. They have also established links between their catalogue and other catalogues. Using these links, I was able to build http://sameas.org/store/kelle/ , which enables the closures of quite a few of the catalogue equivalences.
>> The libraries were all very happy to give me this linkage information - had this information been bundled up with the catalogue data, the process of allowing me free use would have been much more problematic, and indeed I might not have got any data.
>> 
>> So, is there any scope for comments somewhere about this?
>> I think it would be great if the idea of providing linkage with a separate licence (even if it is in the same physical distribution of the dataset) could be included.
>> 
>> Best, and season's greetings to you all.
>> 
>> Hugh
>> 
> 
> -- 
> 
> 
> Phil Archer
> Data Strategist, W3C
> http://www.w3.org/
> 
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1
>
Received on Friday, 6 January 2017 14:05:45 UTC