Re: [dxwg] Distributions, services and implementation-resources

Content copied over from #52:

agreiner commented 6 hours ago
I don't disagree with the text here, but I think it is worth pointing out that it is a bit paradoxical with respect to what some of us have been asserting with regard to profile negotiation.
"The definition text of dcat:Distribution has been revised to clarify that distributions are primarily representations of datasets. As such, all distributions of a given dataset should be informationally equivalent. " Here, it is assumed that representations of a dataset are informationally equivalent, but profile negotiation would return datasets that are not informationally equivalent, because different profiles may include different subsets of the dataset. My preference is to keep distributions informationally equivalent and ask ourselves if there is a way to make it clear that profile negotiation does not deliver informationally equivalent responses.

rob-metalinkage commented 6 hours ago
Narrowing the scope, as proposed, breaks backwards compatibility with existing DCAT implementations.

Services that support queries against a dataset are never "informationally equivalent".

If this restricted view is held, then any distribution that supports accessing a file remotely is by definition not a dcat:Distribution but a dcat:DistributionService. Just the basic web architecture of allowing an HTTP HEAD request is sufficient to break information equivalence, and content negotiation over language or MIME type does too. Different formats are not informationally equivalent - for example, a CSV file loses relationships between attributes compared to complex properties:

id,value1,units1,value2,units2
1,2.3,"m/s",6.7,"kg"

{ "id": 1,
  "value1": { "value": 2.3, "units": "m/s" },
  "value2": { "value": 6.7, "units": "kg" } }

CSV holds less information because value1 and units1 need further out-of-band information to be related to each other.
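One way to read this claim is sketched below in Python. It is only an illustration under stated assumptions: the `PAIRING` table and the `nest` helper are hypothetical names, standing in for the out-of-band knowledge a consumer would need before the flat CSV row can be turned into the nested form.

```python
import csv
import io

CSV_TEXT = 'id,value1,units1,value2,units2\n1,2.3,"m/s",6.7,"kg"\n'

# Hypothetical out-of-band knowledge: which units column belongs to which
# value column. Nothing in the CSV file itself states this pairing.
PAIRING = {"value1": "units1", "value2": "units2"}


def nest(row, pairing):
    """Rebuild the nested record from a flat CSV row using the pairing table."""
    record = {"id": int(row["id"])}
    for value_col, units_col in pairing.items():
        record[value_col] = {
            "value": float(row[value_col]),
            "units": row[units_col],
        }
    return record


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
nested = [nest(row, PAIRING) for row in rows]
print(nested[0])
```

Without `PAIRING` (or a naming convention that plays the same role), a generic consumer only sees five sibling columns.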

So - unless you can come up with a robust statement about testability of information equivalence, it strikes me as a slippery slope with no huge value.

OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes treating services as equivalent to distributions logically consistent.

rob-metalinkage commented 5 hours ago
Further to that - if we know what profiles each distribution and/or service supports, perhaps it's up to the profiles to be described in a way that makes informational equivalence visible - for example, maybe what's really required is an implementation resource to transform one profile into another.

Use Cases for reliance on information equivalence would seem to be missing - I think you would really need to find evidence for such a need.

agreiner commented 5 hours ago
You are right that CSV can offer less information than JSON, and is particularly likely to do so if there is relational information to be shared, though I would argue that your CSV example shows the relationship between the two values by including them on the same line. Clearly, one can publish informationally equivalent data in both formats, and one can also make the mistake of dropping information when translating from one to the other. In any guidance document, one might caution publishers against choosing a CSV layout that drops relationships. One might also caution them against dropping out entire rows from a CSV, but one would not then assume that CSV needs to be treated as a form that is inconsistent in informational content. A little googling shows me two definitions of informational equivalence: (1) that information is equivalent if all the information in one representation can be inferred from the other, and (2) that information is equivalent if the same tasks can be performed with both. I don't claim to be expert in information theory (an MIMS degree notwithstanding), but this doesn't seem an intractable problem. (ref:'informationally%20equivalent'&f=false).
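As a rough illustration of how definition (1) might be made testable, here is a minimal Python sketch. It assumes a dotted-path column naming convention (`value1.units` and so on), which is my own choice for the example and not something DCAT or CSV prescribes; `flatten`, `unflatten` and `round_trips` are hypothetical helper names.

```python
def flatten(record, prefix=""):
    """Turn a nested record into flat columns named by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested record from dotted-path columns.

    Assumes keys never contain "." themselves and no nested dict is empty.
    """
    record = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = record
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return record


def round_trips(record):
    """True if the flat form lets us infer everything in the nested form."""
    return unflatten(flatten(record)) == record


nested = {
    "id": 1,
    "value1": {"value": 2.3, "units": "m/s"},
    "value2": {"value": 6.7, "units": "kg"},
}
flat = flatten(nested)

# A lossy translation: the units columns were dropped on the way to CSV.
lossy = {key: value for key, value in flat.items() if not key.endswith(".units")}

print(round_trips(nested))         # flat form with dotted headers keeps everything
print(unflatten(lossy) == nested)  # lossy form cannot be inferred back
```

Under this convention the flat form passes the round trip, while the version with the units columns dropped does not, which is the kind of mistake a guidance document could warn against.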

agreiner commented 5 hours ago
I can think of several use cases for equivalence of informational content. If two different users wish to avail themselves of data provided from an API, they may each have existing ingest tools that handle data in different serializations. Neither would want to spend time reworking their tool to handle the other serialization. Another is the issue of reproducibility: comparing data from different analyses to determine whether one should expect them to reach similar conclusions.

rob-metalinkage commented 4 hours ago
Do we have some conflicting perspectives, @makxdekkers? I think somewhere you argued that using DCAT 1.0 to catalogue the DCAT-AP and its distribution resources should be validly backwards compatible, but these resources are not informationally equivalent (if we agree either of the defs found by @agreiner are reasonable).

I think we would need to formalise the Use Case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases we need to handle where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions, particularly since, given the nuances of transformation that might exist in different contexts, it would be hard to define a specific model and enforce it for all past DCAT usage.

agreiner commented 4 hours ago
Uh oh, thinking this through a bit more, I'm starting to wonder what the difference between a Distribution and a DistributionService would really be. Both deliver a series of TCP packets that become a file when assembled back on the client's system. Both involve downloading something from a URI. One can build a simple REST API by simply posting JSON files under URIs that show the relationships between them. A REST API does in fact deliver representations of datasets that are transported as files. Hm.

dr-shorthair commented 3 hours ago
DataDistributionServices, like instances of OGC's Web Feature Service, do not appear to be the same as a Distribution, at least not to users. WFS accepts a query and responds with a file. These kinds of service have a long history, predating the REST theory.

I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them. However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters.
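To give a feel for the scale involved, here is a small Python sketch. The parameter names and values are entirely made up for illustration (they are not taken from WFS or any real service); the point is only that the number of potential distributions is the product of the choices for each query parameter.

```python
from itertools import product
from math import prod

# Hypothetical query parameters for an imaginary service.
PARAMETERS = {
    "feature_type": ["roads", "rivers", "buildings"],
    "output_format": ["gml", "geojson", "csv"],
    "crs": ["EPSG:4326", "EPSG:3857"],
    "region": [f"tile-{n}" for n in range(100)],
}


def count_potential_distributions(parameters):
    """Number of distinct parameter combinations, without enumerating them."""
    return prod(len(values) for values in parameters.values())


def enumerate_potential_distributions(parameters):
    """Lazily yield one dict per combination of query parameter values."""
    names = list(parameters)
    for combo in product(*parameters.values()):
        yield dict(zip(names, combo))


total = count_potential_distributions(PARAMETERS)
first = next(enumerate_potential_distributions(PARAMETERS))
print(total)  # 3 * 3 * 2 * 100 = 1800
print(first)
```

Even this toy service would need 1800 catalog entries if every combination were listed as a separate distribution, and free-form filters or bounding boxes make the set effectively unbounded.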

GitHub Notification of comment by dr-shorthair

Received on Tuesday, 25 September 2018 03:31:49 UTC