[dxwg] Dataset subsets and size characteristics from Vladimir Alexiev via GitHub on 2018-03-08 (public-dxwg-wg@w3.org from March 2018)

From: Vladimir Alexiev via GitHub <sysbot+gh@w3.org>
Date: Thu, 08 Mar 2018 09:21:44 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issues.opened-303409236-1520500903-sysbot+gh@w3.org>

VladimirAlexiev has just created a new issue for https://github.com/w3c/dxwg:

== Dataset subsets and size characteristics ==
Submitting a new USE CASE:

---
### Dataset subsets and size characteristics
Status:

Identifier: ID51 (proposed)

Creator: Vladimir Alexiev, Ontotext

Deliverable(s): DCAT1.1

## Tags
semantics statistics size

## Stakeholders
Data consumers often need to know how many entities of each kind a dataset contains.
In an aggregation scenario, different subsets (parts of a dataset) need to be described, e.g. because they come from different data providers.

E.g. in the euBusinessGraph project we need to describe **Company** datasets from different providers:
which properties each one includes (e.g. `ebg:isStartup`, `org:orgActivity`),
and some partition info, e.g. "the dataset covers jurisdiction Italy" or "the dataset has 1000 Italian startups"
(i.e. `rov:RegisteredOrganization` with `ebg:isStartup=true` and jurisdiction Italy).

## Problem statement
[DCAT 1.0](https://www.w3.org/TR/vocab-dcat/#Property:distribution_size) has only one size property, `dcat:byteSize`, which says nothing about a dataset's content or value.

It also has no means of expressing subsets.

## Existing approaches
[VoID statistics](https://www.w3.org/TR/void/#statistics) defines these `void:` counting properties: `triples`, `entities`, `classes`, `properties`, `distinctSubjects`, `distinctObjects`, `documents`.

Very importantly, these can be used on [subsets](https://www.w3.org/TR/void/#subset) such as [classPartition and propertyPartition](https://www.w3.org/TR/void/#class-property-partitions), which provide a very powerful means of describing exactly what kinds of entities are present and how many of each are in the dataset.
Thus I believe that subsets are instrumental in expressing the fine-grained content of a dataset.
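
To make the partition idea concrete, here is a minimal sketch in Python that renders a VoID description with one `void:classPartition` per entity class as Turtle text. The helper name `void_description`, the dataset IRI `ex:companies`, the `ex:` namespace and all the counts are assumptions made up for illustration; only the `void:` terms and `rov:RegisteredOrganization` come from the vocabularies mentioned above.

```python
# Sketch: render a VoID description with class partitions as Turtle text.
# The dataset IRI, the ex: namespace and all counts are made up for illustration.

PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "rov": "http://www.w3.org/ns/regorg#",
    "ex": "http://example.org/",
}


def void_description(dataset, total_entities, class_counts):
    """Return Turtle describing `dataset` with one void:classPartition per class."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    # Each statement is a predicate-object pair attached to the dataset.
    statements = [f"void:entities {total_entities}"]
    for cls, count in class_counts.items():
        statements.append(
            f"void:classPartition [ void:class {cls} ; void:entities {count} ]"
        )
    lines.append(f"{dataset} a void:Dataset ;")
    lines.append(" ;\n".join(f"    {s}" for s in statements) + " .")
    return "\n".join(lines)


turtle = void_description(
    "ex:companies",
    total_entities=1200,
    class_counts={"rov:RegisteredOrganization": 1000, "ex:Event": 200},
)
print(turtle)
```

Note that a class partition only captures the kind of entity. A condition such as `ebg:isStartup=true` is not expressible this way, which is part of what this use case asks for.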

## Links
Schema.org issue: https://github.com/schemaorg/schemaorg/issues/1855

## Requirements
Ability to express the fine-grained content of a dataset:
* Ability to express subsets of a dataset.
* Describe subsets by kind of entity (e.g. Companies vs Events) and/or entity characteristics (e.g. Italian companies, Startups, Startups in Italy)
* The kinds and characteristics should be expressed by URLs
* Express the count of entities in a dataset or subset
* Optionally, express other dataset size characteristics, e.g. in an RDF context, the number of triples and nodes

Notes:
* It's pretty clear how to do this for RDF datasets (see VoID). The real challenge is how to do it for other datasets.
* I think it's mandatory to express subset characterization with URLs and not text.
* I think there's also a need to express assertions to be used for characterization, e.g. `ebg:isStartup=true`
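
As a rough illustration of the requirements and notes above, independent of RDF, the following Python sketch models a subset as an entity type URL plus a set of (property URL, value) assertions, with an optional entity count. Everything here is hypothetical: the `Subset` class, the `find_count` helper and the `http://example.org/ns#` placeholder namespace are not part of any proposal, and the real `ebg:` and jurisdiction URLs are not used. The count of 1000 is the "Italian startups" figure from the stakeholder example.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Tuple

# Placeholder namespace (assumption): stands in for the real vocabulary URLs.
EX = "http://example.org/ns#"

# An assertion is a (property URL, value) pair; the value is a URL or a literal.
Assertion = Tuple[str, str]


@dataclass
class Subset:
    """A dataset or subset, characterized by URLs rather than free text."""
    entity_type: str                               # URL of the kind of entity
    assertions: FrozenSet[Assertion] = frozenset() # characteristics of the entities
    entity_count: Optional[int] = None             # None means "not stated"
    subsets: List["Subset"] = field(default_factory=list)


def find_count(root, entity_type, assertions):
    """Return the stated count of the subset matching type and assertions exactly."""
    wanted = frozenset(assertions)
    stack = [root]
    while stack:
        node = stack.pop()
        if node.entity_type == entity_type and node.assertions == wanted:
            return node.entity_count
        stack.extend(node.subsets)
    return None


COMPANY = EX + "Company"
IS_STARTUP = (EX + "isStartup", "true")
IN_ITALY = (EX + "jurisdiction", EX + "Italy")

italian_startups = Subset(COMPANY, frozenset({IS_STARTUP, IN_ITALY}), 1000)
italian = Subset(COMPANY, frozenset({IN_ITALY}), subsets=[italian_startups])
dataset = Subset(COMPANY, subsets=[italian])

startup_count = find_count(dataset, COMPANY, [IS_STARTUP, IN_ITALY])
print(startup_count)
```

Keeping assertions as URL-based pairs means the same structure could describe a CSV or JSON dataset as well as an RDF one, though how such pairs would map onto DCAT terms is left open here.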

## Related use cases
ID33, ID7, RDSAT, RSS.

This one could be merged into ID33 to provide further details.

Please view or discuss this issue at https://github.com/w3c/dxwg/issues/161 using your GitHub account

Received on Thursday, 8 March 2018 09:21:46 UTC