- From: Vladimir Alexiev via GitHub <sysbot+gh@w3.org>
- Date: Thu, 08 Mar 2018 09:21:44 +0000
- To: public-dxwg-wg@w3.org
VladimirAlexiev has just created a new issue for https://github.com/w3c/dxwg:

== Dataset subsets and size characteristics ==

Submitting a new USE CASE:

---

### Dataset subsets and size characteristics

Status:
Identifier: ID51 (proposed)
Creator: Vladimir Alexiev, Ontotext
Deliverable(s): DCAT1.1

## Tags

semantics statistics size

## Stakeholders

Data consumers often need to know how many entities of each kind are included in a dataset.

In an aggregation scenario, different subsets (parts of a dataset) need to be expressed, e.g. because they come from different data providers.

E.g. in the euBusinessGraph project we need to describe **Company** datasets from different providers, which properties are included in each (e.g. `ebg:isStartup, org:orgActivity`), and some partition info, e.g. "the dataset covers jurisdiction Italy" or "the dataset has 1000 Italian startups" (i.e. `rov:RegisteredOrganization` with `ebg:isStartup=true` and jurisdiction Italy).

## Problem statement

[DCAT 1.0](https://www.w3.org/TR/vocab-dcat/#Property:distribution_size) has only one size property, `dcat:byteSize`, which is pretty useless for describing any aspect of dataset content or value. It also has no means of expressing subsets.

## Existing approaches

[VOID statistics](https://www.w3.org/TR/void/#statistics) includes these `void:` counting props: `triples, entities, classes, properties, distinctSubjects, distinctObjects, documents`.

Very importantly, these can be used on [subsets](https://www.w3.org/TR/void/#subset) such as [classPartition and propertyPartition](https://www.w3.org/TR/void/#class-property-partitions), which provide a very powerful means to describe exactly what kinds of entities are present in the dataset, and how many. Thus I believe that subsets are instrumental in expressing the fine-grained content of a dataset.
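
To make the VOID partitions mentioned under Existing approaches more concrete, here is a minimal sketch that prints a VOID description with one class partition and one property partition as Turtle text. It is only an illustration: the dataset name `ex:companies`, the `ex:` and `ebg:` namespace URLs, the helper `void_description`, and both counts are hypothetical placeholders, not taken from euBusinessGraph data.

```python
# Sketch: emit a VOID description with class/property partitions as Turtle.
# The ebg: and ex: namespaces below are placeholders (assumptions), and the
# counts passed in at the bottom are made-up illustrative values.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "rov": "http://www.w3.org/ns/regorg#",
    "ebg": "http://example.org/ebg#",      # placeholder, not the real namespace
    "ex": "http://example.org/dataset/",   # hypothetical dataset namespace
}


def void_description(dataset, class_counts, property_counts):
    """Return Turtle text describing `dataset` with VOID partitions.

    class_counts maps a class (prefixed name) to a number of entities;
    property_counts maps a property (prefixed name) to a number of triples.
    """
    statements = ["a void:Dataset"]
    for cls, n in class_counts.items():
        statements.append(
            f"void:classPartition [ void:class {cls} ; void:entities {n} ]"
        )
    for prop, n in property_counts.items():
        statements.append(
            f"void:propertyPartition [ void:property {prop} ; void:triples {n} ]"
        )
    header = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
    body = f"{dataset} " + " ;\n    ".join(statements) + " ."
    return "\n".join(header + ["", body])


turtle = void_description(
    "ex:companies",
    {"rov:RegisteredOrganization": 5000},  # illustrative count
    {"ebg:isStartup": 1200},               # illustrative count
)
print(turtle)
```

Note that `void:triples` on a property partition counts the triples that use the property, not the entities for which it holds a particular value, so a statement such as "1000 Italian startups" is not captured by this sketch.
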
## Links

Schema.org issue https://github.com/schemaorg/schemaorg/issues/1855

## Requirements

Ability to express the fine-grained content of a dataset:

* Ability to express subsets of a dataset.
* Describe subsets by kind of entity (e.g. Companies vs Events) and/or entity characteristics (e.g. Italian companies, Startups, Startups in Italy).
* The kinds and characteristics should be expressed by URLs.
* Express the count of entities in a dataset or subset.
* Optionally, express other dataset size characteristics. E.g. in an RDF context, that's the number of triples and nodes.

Notes:

* It's pretty clear how to do this for RDF datasets (see VOID). The real challenge is how to do it for other datasets.
* I think it's mandatory to express subset characterization with URLs and not text.
* I think there's also a need to express the assertions used for characterization, e.g. `ebg:isStartup=true`.

## Related use cases

ID33, ID7, RDSAT, RSS. This one could be merged into ID33 to provide further details.

Please view or discuss this issue at https://github.com/w3c/dxwg/issues/161 using your GitHub account
Received on Thursday, 8 March 2018 09:21:46 UTC