Re: [dxwg] How to specify the number of records in a dataset (#1571) from Bert Van Nuffelen via GitHub on 2023-07-11 (public-dxwg-wg@w3.org from July 2023)

From: Bert Van Nuffelen via GitHub <sysbot+gh@w3.org>
Date: Tue, 11 Jul 2023 08:59:46 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issue_comment.created-1630435531-1689065984-sysbot+gh@w3.org>

@all, somewhat as expected, there are very different, yet specific, expectations of what size means.

I observe the following:

1. the number of entities in a distribution (e.g. coins)
2. the number of data structure elements in a distribution (e.g. rows)
3. a qualification of the number (e.g. small, medium, ...)
4. the effect on the storage infrastructure (e.g. inodes)

To get a harmonised view, the size would need to be a complex datatype with the following properties:
- value: the number
- unit: what is counted
- method: the method of counting
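
As a minimal sketch of what such a complex datatype could look like (the class name `Size`, the helper `comparable_with`, and the example unit and method strings are all hypothetical illustrations, not part of DCAT or any proposal in this thread):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    """Hypothetical structured size value; not defined by DCAT."""
    value: float  # the number
    unit: str     # what is counted, e.g. "coins", "rows", "bytes"
    method: str   # the method of counting, e.g. "exact", "estimate"

    def comparable_with(self, other: "Size") -> bool:
        # Two sizes can only be compared meaningfully when they count
        # the same thing using the same method.
        return self.unit == other.unit and self.method == other.method


coins = Size(100, "coins", "exact")
rows = Size(100, "rows", "exact")
more_rows = Size(2500, "rows", "exact")

print(coins.comparable_with(rows))      # False
print(rows.comparable_with(more_rows))  # True
```

The bare number 100 appears in both `coins` and `rows`, but only the unit and method tell a consumer whether the two values can be compared at all.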

I see the following challenges:
- In my experience, sizes are complex to maintain. They typically require coding effort, so users need to give publishers a reason to spend that effort.
- Sizes used to compare different distributions only work if all publishers participate. When a substantial share of the publishers do not participate, the value drops, and the publishers that do provide the value lose motivation.
- Data portals harvest from different ecosystems (e.g. consider data.europa.eu), so you get climate datasets with footprints of TBs next to a public statistic in an Excel file and next to road network APIs. This gives a very heterogeneous size experience, with a very diverse user base, so the semantics of the size property should allow for this case.

From this I see _sizes_ featuring more in a specific profile of DCAT for a specific use case.
I think the diversity makes it hard to come up with a consolidated approach. I also believe that my suggestion of an extended datatype will not be adopted because it will be perceived as too complicated. But introducing a property with a value space that is ambiguous to interpret (e.g. does "100" mean 100 coins, 100 records, or 100 TB?) is also not a good idea.
Therefore it is better that each ecosystem defines, in its own namespace, the size property that fits its needs. That gets the best of both worlds: the ecosystem can express it, and the semantics are clear. And if the profile is well published, then anyone can interpret it.

If the semantics of size are left to the ecosystem to define in a profile, then my opinion is not to include the property in DCAT, but to push it to the profile immediately. Introducing "abstract properties" that should not be used is not very helpful.

@agreiner introduces an interesting notion: the "usefulness of a dataset". That is, I think, the key to the story here. Size could play a role in such an assessment, but that is very user- and use-case-specific. I might be biased, but I think size is overrated in this assessment. Other properties will probably play a more important role (as size is not often provided; cf. the challenges I listed).

I think it would be good to provide evidence from existing data portals and communities where size is a critical and well-maintained property, before introducing a property.

--
GitHub Notification of comment by bertvannuffelen
Please view or discuss this issue at https://github.com/w3c/dxwg/issues/1571#issuecomment-1630435531 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Tuesday, 11 July 2023 08:59:48 UTC