Re: [dxwg] How to specify the number of records in a dataset (#1571)

@nichtich the two examples are interesting:

### example a)
https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq - number of landmarks (very domain-specific unit)

The "size information" is actually part of the description, not an independent number. From the description I am also not sure that the dataset publisher intends to share a single number:
     
    - 10,433 detected landmarks
    - 62,598 augmented landmarks
    - 73,031 total landmarks.

But I believe the publisher wanted to explain the nature of the data, and the numbers happened to fit into the textual description.

Note that this also ties the description of the dataset to its size. That implies the dataset is intended to remain static.


### example b)
https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz - number of rows and columns (very generic unit)

The portal allows exporting it as CSV, RDF, or XML. So the size here is not a metadata value but a service offered by the portal when it can serve the data directly. I assume it is calculated dynamically (or on upload by the publisher).  
That means you get some size indication for CSV but not for RDF. If I am a dataset publisher offering two distributions, and I have to provide a size only for the CSV one and not for the RDF one, what does that mean for my RDF users? Am I, as publisher, providing a lower-quality service or an equal-quality one?

These are important questions, because in the end publishers should be instructed to apply the same metadata quality to all entities they share. 
If a publisher added a format indication for one distribution but not for another, this would usually be considered problematic. (This relates to the challenges I mentioned.)
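
As a rough illustration of that consistency concern, the following Python sketch flags properties that are present on some distributions of a dataset but missing on others. The function name, the dictionary layout, and the property keys `format` and `size` are assumptions made for this example only; they are not taken from DCAT or from any portal API.

```python
def unevenly_described(distributions):
    """Map each property to the distributions that lack it, when some have it."""
    all_properties = set().union(*(meta.keys() for meta in distributions.values()))
    gaps = {}
    for prop in sorted(all_properties):
        missing = sorted(name for name, meta in distributions.items() if prop not in meta)
        if missing:
            gaps[prop] = missing
    return gaps


# Hypothetical metadata for two distributions of one dataset.
example = {
    "csv": {"format": "text/csv", "size": "15.4K"},
    "rdf": {"format": "application/rdf+xml"},
}
print(unevenly_described(example))
```

With this hypothetical input the check reports `size` as missing for the RDF distribution, which is exactly the situation a publisher would have to justify.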

In general, it should be clear what the following statements mean, without additional explanation.
```
<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "10,433".
<https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq> _:size "62,598".

<https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz> _:size "15.4K".
``` 
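
To make the ambiguity concrete, here is a minimal Python sketch that lists the numeric readings a consumer could plausibly derive from such literals. It assumes, for illustration only, that `,` and `.` may each be either a decimal mark or a thousands separator, and that `K` may mean 1000 or 1024. The helper `plausible_readings` is hypothetical and not part of any specification or library.

```python
import re
from decimal import Decimal

# Assumption: "K" could mean a decimal (1000) or a binary (1024) multiplier.
SUFFIX_FACTORS = {"": (1,), "K": (1000, 1024)}


def plausible_readings(literal):
    """Return every numeric value the size literal could plausibly denote."""
    match = re.fullmatch(r"(\d+)(?:([.,])(\d+))?([Kk]?)", literal.strip())
    if match is None:
        return set()
    whole, separator, fraction, suffix = match.groups()
    bases = set()
    if separator is None:
        bases.add(Decimal(whole))
    else:
        # Reading 1: the separator is a decimal mark.
        bases.add(Decimal(f"{whole}.{fraction}"))
        # Reading 2: the separator groups thousands (needs exactly three digits).
        if len(fraction) == 3:
            bases.add(Decimal(whole + fraction))
    return {base * factor for base in bases for factor in SUFFIX_FACTORS[suffix.upper()]}


for literal in ("10,433", "62,598", "15.4K"):
    print(literal, "->", sorted(plausible_readings(literal)))
```

Under these assumptions each literal has two candidate values, and none of them says whether the number counts landmarks, rows, or bytes.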

PS. I clicked around randomly in data.europa.eu and could not find any examples. Maybe bad luck, but it also indicates that the size is not often provided. That is why I asked for example portals where size is an important and critical feature for the functioning of that data community. In the NASA data portal the size provisioning is ad hoc and probably depends on the dataset owner. I would like to see, for instance, data portals that offer different access patterns or payment requirements based on size. At this moment the examples are only cases where either a) a publisher did some editorial work or b) the data is available in a data warehouse that calculates some number. I would really like to discuss more inspiring cases than these, because those use cases will drive publishers to provide more precise, higher-quality metadata.


But I see where you are heading: your request is to "officially" adopt `dct:extent` to document the size of a resource.  
As I wrote before, adopting such an abstract, broad property is not the challenge. For `dct:extent` this is even implicitly the case, as I hope the DXWG adopts terms from dcterms first and turns to another namespace only when no fit-for-purpose term is found there. 
I suggest that any profile builder apply that approach too.

The challenge is the request to harmonise the value space in some way.  
As the examples illustrate, there is no commonality yet. Thus the value space stays open, and the decisions are to be made by the implementing profile. 
If adopting this reasoning as a usage note helps the community, I do not object to adding it to the DCAT specification. It will not, however, relieve any implementer of the work of making their own profile rules. And I have the feeling that is what you are aiming for.


-- 
GitHub Notification of comment by bertvannuffelen
Please view or discuss this issue at https://github.com/w3c/dxwg/issues/1571#issuecomment-1631057410 using your GitHub account


Received on Tuesday, 11 July 2023 15:39:25 UTC