Re: [dxwg] How to express distributions provided as compressed files

From: Jakub Klímek via GitHub <sysbot+gh@w3.org>
Date: Tue, 26 Jun 2018 09:09:04 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issue_comment.created-400237521-1530004143-sysbot+gh@w3.org>
@makxdekkers Let's look at example `dcat:Distribution`s for each case.

Note that neither the [File Types](http://publications.europa.eu/mdr/resource/authority/file-type/html/filetypes-eng.html) codelist, which is mandatory in DCAT-AP, nor the official [IANA Media Types list](https://www.iana.org/assignments/media-types/media-types.xhtml) is exhaustive, so we need to use both.

The simplest case is an uncompressed CSV file (the server may apply HTTP gzip compression when the client supports it, but that is transparent to DCAT). There is a [CSV on the Web](https://www.w3.org/TR/tabular-metadata/) JSON descriptor of the CSV file in `2007.json`:
```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> .
```

Now let's add the explicit `.gz` compression of the CSV file and let's assume I use `adms:representationTechnique` for the inner type:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/GZIP> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .
```
1. There is no way to specify the original `http://publications.europa.eu/resource/authority/file-type/CSV` file type (and media type), so people searching for CSV files will not find this distribution.
2. The fact that the distribution is CSV is far more interesting than the fact that it is a GZIP file. I wonder whether `dct:format` and `dcat:mediaType` should instead reflect the inner file, with the compression technique specified in `adms:representationTechnique`, so that people searching for CSV files would only need to check one property (`dcat:mediaType`), not two. This is also related to the next point.
3. The `dct:conformsTo` property specifies the JSON descriptor of the inner CSV file, not of the gzip file. This supports the point that the whole distribution description should focus on the inner file, with the compression indicated on top of that.
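
To make points 1 and 2 concrete, here is a minimal Python sketch, not tied to any DCAT tooling, that models distributions as plain dictionaries and searches them by `dcat:mediaType` only. The dictionary layout and the `find_by_media_type` helper are assumptions made purely for illustration.

```python
CSV_MEDIA = "http://www.iana.org/assignments/media-types/text/csv"
GZIP_MEDIA = "http://www.iana.org/assignments/media-types/application/gzip"
BASE = "https://mvcr1.opendata.cz/czechpoint/"

# The plain, uncompressed CSV distribution from the first example.
plain_csv = {"dcat:downloadURL": BASE + "2007.csv", "dcat:mediaType": CSV_MEDIA}

# Modelling as in the second example: outer type in dcat:mediaType,
# inner type in adms:representationTechnique.
outer_in_media_type = {
    "dcat:downloadURL": BASE + "2007.csv.gz",
    "dcat:mediaType": GZIP_MEDIA,
    "adms:representationTechnique": CSV_MEDIA,
}

# Modelling as wondered about in point 2: inner type in dcat:mediaType,
# compression pushed to the secondary property.
inner_in_media_type = {
    "dcat:downloadURL": BASE + "2007.csv.gz",
    "dcat:mediaType": CSV_MEDIA,
    "adms:representationTechnique": GZIP_MEDIA,
}


def find_by_media_type(distributions, media_type):
    """Hypothetical search that checks a single property, dcat:mediaType."""
    return [d for d in distributions if d.get("dcat:mediaType") == media_type]


hits_outer = find_by_media_type([plain_csv, outer_in_media_type], CSV_MEDIA)
hits_inner = find_by_media_type([plain_csv, inner_in_media_type], CSV_MEDIA)
```

With the outer type in `dcat:mediaType`, the single-property search misses the gzipped CSV; with the inner type there, it finds both distributions.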

I would therefore suggest the following (the new properties could be named differently, or existing ones could be reused if suitable ones are found):
```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .
```

Next, the packaging of multiple files. Let's assume that we have a TAR package with a set of homogeneous CSV files inside (e.g. data for individual years). Note that ZIP can also be used here, as a packager rather than for compression:
```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:format <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # There is no IANA dcat:mediaType for TAR
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .
```
The same points as with the gzip compression above apply here. In addition:

4. There is no indication that there are multiple files in the package. This could be solved by introducing separate properties for packaging technique and for compression technique. The use of the packaging property would indicate there are multiple files inside.

Therefore, I would suggest:
```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    # For TAR there is no media type, but e.g. for ZIP there is:
    # dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> .
```
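
Point 4 can be illustrated with a small Python sketch of how a consumer might use the presence of a packaging property as the signal for "multiple files inside". The `dcat:package*` property names are the hypothetical ones proposed in this comment, and the dictionary representation and helper name are assumptions for illustration.

```python
FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"

# Hypothetical property names proposed in this comment, not existing DCAT terms.
PACKAGING_PROPERTIES = ("dcat:packageFormat", "dcat:packageMediaType")


def may_contain_multiple_files(distribution):
    """True when the description declares a packaging layer."""
    return any(prop in distribution for prop in PACKAGING_PROPERTIES)


# A single gzipped CSV file: compression only, no packaging.
single_gz = {
    "dct:format": FILE_TYPE + "CSV",
    "dcat:compressionFormat": FILE_TYPE + "GZIP",
}

# A TAR package of CSV files: packaging only, no compression.
tar_package = {
    "dct:format": FILE_TYPE + "CSV",
    "dcat:packageFormat": FILE_TYPE + "TAR",
}
```

Keeping packaging and compression in separate properties means this check does not need to know which formats are packagers and which are compressors.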

Finally, the packaging and compression case. This means multiple CSV files, and for instance TAR packaging and GZIP compression, or ZIP packaging and ZIP compression. Here we need to specify 3 levels - CSV, TAR and GZIP. So I would suggest:
```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # For TAR there is no media type, but e.g. for ZIP there is:
    # dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .
```
This gives the publishers the possibility to describe the distribution properly, and the original DCAT properties are still used for the most important format, which is the innermost one.
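
As a rough illustration of how a consumer could act on such a three-level description, the following Python sketch peels the layers from the outside in: compression, then packaging, then the inner format. It uses only the standard library. The sample archive contents, the dictionary form of the description, and the `read_tables` helper are all made up for this example, and the `dcat:packageFormat` and `dcat:compressionFormat` properties are the hypothetical ones proposed above.

```python
import csv
import gzip
import io
import tarfile

FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"

# The three-level description, using the property names proposed above.
description = {
    "dct:format": FILE_TYPE + "CSV",
    "dcat:packageFormat": FILE_TYPE + "TAR",
    "dcat:compressionFormat": FILE_TYPE + "GZIP",
}


def make_sample_tar_gz():
    """Build a tiny in-memory data.tar.gz holding two made-up CSV files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, text in [("2007.csv", "a,b\n1,2\n"), ("2008.csv", "a,b\n3,4\n")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return gzip.compress(buffer.getvalue())


def read_tables(payload, desc):
    """Undo compression, then packaging, then parse the inner CSV files."""
    if desc.get("dcat:compressionFormat") == FILE_TYPE + "GZIP":
        payload = gzip.decompress(payload)
    if desc.get("dcat:packageFormat") == FILE_TYPE + "TAR":
        with tarfile.open(fileobj=io.BytesIO(payload), mode="r:") as tar:
            files = {m.name: tar.extractfile(m).read()
                     for m in tar.getmembers() if m.isfile()}
    else:
        files = {"data": payload}  # no packaging layer: a single file
    if desc.get("dct:format") != FILE_TYPE + "CSV":
        raise ValueError("this sketch only parses CSV")
    return {name: list(csv.reader(io.StringIO(raw.decode("utf-8"))))
            for name, raw in files.items()}


tables = read_tables(make_sample_tar_gz(), description)
```

Because each layer has its own property, the same function also handles a plain CSV, a gzipped CSV, or an uncompressed TAR simply by leaving the corresponding keys out of the description.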

Of course, the `dcat:compressionMediaType`, `dcat:compressionFormat`, `dcat:packageMediaType` and `dcat:packageFormat` properties can be replaced by existing ones, if suitable ones are found.

-- 
GitHub Notification of comment by jakubklimek
Please view or discuss this issue at https://github.com/w3c/dxwg/issues/259#issuecomment-400237521 using your GitHub account
Received on Tuesday, 26 June 2018 09:09:12 UTC