- From: Nick Doty <ndoty@cdt.org>
- Date: Thu, 14 Jul 2022 13:41:18 -0400
- To: public-privacy@w3.org
Hiya PING,

With Lubna Dajani, I've been reviewing DCAT 3, an updated Data Catalog Vocabulary spec. My notes are below; happy to have any PING feedback (and we can discuss on the upcoming call next Thursday, July 21st), and then I'll pass it along to the Dataset Exchange Working Group.

Cheers,
Nick

DCAT 3
https://www.w3.org/TR/2022/WD-vocab-dcat-3-20220510/

Data Catalog Vocabulary is a vocabulary for machine-readable metadata about catalogs of datasets and data services. The purpose of this metadata is to ease discovery of datasets and federated search across data catalogs.

Vocabularies pose a different sort of challenge in understanding and assessing the privacy and security impacts of a standard. The WG completed the self-review questionnaire but found much of it not applicable, as many of those questions are very specific to the browsable Web context. The existing privacy and security considerations sections note the possibility of data about people being included in these metadata files (for example, the creators of a dataset), but defer all privacy and security protection to the applications that create or make use of the data.

Even recognizing that details may be application- or dataset-specific, can we consider the privacy of the data subjects, the people who might be depicted in these datasets, in addition to the privacy of the creators of the records? Are there fields or features that should be defined in a common vocabulary in order to enable applications that use DCAT to be more privacy-friendly?

For example, ID17 [0] in the list of use cases considers the particular case of communicating restrictions on access to a particular dataset, because it might contain information about people or particularly sensitive information. While license and rights terms are available in the vocabulary [1], there seem to be very detailed terminologies available for different copyright licenses, and very little in terms of access limits (examples seem to be mostly: public, restricted, closed). Should dataset distributors and catalogers have a way to indicate whether a dataset includes information about people, or sensitive information about people, or individualized vs. aggregated records about people?

Related use cases might be identifying datasets with particularly sensitive data so that it can be deleted in situations of danger (consider the historical, but still very relevant, example of the destruction of civil registry records in Amsterdam in 1943 [2]), or highlighting datasets with sensitive data in order to facilitate auditing of their privacy protections.

DCAT 3 explicitly considers the case of providing metadata for data services, rather than just full datasets. This seems like a valuable direction, especially for privacy in data analysis, as there may be many datasets that can be made useful while limiting privacy threats by providing interactive services rather than full data dumps. But it's not clear whether the vocabulary is detailed enough to describe the different kinds of data services, like searching over an encrypted dataset, differentially private queries with an associated privacy budget, or access to sample or synthesized data records or summary statistics for a restricted dataset. (The case study on application of differential privacy [3] in the Dataverse project [4] describes some applications of these, but it's not clear whether they're documented in dataset metadata yet.)
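To make the access-rights idea above a bit more concrete, here is a minimal sketch of how a catalog consumer might flag entries that aren't marked public and so may warrant review for information about people. This is only an illustration, not anything defined by DCAT: the example.org dataset identifiers and access-right values are placeholders, and rdflib is just one way a consumer might read this metadata.

    # Minimal sketch (assumptions: rdflib installed; example.org URIs and the
    # "public"/"restricted" access-right values are placeholders, not DCAT terms).
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    DCT = Namespace("http://purl.org/dc/terms/")

    catalog_ttl = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <https://example.org/dataset/household-survey>
        a dcat:Dataset ;
        dct:title "Household survey microdata (illustrative)" ;
        dct:accessRights <https://example.org/access-right/restricted> .

    <https://example.org/dataset/street-trees>
        a dcat:Dataset ;
        dct:title "Street tree inventory (illustrative)" ;
        dct:accessRights <https://example.org/access-right/public> .
    """

    g = Graph()
    g.parse(data=catalog_ttl, format="turtle")

    # Flag every dataset whose declared access rights are anything other than
    # the placeholder "public" value, as a crude proxy for "may need review".
    PUBLIC = "https://example.org/access-right/public"
    for dataset in g.subjects(RDF.type, DCAT.Dataset):
        rights = g.value(dataset, DCT.accessRights)
        if rights is None or str(rights) != PUBLIC:
            print(f"review access controls for: {dataset} (accessRights={rights})")

A shared, well-defined code list for annotations like these, rather than ad hoc values, is exactly the kind of vocabulary gap described above.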
As a security matter, it's not clear how the authenticity or integrity of metadata files or the associated datasets is assured. A checksum property for the dataset file is available (new in DCAT 3), but there seems to be a risk of a kind of downgrade attack here: someone tampering with the dataset might at the same time be able to tamper with the metadata and its checksum property. Authenticity and integrity might be important security properties to consider; signatures, and potentially use of a public key infrastructure, might make it possible for a consumer of a dataset to confirm that they know who it came from and that they received it without tampering.

Datasets and how they're exchanged and made interoperable are likely to have significant privacy implications, most especially for the people described in those datasets. It may not be immediately obvious how to apply the same privacy practices that we apply in the interactive, browsable Web context, but I hope we can come up with strong privacy and security considerations for standardization work in this area.

Thanks to Lubna Dajani for helping with this review; all errors in this write-up are mine.

References:
[0] https://www.w3.org/TR/dcat-ucr/#ID17
[1] https://www.w3.org/TR/2022/WD-vocab-dcat-3-20220510/#license-rights
[2] https://en.wikipedia.org/wiki/1943_bombing_of_the_Amsterdam_civil_registry_office
[3] https://admindatahandbook.mit.edu/book/v1.0/diffpriv.html#dataverse-repositories
[4] https://dataverse.harvard.edu/

--
Nick Doty | https://npdoty.name
Senior Fellow, Internet Architecture
Center for Democracy & Technology | https://cdt.org