- From: Nick Doty <ndoty@cdt.org>
- Date: Thu, 14 Jul 2022 13:41:18 -0400
- To: public-privacy@w3.org
Hiya PING,

With Lubna Dajani, I've been reviewing DCAT 3, an updated Data Catalog Vocabulary spec. My notes are below; happy to have any PING feedback (and we can discuss on the upcoming call next Thursday, July 21st), and then I'll pass it along to the Dataset Exchange Working Group.

Cheers,
Nick

DCAT 3
https://www.w3.org/TR/2022/WD-vocab-dcat-3-20220510/

Data Catalog Vocabulary is a vocabulary for machine-readable metadata about catalogs of datasets and data services. The purpose of this metadata is to ease discovery of datasets and federated search across data catalogs.

Vocabularies pose a different sort of challenge in understanding and assessing the privacy and security impacts of a standard. The WG completed the self-review questionnaire but found much of it not applicable, as many of those questions are very specific to the browsable Web context. The existing privacy and security considerations sections note the possibility of data about people being included in these metadata files (for example, the creators of a dataset), but defer all privacy and security protection to the applications that create or make use of the data.

Even recognizing that details may be application- or dataset-specific, can we consider the privacy of the data subjects, the people who might be depicted in these datasets, in addition to the privacy of the creators of the records? Are there fields or features that should be defined in a common vocabulary in order to enable applications that use DCAT to be more privacy-friendly?

For example, ID17 [0] in the list of use cases considers the particular case of communicating restrictions on access to a particular dataset, because it might contain information about people or particularly sensitive information. While license and rights terms are available in the vocabulary [1], there seem to be very detailed terminologies available for different copyright licenses, and very little in terms of access limits (examples seem to be mostly: public, restricted, closed). Should dataset distributors and catalogers have a way to indicate whether a dataset includes information about people, or sensitive information about people, or individualized vs. aggregated records about people?

Related use cases might be identifying datasets with particularly sensitive data so that it can be deleted in situations of danger (consider the historical, but still very relevant, example of the destruction of civil registry records in Amsterdam in 1943 [2]), or highlighting datasets with sensitive data in order to facilitate auditing of their privacy protections.

DCAT 3 explicitly considers the case of providing metadata for data services, rather than just full datasets. This seems like a valuable direction, especially for privacy in data analysis, as there may be many datasets that can be made useful while limiting privacy threats by providing interactive services rather than full data dumps. But it's not clear whether the vocabulary is detailed enough to describe the different kinds of data services, like searching over an encrypted dataset, differentially private queries with an associated privacy budget, or access to sample or synthesized data records or summary statistics for a restricted dataset. (The case study on application of differential privacy [3] in the Dataverse project [4] describes some applications of these, but it's not clear whether they're documented in dataset metadata yet.)
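To make the access-rights idea above a bit more concrete, here is a minimal sketch of how a catalog consumer might flag entries that aren't marked public and so may warrant review for information about people. This is only an illustration, not anything defined by DCAT: the example.org dataset identifiers and access-right values are placeholders, and rdflib is just one way a consumer might read this metadata.

    # Minimal sketch (assumptions: rdflib installed; example.org URIs and the
    # "public"/"restricted" access-right values are placeholders, not DCAT terms).
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    DCT = Namespace("http://purl.org/dc/terms/")

    catalog_ttl = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <https://example.org/dataset/household-survey>
        a dcat:Dataset ;
        dct:title "Household survey microdata (illustrative)" ;
        dct:accessRights <https://example.org/access-right/restricted> .

    <https://example.org/dataset/street-trees>
        a dcat:Dataset ;
        dct:title "Street tree inventory (illustrative)" ;
        dct:accessRights <https://example.org/access-right/public> .
    """

    g = Graph()
    g.parse(data=catalog_ttl, format="turtle")

    # Flag every dataset whose declared access rights are anything other than
    # the placeholder "public" value, as a crude proxy for "may need review".
    PUBLIC = "https://example.org/access-right/public"
    for dataset in g.subjects(RDF.type, DCAT.Dataset):
        rights = g.value(dataset, DCT.accessRights)
        if rights is None or str(rights) != PUBLIC:
            print(f"review access controls for: {dataset} (accessRights={rights})")

A shared, well-defined code list for annotations like these, rather than ad hoc values, is exactly the kind of vocabulary gap described above.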
As a security matter, it's not clear how the authenticity or integrity of metadata files or the associated datasets is assured. A checksum property for the dataset file is available (new in DCAT 3), but there seems to be a risk of a kind of downgrade attack here: someone tampering with the dataset might at the same time be able to tamper with the metadata and its checksum property. Authenticity and integrity might be important security properties to consider; signatures, and potentially use of a public key infrastructure, might make it possible for a consumer of a dataset to confirm that they know who it came from and that they received it without tampering.

Datasets and how they're exchanged and made interoperable are likely to have significant privacy implications, most especially for the people described in those datasets. It may not be immediately obvious how to apply the same privacy practices that we apply in the interactive, browsable Web context, but I hope we can come up with strong privacy and security considerations for standardization work in this area.

Thanks to Lubna Dajani for helping with this review; all errors in this write-up are mine.

References:
[0] https://www.w3.org/TR/dcat-ucr/#ID17
[1] https://www.w3.org/TR/2022/WD-vocab-dcat-3-20220510/#license-rights
[2] https://en.wikipedia.org/wiki/1943_bombing_of_the_Amsterdam_civil_registry_office
[3] https://admindatahandbook.mit.edu/book/v1.0/diffpriv.html#dataverse-repositories
[4] https://dataverse.harvard.edu/

--
Nick Doty | https://npdoty.name
Senior Fellow, Internet Architecture
Center for Democracy & Technology | https://cdt.org