- From: Nick Doty <ndoty@cdt.org>
- Date: Mon, 25 Oct 2021 19:50:40 -0400
- To: public-epub-wg@w3.org, public-privacy@w3.org
We had a useful, interesting discussion at a recent PING call [0] where I tried to walk through some questions and some initial privacy concerns about EPUB 3.3 [1]. I've written those up more directly here to facilitate further discussion. This write-up is belated, but I hope it will be useful for the joint meeting tomorrow:

26 October 2021, 15:00 - 16:00 UTC
https://www.w3.org/events/meetings/2e397c32-b5e1-4f5b-8789-741d09c285ed

Those comments fall into roughly 5 categories: self-contained packages; interactivity; DRM; information exposure and fingerprintability; ebooks and the browseable web. As is my style, I've written these primarily as questions.

Apologies that this is long and that some of this is repeating questions from our earlier conversation.

Cheers,
Nick Doty
CDT

[0] https://www.w3.org/Privacy/IG/summaries/PING-minutes-20210916
[1] https://w3c.github.io/epub-specs/epub33/core/
I looked primarily at this draft of the core spec, but maybe the reading systems or other associated specs would also be relevant.

## self-contained packages

Self-contained packages have potentially huge privacy advantages, but it's not clear that the EPUB spec or current implementations fulfill these opportunities. Is that a goal that the community could work towards?

The privacy advantages for readers are that self-contained packages could have the same contents for all readers, be consumed and shared with others offline, and not reveal data about reading habits to authors, to publishers, to retailers, to reading system vendors or to other parties. That would have many analogs to the privacy of reading a physical book, like one checked out from the library, purchased at a book store or borrowed from a friend.

The current spec anticipates and requires (at least a SHOULD) loading of remote resources from arbitrary origins.
This introduces risks of additional data collection about who is reading what book (and from where), and what part of the book is being read at a particular moment (depending on the implementation or requirements on how remote resources are loaded). The text on remote resource loading should also make explicit that the author/publisher of the book may effectively be collecting data on the reader's habits, in addition to the reading system.

Different levels of scripting access are defined, but it's not clear whether any such level would indicate that user reading behavior would not be disclosed.

It would be useful to specify a privacy threat model specific to EPUB, to the extent that it varies from the Web. Can we guarantee that reading habits will not be surveilled, by the publisher, the retailer, the reading system, or other parties? Or if that data is revealed, then we should clarify to whom or under what conditions. Book-like privacy could be achieved, but would require significant changes from the current spec and current popular implementations.

The spec suggests that the manifest is an exhaustive pre-stated list of resources, including remote resources, but it's not clear how that's intended to be handled by reading systems. Should a reading system refuse to fetch any remote resource not included in the manifest? Are remote resources intended to be the same for all readers of a book, or might they be personalized?

Remote resources loaded over insecure transports would risk exposing more detailed reading behavior to parties on the network. Loading resources over insecure transports also risks the integrity of the EPUB's contents: books could be altered in ways that are harmful to the reader, or (in the case of scripts) personal data about the reader could be further disclosed.

## interactivity

What is the model of interactivity? How should users know or control with whom they are interacting?

Is integrity or authenticity provided?
How does the reader know who authored an epub, and confirm that it wasn't altered? (The Web makes use of HTTPS, the origin concept and Certificate Authorities, among other technology, to provide users with confidence about integrity and authenticity.) Do digital signatures as defined in the spec provide integrity or authenticity of a book? To what extent does that match the guarantees of the web model (a known origin, no mixed content, confidentiality of communication contents)? Would ongoing work on signed exchanges be helpful?

Do epubs allow entry of user-generated text? Does that text remain local? How does a user distinguish between interactivity that is provided by the reading system and interactivity that is provided by the book itself? When are they communicating with which piece of software? Annotations -- including highlights, margin notes, answers to in-book surveys, etc. -- can reveal very sensitive information that a reader might not wish to disclose to anyone else.

Do reading systems distinguish chrome in a way that provides security to the end user? Do ebooks typically display at full screen? Can they mimic websites and phish users? Our Web privacy guidance typically includes questions about "native UI" to cover cases like these: if there is no distinction between the UI provided by the user agent and the UI of the browseable content itself, then an interactive web site (or ebook) can effectively pretend to be a different site, and phish user credentials, for example. This would be a particular concern if epub reading functionality were provided by a web browser and users became accustomed to clicking links in ebooks to continue browsing elsewhere.

## privacy and drm

What are the privacy impacts of DRM and how can they be mitigated?

EPUBs are regularly distributed (after purchase, or perhaps more accurately, paying for a limited license to read) with the contents fully encrypted and mediated by a DRM system (in particular, Adobe Digital Editions).
Public reporting indicates that EPUB DRM includes collection of unique identifiers tied to the user's hardware, as well as personally identifying information about the user (like an email address), and then makes regular updates on where the user reads the book, how much of it they've read, how much time they've spent reading, and how much of the text they might have copied or printed, and that this data is regularly sent home without the user's understanding or control, perhaps over insecure transports. Maybe some of these issues have been subsequently addressed or mitigated, but it's not clear how users could have confidence in that.

DRM and DRM licensing also effectively prevent user transparency into the privacy properties of the EPUB file itself: users cannot inspect the EPUB files that they read, cannot use open source software to analyze them, and cannot customize software to limit how resources are loaded or how information about them is shared.

Direct, opaque surveillance of user reading behavior tied to unique identifiers and personal information, combined with a lack of capabilities for user transparency or control -- this would be a dire precedent for Web privacy and is inconsistent with the existing Web privacy model. CDT raised questions about the privacy impacts of DRM starting 15 years ago [2], and the current state does not seem to support user privacy.

[2] https://cdt.org/wp-content/uploads/copyright/20060907drm.pdf
See in particular: "transparency" and "collateral impact"

While this spec does not directly standardize DRM of EPUB files, it does design the file and package format in order to facilitate the DRM encryption of every file within the EPUB package. As a consequence of the design of this specification, users cannot typically look at the table of contents or the stylesheet of a purchased ebook.
When W3C has considered DRM in extensions to HTML, there have been specific measures taken to reduce the harm [3], including trying to limit the scope and mitigate the privacy and security problems. Those mitigations have included limits on the scope of identifiers, making data clearable, prohibiting network access, limiting access to identifiers from applications, and user transparency and consent.

[3] https://www.w3.org/TR/encrypted-media/#privacy

For what it's worth -- this isn't specifically privacy-related, but can have downstream effects -- DRM as applied to EPUB is currently dramatically limiting interoperability for end users. I have been unable to buy a book and then move it to another device and another ebook reading system, and I'm not aware of anyone else who has done this successfully; users are locked in to the app or device of their bookstore. This finding from the TAG seems relevant [4]:

"The constant competition and variety of choices that come from having multiple interoperable implementations means the web ecosystem is constantly improving."

[4] https://www.w3.org/2001/tag/doc/ethical-web-principles/#multi

### obfuscation

What value is provided by this obfuscation algorithm? Can this feature be marked at risk, given the uncertainty about whether it satisfies any vendor goals while increasing the complexity of the spec and making the source less easily inspectable by the reader?

Why does the creation of the obfuscation key based on the SHA-1 hash function include a SHOULD requirement rather than a MUST? This relaxation seems primarily to decrease interoperability.

The obfuscation section contains no requirements on reading systems. Maybe they are just implicitly supposed to de-obfuscate these resources in order to render the book as intended. Are reading systems expected not to provide unobfuscated access to these obfuscated files to users?

Is obfuscation limited to font files? Is there a reason that other Web technologies for fonts cannot be used?
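
To make the "what value is provided" question concrete, here is a minimal sketch of the kind of scheme under discussion. It assumes (this should be checked against the spec text) that the key is the SHA-1 digest of the package's unique identifier with whitespace removed, and that the key is XORed over only the first 1040 bytes of the resource. The function names and the example identifier are hypothetical.

```python
import hashlib
import re

# Assumption: only this many leading bytes of the resource are obfuscated.
OBFUSCATED_PREFIX_LEN = 1040


def obfuscation_key(unique_identifier: str) -> bytes:
    """Derive a 20-byte key: SHA-1 of the identifier with whitespace stripped."""
    # Assumption: the whitespace removed is U+0020, U+0009, U+000D, U+000A.
    cleaned = re.sub(r"[\x20\x09\x0D\x0A]", "", unique_identifier)
    return hashlib.sha1(cleaned.encode("utf-8")).digest()


def toggle_obfuscation(data: bytes, key: bytes) -> bytes:
    """XOR the leading bytes with the repeating key; the rest is untouched.

    XOR is its own inverse, so the same call obfuscates and de-obfuscates.
    """
    head = data[:OBFUSCATED_PREFIX_LEN]
    prefix = bytes(b ^ key[i % len(key)] for i, b in enumerate(head))
    return prefix + data[OBFUSCATED_PREFIX_LEN:]


if __name__ == "__main__":
    # Hypothetical identifier, as it might appear in a package document.
    key = obfuscation_key("urn:uuid: 12345678-1234-1234-1234-123456789abc")
    font = bytes(2000)  # stand-in for font file contents
    assert toggle_obfuscation(toggle_obfuscation(font, key), key) == font
```

Under these assumptions, anyone holding the EPUB also holds the identifier, and therefore the key, so the transformation is reversible by any reader or tool that chooses to reverse it.
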
User-agent-provided obfuscation of arbitrary Web resources would be a step welcomed by some, but would be considered a new and user-hostile part of the Web as a platform. Is there a reason obfuscation is particularly needed for EPUB, or would the same reasoning be expected to apply to the Web generally?

Is obfuscation consistent with ethical web principles, in particular the "view source" principle [5]?

[5] https://www.w3.org/2001/tag/doc/ethical-web-principles/#transparent

## information exposure and fingerprintability

What data about the user's device is revealed and what is the risk of fingerprintability?

This spec appears to define a reading system user agent string `epubReadingSystem`. Is there still a `navigator.userAgent` or similar? Is this a replacement or a duplication of that feature? There is also ongoing work to limit or deprecate user agent strings on the Web platform -- to make it an explicit opt-in rather than always disclosed in great detail. At the very least, we need to recommend that user agent strings have entropy that is strictly limited to what is necessary for debugging and compatibility. And it should be noted that `epubReadingSystem` reveals information about how the reader is reading the book, potentially back to the author/publisher of the book, unless scripting is more strictly limited.

Feature detection is generally preferable, although that could also reveal information about the user's configuration, especially if this list of features is likely to vary based on user configuration/capability as opposed to just which ebook reader is in use. Our Note on mitigating fingerprinting has some relevant advice on how to evaluate severity and apply mitigations [6].

[6] https://www.w3.org/TR/fingerprinting-guidance/

As every EPUB is considered a separate origin, the threat model here is: can an author/publisher of multiple ebooks learn from these configuration characteristics that the same user is reading both of them?
And on information disclosure (perhaps because the ebook publisher already knows the exact customer who purchased that particular copy of that book), does the publisher learn something about the customer's devices or software choices?

## EPUBs and browseable web

A lingering question and concern for me is how interaction will take place between reading an EPUB ebook and browsing the Web more generally. If a reader taps on a link in an ebook, will any information about the ebook (or personal information about the reader who purchased the ebook) be passed on to the target website of that link? Or will there be links between ebooks or from a website into a particular page on an ebook?

Documenting each EPUB as a separate origin for security and privacy purposes is useful. A fear that comes with opacity about DRM, and about what personal information about a reader is embedded in an ebook, is that those identifiers could also leak outside of the context of reading that book, through annotated links, for example, or to embedded iframes/third party resources.

I believe the security and privacy considerations section of the EPUB spec should be very substantially expanded. In particular, it currently considers the privacy of ebook creators who may embed personal information (and I do think that's relevant to think about, and I haven't really covered it in my reviewing), but gives less consideration to threats to the privacy of readers of EPUB books.

Applying the self-review questionnaire on security and privacy [7] to EPUB may not be trivial as some of the characteristics of ebooks are different, but hopefully there can also be useful analogs that we've raised in this discussion. And there may be direct connections, especially if we want ebooks and web sites to be more fluidly accessed in the future. And as always, feedback would be welcome on how to improve that questionnaire or our process of privacy reviews to make it more useful.

[7] https://www.w3.org/TR/security-privacy-questionnaire/
Received on Monday, 25 October 2021 23:52:19 UTC