RE: privacy review notes on EPUB 3.3

Hi Nick,


> These sections are marked as non-normative, but then include lots of normative language. I understand from the conformance sections that you don't intend that text to have the meaning of RFC 2119. However, I am concerned by this as a confusing trend 


Right, W3C has adopted RFC 8174. I can kind of appreciate why, given the contortions mocked by RFC 6919. 😉


More seriously, we’ve tended not to enforce requirements that can’t be tested and/or are going to be implementation specific. Just as an example, an EPUB publication can’t enforce that a publisher obtain consent to collect data prior to a user opening it. We know you should do it, but when we’ve made these sorts of statements normative in the past they’ve been called “aspirational”. Similarly, a reading system should allow users to control any information collected, but there’s no single solution to the problem.


I don’t personally have an issue with these being stated in normative terms, but I don’t have enough experience with the W3C standardization process to know if they’d raise flags as being underspecified if we did.


> The rs and a11y specs note cases where certain behavior might violate privacy laws … I don't think that should be the motivation for protecting user privacy


Right, it’s not even necessary in the accessibility spec to cite privacy laws as it’s not adding anything. I’ll have to go back and see what the RS spec says.


> External resources should be loaded securely, for example over HTTPS.


This sounds like a useful restriction.


Apologies for being selective and brief in my comments, as I’m on the road right now. It would probably help to get your concerns into the github tracker rather than discuss over email. I’ll try and do that when I have a minute.




From: Nick Doty
Sent: April 29, 2022 9:39 PM
Subject: Re: privacy review notes on EPUB 3.3


Hi EPUB folks,

First, an apology, as I noted in a github issue thread: I'm sorry that I was slow to follow up with feedback on the changes y'all made to incorporate privacy and security considerations into these drafts. As an explanation (not justification), after a few calls together, I thought a good start was made and didn't realize you were waiting on further feedback from me. Later, I learned that you were, but didn't have time to get back to it! Sorry for the delay.

I think it's great to see in-depth privacy considerations text in these documents, and the start of a privacy threat model for epub ebooks. I think those sections could be improved (in particular, see notes below), but I'm happy to see the progress made for a large and rich area like ebooks. We are using web technologies, but in a novel (ha) way that involves a different set of actors, distribution models and privacy and security concepts.

For the most part, it looks like we haven't been able to make normative changes to these specifications to address the noted privacy and security concerns. In some areas, I think that should be addressed, now or during the CR period or, if nothing else, in future versions. In some areas, charter restrictions seem to prohibit any substantive changes: I think it's an industry-level problem that we aren't ready to consider direct, necessary ecosystem improvements.

For EPUB Accessibility 1.1, I didn't previously review this or discuss this with my PING colleagues when we talked through this privacy review. Reviewing it now, I agree with the conclusion that this document doesn't add new features with privacy impact. There are still privacy considerations -- and the document includes a meaningful privacy considerations section now. I can provide some similar comments about that section, but I don't think that's blocking.

I don't think any of this is blocking for the transition to Candidate Recommendation (although I'm not totally clear on the current differences between Wide Review prior to and during the CR period).

Thanks for all your work on this topic and I look forward to more together. Protecting the privacy of readers of ebooks is an important project that includes this standardization effort.

(not a chair speaking on behalf of PING, just a PING reviewer following up)

## priv sections across core, rs, a11y

These sections are marked as non-normative, but then include lots of normative language. I understand from the conformance sections that you don't intend that text to have the meaning of RFC 2119. However, I am concerned by this as a confusing trend (not just in EPUB, but in a growing number of specs). It seems like we mean: implementers really need to do X to prevent this privacy harm, but we know implementers aren't going to do X, so we'll make it non-normative and not hold anyone to it. Non-normative text can be very useful to bring awareness to an implementer of some potential implications, but it's not well suited to stating necessary protections without any intention that they will be implemented. Expansive non-normative considerations mean that implementations can claim that they satisfied every privacy and security requirement of the W3C's published EPUB spec, yet a user cannot rely on that assurance (even to the extent that it would, for example, enable certification or incur liability with consumer protection regulators if misrepresented) to conclude that it's safe to use that software.

The rs and a11y specs note cases where certain behavior might violate privacy laws (and so, implicitly, implementers should take care to protect against that behavior). You might be right about that legal assessment for some jurisdictions, but I don't think that should be the motivation for protecting user privacy or the criterion by which we conclude that user privacy should be protected. We can be concerned for user privacy no matter the jurisdiction and even if violating the user's privacy would be legal in some, many or all jurisdictions.

## security

It seems like there's no functionality to provide authenticity (that the contents of the book come from a known source) or integrity (that the contents have not been altered in the delivery to the user). I know there has been some exploration of different web packaging formats that could be applied; I think this should be pursued for the long-term security of EPUB.

Are these secure contexts, as defined in ? I think we might conclude that these are potentially trustworthy in that the files are available on the local filesystem, and there is typically a move to trust those files largely for the convenience of local development. 

External resources should be loaded securely, for example over HTTPS. Otherwise, threats would also include any network attacker when the book is being read (even separate from whether the book contents itself were securely transferred). (This is currently a non-normative recommendation, but seems like a good candidate for a normative recommendation.)
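
As a rough illustration of what a normative version of this recommendation could let a checker test, here is a minimal Python sketch. It is not taken from any EPUB specification: the function name and the set of schemes treated as insecure are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption for illustration: network schemes that provide no transport security.
INSECURE_SCHEMES = {"http", "ftp", "ws"}


def insecure_external_resources(urls):
    """Return the references that would be fetched over an unprotected channel.

    Relative references (no scheme) point inside the package and are skipped,
    as are non-network schemes such as data:.
    """
    flagged = []
    for url in urls:
        scheme = urlsplit(url).scheme.lower()
        if scheme in INSECURE_SCHEMES:
            flagged.append(url)
    return flagged


refs = [
    "https://example.org/a.mp4",
    "http://example.org/b.mp4",
    "images/cover.jpg",
]
print(insecure_external_resources(refs))
```

A check like this only covers the scheme of each reference; it says nothing about whether the book file itself was delivered securely.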

The lack of authentication and integrity protection should be noted in security considerations and the threat model should include actors who modify the contents of the file between the author and the reader, and actors who distribute inauthentic or malicious files by pretending to be the author.

This academic paper came up during a discussion around the privacy review.
"Reading Between the Lines: An Extensive Evaluation of the Security and Privacy Implications of EPUB Reading Systems"
Have we considered all the threats documented in that paper? My impression is that the authors recommend stronger normative language in the specification to make protections more consistent. (I wasn't trying to conduct a thorough security review and would not be the best qualified to do that, but I know enough that it's worth closely reading an academic paper specifically noting security implications!)

## issues previously raised

1871 browseable web: this is addressed in noting following links as separate browsing contexts, and considerations regarding communicating information across links.

1872 fingerprintability: I don't believe this issue is resolved yet. Providing an additional User-Agent string adds substantially to fingerprintability at a time when we are trying to reduce User-Agent entropy. At a minimum, we should note the fingerprinting risk in the rs spec. Normatively, we should be precise about its severity, recommend minimizing unnecessary entropy, and clarify whether it's necessary in addition to the existing User-Agent string. (The core spec notes the risk and suggests non-normatively to content authors not to use it for tracking purposes.)

1873 obfuscation: definite improvement here in noting the risks and limitations, making the functionality not required. I think there's still work to be done in providing clarity on migration to a better method here, but nothing that needs resolution at the moment.

1874 DRM: the threat is noted, but it still seems a glaring issue to me that DRM is a well-known and invasive threat to reader privacy and there are no clear protections against it or any likely directions to mitigate those harms in the future. This may not be in the scope of the currently chartered WG.

1875 interactivity: user-generated content is noted in the non-normative privacy considerations, which is an improvement. I think we need to describe the privacy threat of users not having an understanding of when they're interacting with the content (and potentially having that shared with the content author) and when they're interacting with the reading system, so it's understood with whom the reader is communicating when highlighting or making margin notes.

1876 self-contained packages: non-normative text notes the privacy and security advantages, but doesn't expect that they'll be met. Could the spec normatively define the properties necessary for an epub to be a self-contained book, so that users, reading systems, archivists, etc. could know/test that it's self-contained and would have those privacy properties?
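
To sketch what a testable property might look like, the Python example below checks one necessary (not sufficient) condition: that no manifest item in the package document points at a remote URL. The function names and the sample package document are hypothetical, and a real definition would also have to cover scripts, hyperlinks and other ways content can reach the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace of the EPUB package document (OPF).
OPF_NS = "{http://www.idpf.org/2007/opf}"


def remote_manifest_items(package_xml):
    """Return manifest hrefs that are absolute http(s) URLs."""
    root = ET.fromstring(package_xml)
    remote = []
    for item in root.iter(OPF_NS + "item"):
        href = item.get("href", "")
        if urlsplit(href).scheme in ("http", "https"):
            remote.append(href)
    return remote


def looks_self_contained(package_xml):
    """True when the manifest lists no remote resources (illustrative check only)."""
    return not remote_manifest_items(package_xml)


# Hypothetical package document used only for this example.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="v1" href="https://example.org/video.mp4" media-type="video/mp4"/>
  </manifest>
</package>"""

print(remote_manifest_items(SAMPLE))
print(looks_self_contained(SAMPLE))
```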

## other notes

In rs:
> Reading systems that allow users to store data MUST ensure they do not make that data available to other unrelated documents

I think "unrelated" is undefined; the spec makes clear that there isn't a reliable way to determine that documents are related. I think you just mean other documents.

Received on Wednesday, 4 May 2022 21:52:17 UTC