Re: privacy review notes on EPUB 3.3

Hi EPUB folks,

First, an apology, as I noted in a github issue thread: I'm sorry that I
was slow to follow up with feedback on the changes y'all made to
incorporate privacy and security considerations into these drafts. As an
explanation (not justification), after a few calls together, I thought a
good start was made and didn't realize you were waiting on further feedback
from me. Later, I learned that you were, but didn't have time to get back
to it! Sorry for the delay.

I think it's great to see in-depth privacy considerations text in these
documents, and the start of a privacy threat model for EPUB ebooks. I think
those sections could be improved (in particular, see notes below), but I'm
happy to see the progress made for a large and rich area like ebooks. We
are using web technologies, but in a novel (ha) way that involves a
different set of actors, distribution models and privacy and security
concepts.

For the most part, it looks like we haven't been able to make normative
changes to these specifications to address the noted privacy and security
concerns. In some areas, I think that should be addressed, now or during
the CR period or, if nothing else, in future versions. In some areas,
charter restrictions seem to prohibit any substantive changes: I think it's
an industry-level problem that we aren't ready to consider direct,
necessary ecosystem improvements.

For EPUB Accessibility 1.1, I didn't previously review this or discuss this
with my PING colleagues when we talked through this privacy review.
Reviewing it now, I agree with the conclusion that this document doesn't
add new features with privacy impact. There are still privacy
considerations -- and the document includes a meaningful privacy
considerations section now. I can provide some similar comments about that
section, but I don't think that's blocking.

I don't think any of this is blocking for the transition to Candidate
Recommendation (although I'm not totally clear on the current differences
between Wide Review prior to and during the CR period).

Thanks for all your work on this topic and I look forward to more together.
Protecting the privacy of readers of ebooks is an important project that
includes this standardization effort.

Cheers,
Nick
(not a chair speaking on behalf of PING, just a PING reviewer following up)

## priv sections across core, rs, a11y

These sections are marked as non-normative, but then include lots of
normative language. I understand from the conformance sections that you
don't intend that text to have the meaning of RFC 2119. However, I am
concerned by this as a confusing trend (not just in EPUB, but in a growing
number of specs). It seems like what we mean is: implementers really need
to do X to prevent this privacy harm, but we know implementers aren't going
to do X, so we'll make it non-normative and not hold anyone to it.
Non-normative text can be very useful for making an implementer aware of
potential implications, but it's not well suited to describing necessary
protections that no one intends to see implemented. Expansive
non-normative considerations mean that an implementation can claim to
satisfy every privacy and security requirement of the W3C's published EPUB
spec, and yet a user cannot rely on that claim (even to the extent that it
would, for example, enable certification or incur liability with consumer
protection regulators if misrepresented) to conclude that the software is
safe to use.

The rs and a11y specs note cases where certain behavior might violate
privacy laws (and so, implicitly, implementers should take care to protect
against that behavior). You might be right about that legal assessment for
some jurisdictions, but I don't think that should be the motivation for
protecting user privacy or the criterion by which we conclude that user
privacy should be protected. We can be concerned for user privacy no matter
the jurisdiction and even if violating the user's privacy would be legal in
some, many or all jurisdictions.

## security

It seems like there's no functionality to provide authenticity (that the
contents of the book come from a known source) or integrity (that the
contents have not been altered in the delivery to the user). I know there
has been some exploration of different web packaging formats that could be
applied; I think this should be pursued for the long-term security of EPUB.

Are these secure contexts, as defined in
https://www.w3.org/TR/secure-contexts/ ? I think we might conclude that
these are potentially trustworthy because the files are available on the
local filesystem; there is typically a move to trust such files, largely
for the convenience of local development.

External resources should be loaded securely, for example over HTTPS.
Otherwise, threats would also include any network attacker when the book is
being read (even separate from whether the book contents themselves were
securely transferred). (This is currently a non-normative recommendation,
but seems like a good candidate for a normative recommendation.)
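
To make the HTTPS point concrete, here is a rough sketch of a check that a
reading system or validator could run. It assumes remote resources are
declared in the package document manifest, and it only looks at manifest
`href` values; the function name and the trimmed-down sample package
document are mine, not from the spec.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

OPF_NS = "{http://www.idpf.org/2007/opf}"


def insecure_remote_hrefs(opf_xml: str) -> list[str]:
    """Return manifest hrefs that are absolute URLs with a scheme other than https."""
    root = ET.fromstring(opf_xml)
    insecure = []
    for item in root.iter(f"{OPF_NS}item"):
        href = item.get("href", "")
        # Relative hrefs have no scheme and point inside the EPUB container.
        scheme = urlsplit(href).scheme
        if scheme and scheme != "https":
            insecure.append(href)
    return insecure


# Hypothetical, incomplete package document used only for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="v1" href="http://example.com/video.mp4" media-type="video/mp4"/>
    <item id="a1" href="https://example.com/audio.mp3" media-type="audio/mpeg"/>
  </manifest>
</package>"""

print(insecure_remote_hrefs(SAMPLE_OPF))
```

This does not catch network requests made by scripts or links followed by
the reader; it only illustrates what a testable requirement might look like.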

The lack of authentication and integrity protection should be noted in
security considerations and the threat model should include actors who
modify the contents of the file between the author and the reader, and
actors who distribute inauthentic or malicious files by pretending to be
the author.

This academic paper came up during a discussion around the privacy review.
"Reading Between the Lines: An Extensive Evaluation of the Security and
Privacy Implications of EPUB Reading Systems"
https://lirias.kuleuven.be/retrieve/616428
Have we considered all the threats documented in that paper? My impression
is that the authors recommend stronger normative language in the
specification to make protections more consistent. (I wasn't trying to
conduct a thorough security review and would not be the best qualified to
do that, but I know enough that it's worth closely reading an academic
paper specifically noting security implications!)

## issues previously raised

1871 browseable web: this is addressed by noting that following a link
leads to a separate browsing context, and by the considerations regarding
communicating information across links.

1872 fingerprintability: I don't believe this issue is resolved yet.
Providing an additional User-Agent string adds substantially to
fingerprintability at a time when we are trying to reduce User-Agent
entropy. At a minimum, we should note the fingerprinting risk in the rs
spec. Normatively, we should be precise about its severity, recommend
minimizing unnecessary entropy, and clarify whether it's necessary in
addition to the existing User-Agent string. (The core spec notes the risk
and suggests non-normatively to content authors not to use it for tracking
purposes.)
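
As a rough illustration of what "adds to entropy" means here, the sketch
below computes surprisal: the number of bits of identifying information
revealed by a value that only some users share. The user counts are made up
for the example and are not measurements of any real population.

```python
import math


def surprisal_bits(matching_users: int, total_users: int) -> float:
    """Bits of identifying information revealed by a value shared by
    `matching_users` out of `total_users`."""
    if not 0 < matching_users <= total_users:
        raise ValueError("need 0 < matching_users <= total_users")
    return -math.log2(matching_users / total_users)


# Hypothetical population, for illustration only.
TOTAL = 1_048_576

# Users sharing the same browser User-Agent string.
browser_only = surprisal_bits(1_024, TOTAL)

# Users sharing that string AND the same additional reading system string.
with_reading_system = surprisal_bits(64, TOTAL)

print(f"{browser_only:.1f} bits, {with_reading_system:.1f} bits")
```

The smaller the group that shares a given pair of strings, the more bits
the extra string contributes, which is why its necessity is worth spelling
out.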

1873 obfuscation: definite improvement here in noting the risks and
limitations, making the functionality not required. I think there's still
work to be done in providing clarity on migration to a better method here,
but nothing that needs resolution at the moment.

1874 DRM: the threat is noted, but it still seems a glaring issue to me
that DRM is a well-known and invasive threat to reader privacy and there
are no clear protections against it or any likely directions to mitigate
those harms in the future. This may not be in the scope of the currently
chartered WG.

1875 interactivity: user-generated content is noted in the non-normative
privacy considerations, which is an improvement. I think we need to
describe the privacy threat of users not having an understanding of when
they're interacting with the content (and potentially having that shared
with the content author) and when they're interacting with the reading
system, so it's understood with whom the reader is communicating when
highlighting or making margin notes.

1876 self-contained packages: non-normative text notes the privacy and
security advantages, but doesn't expect that they'll be met. Could the spec
normatively define the properties necessary for an epub to be a
self-contained book, so that users, reading systems, archivists, etc. could
know/test that it's self-contained and would have those privacy properties?
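
A sketch of what such a test could look like, under my own guess at a
definition: no manifest item points outside the container, and no item
declares the `remote-resources` property. Both the definition and the
function name are assumptions for illustration, not anything the spec
currently states.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

OPF_NS = "{http://www.idpf.org/2007/opf}"


def looks_self_contained(opf_xml: str) -> bool:
    """Hypothetical check: True if the manifest declares no remote resources."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(f"{OPF_NS}item"):
        # An href with a URL scheme points outside the EPUB container.
        if urlsplit(item.get("href", "")).scheme:
            return False
        # A content document that references remote resources is flagged
        # with the "remote-resources" manifest property.
        if "remote-resources" in item.get("properties", "").split():
            return False
    return True


# Trimmed-down, hypothetical package documents for illustration.
LOCAL_ONLY = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""

WITH_REMOTE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"
          properties="remote-resources"/>
  </manifest>
</package>"""

print(looks_self_contained(LOCAL_ONLY), looks_self_contained(WITH_REMOTE))
```

A real definition would also need to cover scripted network access and
outbound hyperlinks, which a manifest scan cannot see.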

## other notes

In rs:
> Reading systems that allow users to store data MUST ensure they do not
> make that data available to other unrelated documents

I think "unrelated" is undefined; the spec makes clear that there isn't a
reliable way to determine that documents are related. I think you just mean
other documents.

Received on Saturday, 30 April 2022 01:39:52 UTC