privacy review notes on EPUB 3.3 from Nick Doty on 2021-10-25 (public-privacy@w3.org from October to December 2021)

From: Nick Doty <ndoty@cdt.org>
Date: Mon, 25 Oct 2021 19:50:40 -0400
To: public-epub-wg@w3.org, public-privacy@w3.org
Message-ID: <CA+tYtvHSYnRbWY15bqdxS2DCvfPtRm6x5_XdGa9_fdbwDoXhcA@mail.gmail.com>
We had a useful, interesting discussion at a recent PING call [0]
where I tried to walk through some questions and some initial privacy
concerns of EPUB 3.3 [1]. I've written those up more directly here to
facilitate further discussion.

This write-up is belated, but I hope it will be useful for the joint meeting tomorrow:
26 October 2021, 15:00 - 16:00 UTC
https://www.w3.org/events/meetings/2e397c32-b5e1-4f5b-8789-741d09c285ed

These comments fall into roughly five categories: self-contained
packages; interactivity; DRM; information exposure and
fingerprintability; ebooks and the browseable web. As is my style,
I've written these primarily as questions. Apologies that this is long
and that some of this is repeating questions from our earlier
conversation.

Cheers,
Nick Doty
CDT

[0] https://www.w3.org/Privacy/IG/summaries/PING-minutes-20210916

[1] https://w3c.github.io/epub-specs/epub33/core/
I looked primarily at this draft of the core spec, but maybe the
reading systems or other associated specs would also be relevant.


## self-contained packages

Self-contained packages have potentially huge privacy advantages, but
it's not clear that the EPUB spec or current implementations fulfill
these opportunities. Is that a goal that the community could work
towards?

The privacy advantages for readers are that self-contained packages
could have the same contents for all readers, be consumed and shared
with others offline and not reveal data about reading habits to
authors, to publishers, to retailers, to reading system vendors or to
other parties. That would have many analogs to the privacy of reading
a physical book, like one checked out from the library, purchased at a
book store or borrowed from a friend.

The current spec anticipates and requires (at least a SHOULD) loading
of remote resources from arbitrary origins. This introduces risks of
additional data collection about who is reading what book (and from
where), and what part of the book is being read at a particular moment
(depending on the implementation or requirements on how remote
resources are loaded). The spec should also make explicit that, with
remote resource loading, the author/publisher of the book may
effectively be collecting data on the reader's habits, in addition to
the reading system. Different levels of scripting access are defined, but it's not
clear whether any such level would indicate that user reading behavior
would not be disclosed.

It would be useful to specify a privacy threat model specific to EPUB,
to the extent that it varies from the Web. Can we guarantee that
reading habits will not be surveilled, by the publisher, the retailer,
the reading system, or other parties? Or if that data is revealed,
then we should clarify to whom or under what conditions. Book-like
privacy could be achieved, but would require significant changes from
the current spec and current popular implementations.

The spec suggests that the manifest is an exhaustive pre-stated list
of resources, including remote resources, but it's not clear how
that's intended to be handled by reading systems. Should a reading
system refuse to fetch any remote resource not included in the
manifest? Are remote resources intended to be the same for all readers
of a book, or might they be personalized?
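
To make the first of these questions concrete, here is a minimal
sketch of one possible policy, not something the spec currently
requires. It assumes a reading system that parses the package
document, collects the absolute http(s) URLs declared in manifest
`item` elements, and refuses any other remote fetch. The names
`declared_remote_resources` and `may_fetch`, and the sample package
document, are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by EPUB package documents.
OPF_NS = "{http://www.idpf.org/2007/opf}"

# Hypothetical package document, trimmed to the manifest for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="audio1" href="https://media.example.org/ch1.mp3" media-type="audio/mpeg"/>
  </manifest>
</package>"""


def declared_remote_resources(opf_xml):
    """Return the set of absolute http(s) URLs listed in the manifest."""
    root = ET.fromstring(opf_xml)
    remote = set()
    for item in root.iter(OPF_NS + "item"):
        href = item.get("href", "")
        if urlsplit(href).scheme in ("http", "https"):
            remote.add(href)
    return remote


def may_fetch(url, allowed):
    """Assumed policy: only fetch remote resources declared up front."""
    return url in allowed


allowed = declared_remote_resources(SAMPLE_OPF)
print(may_fetch("https://media.example.org/ch1.mp3", allowed))
print(may_fetch("https://tracker.example.com/pixel.gif", allowed))
```

Exact string matching is the strictest reading of "exhaustive"; it
would also rule out per-reader query strings, which bears on the
personalization question.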

Remote resources loaded over insecure transports would risk exposing
more detailed reading behavior to parties on the network. Loading
resources over insecure transports also risks the integrity of the
EPUB's contents: books could be altered in ways that are harmful to
the reader, or (in the case of scripts) personal data about the reader
could be further disclosed.
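
As a rough illustration of how a checker or reading system might flag
this risk, the sketch below lists manifest hrefs that would be fetched
over plaintext HTTP. The function name `insecure_remote_resources` and
the example URLs are hypothetical, and whether such resources should be
blocked, upgraded or merely reported is left open.

```python
from urllib.parse import urlsplit


def insecure_remote_resources(hrefs):
    """Return the hrefs that would be fetched over plaintext HTTP."""
    return [h for h in hrefs if urlsplit(h).scheme == "http"]


# Hypothetical manifest hrefs: one local, one HTTPS, one plaintext HTTP.
hrefs = [
    "chapter1.xhtml",
    "https://media.example.org/ch1.mp3",
    "http://fonts.example.net/serif.woff2",
]
print(insecure_remote_resources(hrefs))
```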

## interactivity

What is the model of interactivity? How should users know or control
with whom they are interacting?

Is integrity or authenticity provided? How does the reader know who
authored an epub, and confirm that it wasn't altered? (The Web makes
use of HTTPS, the origin concept and Certificate Authorities, among
other technology, to provide users with confidence about integrity and
authenticity.)

Do digital signatures as defined in the spec provide integrity or
authenticity of a book? To what extent does that match the guarantees of
the web model (a known origin, no mixed content, confidentiality of
communication contents)? Would ongoing work on signed exchanges be
helpful?

Do epubs allow entry of user-generated text? Does that text remain
local? How does a user distinguish between interactivity that is
provided by the reading system and interactivity that is provided by
the book itself? When are they communicating with which piece of
software? Annotations -- including highlights, margin notes, answers
to in-book surveys, etc. -- can reveal very sensitive information that
a reader might not wish to disclose to anyone else.

Do reading systems distinguish chrome in a way that provides security
to the end user? Do ebooks typically display at full screen? Can they
mimic websites and phish users? Our Web privacy guidance typically
includes questions about "native UI" to cover cases like these: if
there is no distinction between the UI provided by the user agent and
the UI of the browseable content itself, then an interactive web site
(or ebook) can effectively pretend to be a different site, and phish
user credentials, for example. This would be a particular concern if
epub reading functionality was provided by a web browser and users
became accustomed to clicking links in ebooks to continue browsing
elsewhere.

## privacy and drm

What are the privacy impacts of DRM and how can they be mitigated?
EPUBs are regularly distributed (after purchase, or perhaps more
accurately, paying for a limited license to read) with the contents
fully encrypted and mediated by a DRM system (in particular, Adobe
Digital Editions).

Public reporting indicates that EPUB DRM includes collection of unique
identifiers tied to the user's hardware, as well as personally
identifying information about the user (like an email address), and
then makes regular updates on where the user reads the book, how much
of it they've read, how much time they've spent reading, how much of the text
they might have copied or printed, and that this data is regularly
sent home without the user's understanding or control, perhaps over
insecure transports. Maybe some of these issues have been subsequently
addressed or mitigated, but it's not clear how users could have
confidence in that.

DRM and DRM licensing also effectively prevent user transparency into
the privacy properties of the EPUB file itself: users cannot inspect
the EPUB files that they read, cannot use open source software to
analyze them, cannot customize software to limit how resources are
loaded or how information about them is shared.

Direct, opaque surveillance of user reading behavior tied to unique
identifiers and personal information, combined with a lack of
capabilities for user transparency or control -- this would be a dire
precedent for Web privacy and is inconsistent with the existing Web
privacy model. CDT raised questions about the privacy impacts of DRM
starting 15 years ago [2], and the current state does not seem to
support user privacy.

[2] https://cdt.org/wp-content/uploads/copyright/20060907drm.pdf
See in particular: "transparency" and "collateral impact"

While this spec does not directly standardize DRM of EPUB files, it
does design the file and package format in order to facilitate the DRM
encryption of every file within the EPUB package. From the design of
this specification, users cannot typically look at the table of
contents or the stylesheet of a purchased ebook.

When W3C has considered DRM in extensions to HTML, there have been specific
measures taken to reduce the harm [3], including trying to limit the
scope and mitigate the privacy and security problems. Those
mitigations have included limits on scope of identifiers, making data
clearable, prohibiting network access, limiting access to identifiers
from applications, and user transparency and consent.

[3] https://www.w3.org/TR/encrypted-media/#privacy

For what it's worth -- this isn't specifically privacy-related, but
can have downstream effects -- DRM as applied to EPUB is currently
dramatically limiting interoperability for end users. I have been
unable to buy a book and then move it to another device and another
ebook reading system, and I'm not aware of anyone else who has done
this successfully; users are locked in to the app or device of their
bookstore. This finding from the TAG seems relevant [4]: "The constant
competition and variety of choices that come from having multiple
interoperable implementations means the web ecosystem is constantly
improving."

[4] https://www.w3.org/2001/tag/doc/ethical-web-principles/#multi

### obfuscation

What value is provided by this obfuscation algorithm? Can this feature
be marked at risk, given the uncertainty about whether it satisfies
any vendor goals while increasing the complexity of the spec and
making the source less easily inspectable by the reader?

Why does the creation of the obfuscation key based on the SHA-1 hash
function include a SHOULD requirement rather than a MUST? This
relaxation seems primarily to decrease interoperability.
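
For reference, a minimal sketch of the key derivation and the
obfuscation step. It rests on assumptions about the algorithm that
should be checked against the spec text: the key is the SHA-1 digest of
the UTF-8 unique identifier with XML whitespace removed, and only the
first 1040 bytes of the resource are XORed with the 20-byte key,
repeating the key as needed. The function names and the example
identifier are hypothetical.

```python
import hashlib

XML_WHITESPACE = " \t\r\n"    # U+0020, U+0009, U+000D, U+000A
OBFUSCATED_PREFIX_LEN = 1040  # assumed length of the obfuscated prefix


def obfuscation_key(unique_identifier):
    """SHA-1 digest of the identifier with XML whitespace stripped."""
    stripped = "".join(
        ch for ch in unique_identifier if ch not in XML_WHITESPACE
    )
    return hashlib.sha1(stripped.encode("utf-8")).digest()


def toggle_obfuscation(data, key):
    """XOR the prefix with the repeating key; the rest is left untouched."""
    prefix = bytes(
        b ^ key[i % len(key)]
        for i, b in enumerate(data[:OBFUSCATED_PREFIX_LEN])
    )
    return prefix + data[OBFUSCATED_PREFIX_LEN:]


key = obfuscation_key("urn:uuid:example-id")  # hypothetical identifier
font = bytes(range(256)) * 5                  # stand-in for a font file
obfuscated = toggle_obfuscation(font, key)
restored = toggle_obfuscation(obfuscated, key)
```

Because XOR is its own inverse and the identifier travels inside the
package, the same function both obfuscates and restores the resource,
provided both sides derive the key the same way.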

The obfuscation section contains no requirements on reading systems.
Maybe they are just implicitly supposed to de-obfuscate these
resources in order to render the book as intended. Are reading systems
expected not to provide unobfuscated access to these obfuscated files
to users?

Is obfuscation limited to font files? Is there a reason that other Web
technologies for fonts cannot be used?

User-agent-provided obfuscation of arbitrary Web resources would be a
step welcomed by some, but would be considered a new and user-hostile
part of the Web as a platform. Is there a reason obfuscation is
particularly needed for EPUB, or would the same reasoning be expected
to apply to the Web generally? Is obfuscation consistent with ethical
web principles, in particular the "view source" principle [5]?

[5] https://www.w3.org/2001/tag/doc/ethical-web-principles/#transparent

## information exposure and fingerprintability

What data about the user's device is revealed, and what is the risk of fingerprintability?

This spec appears to define a reading system user agent string
`epubReadingSystem`. Is there still a `navigator.userAgent` or
similar? Is this a replacement or a duplication of that feature? There
is also ongoing work to limit or deprecate user agent strings on the
Web platform -- to make it an explicit opt-in rather than always
disclosed in great detail. At the very least, we need to recommend
that user agent strings have entropy that is strictly limited as
necessary for debugging and compatibility. And it should be noted that
`epubReadingSystem` reveals information about how the reader is
reading the book, potentially back to the author/publisher of the
book, unless scripting is more strictly limited.

Feature detection is generally preferable, although that could also
reveal information about the user's configuration, especially if this
list of features is likely to vary based on user
configuration/capability as opposed to just which ebook reader is in
use. Our Note on mitigating fingerprinting has some relevant advice on
how to evaluate severity and apply mitigations [6].
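
One common way to reason about severity is to estimate how many bits
of identifying information a disclosed value carries. The sketch below
is illustrative only: the population of reading system strings is
invented, and `surprisal_bits` and `entropy_bits` are hypothetical
helper names rather than anything defined by the guidance.

```python
import math
from collections import Counter


def surprisal_bits(value, observed):
    """Bits revealed when one reader discloses `value`."""
    counts = Counter(observed)
    return -math.log2(counts[value] / len(observed))


def entropy_bits(observed):
    """Average bits revealed across the observed population."""
    counts = Counter(observed)
    total = len(observed)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Made-up population of name/version strings, for illustration only.
population = ["ReaderA/1.0", "ReaderA/1.0", "ReaderB/2.0", "ReaderC/3.1"]

common = surprisal_bits("ReaderA/1.0", population)  # shared by half the readers
rare = surprisal_bits("ReaderB/2.0", population)    # unique in this sample
average = entropy_bits(population)
```

Rarer values carry more bits, so a detailed version string or a
configuration-dependent feature list identifies a reader more precisely
than a coarse product name does.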

[6] https://www.w3.org/TR/fingerprinting-guidance/

As every EPUB is considered a separate origin, the threat model here
is: can an author/publisher of multiple ebooks learn from these
configuration characteristics that the same user is reading both of
them? And on information disclosure (perhaps because the ebook
publisher already knows the exact customer who purchased that
particular copy of that book), does the publisher learn something
about the customer's devices or software choices?

## EPUBs and browseable web

A lingering question and concern for me is how interaction will take
place between reading an EPUB ebook and browsing the Web more
generally. If a reader taps on a link in an ebook, will any
information about the ebook (or personal information about the reader
who purchased the ebook) be passed on to the target website of that
link? Or will there be links between ebooks or from a website into a
particular page on an ebook?

Documenting each EPUB as a separate origin for security and privacy
purposes is useful. A fear of having opacity about DRM and what
personal information about a reader is embedded in an ebook is that
those identifiers could also leak outside of the context of reading
that book, through annotated links, for example, or to embedded
iframes/third party resources.

I believe the security and privacy considerations section of the EPUB
spec should be very substantially expanded. In particular, currently
it considers the privacy of ebook creators who may embed personal
information (and I do think that's relevant to think about, and I
haven't really covered it in my reviewing), but it gives less
consideration to threats to the privacy of readers of EPUB books.

Applying the self-review questionnaire on security and privacy [7] to
EPUB may not be trivial as some of the characteristics of ebooks are
different, but hopefully there can also be useful analogs that we've
raised in this discussion. And there may be direct connections
especially if we want ebooks and web sites to be more fluidly accessed
in the future.

And as always, feedback would be welcome on how to improve that
questionnaire or our process of privacy reviews to make them more
useful.

[7] https://www.w3.org/TR/security-privacy-questionnaire/
Received on Monday, 25 October 2021 23:51:58 UTC