Feedback on WP spec from Edge team from Ben Walters (CPE PARIS) on 2018-04-12 (public-publ-wg@w3.org from April 2018)

From: Ben Walters (CPE PARIS) <Ben.Walters@microsoft.com>
Date: Thu, 12 Apr 2018 12:09:12 +0000
To: W3C Publishing Working Group <public-publ-wg@w3.org>
CC: Mustapha Lazrek <mustlaz@microsoft.com>
Message-ID: <VI1PR83MB0160F8D0357F9528DF9EBC06ECBC0@VI1PR83MB0160.EURPRD83.prod.outlook.com>

We've reached out to our colleagues in the Edge web platform team so that we can provide an initial round of feedback on the Web Publications spec taking into account both the platform / standards and the Edge books teams (which Mustapha and I belong to). We'll drill into more details in some of these areas in individual issues on GitHub.

Thanks,
Ben

It's not clear from the beginning who is expected to implement the spec. Are user agents web browsers only? Are user agents also book reading apps? Can user agents be web sites, extensions, or progressive web apps themselves? If it's all of the above, are there different requirements for web browsers vs. the others? It's difficult to review certain technical details without establishing this context. For the rest of this feedback, we're assuming that the user agent == browser.

Abstract
* The Spec starts on the assumption that "user agents can provide user experiences well-suited to reading publications, such as sequential navigation and offline reading."
* The readability aspect of this makes sense: "Reading Mode", as a browser feature, knows how to interpret some metadata on an HTML page that marks up its related contents (e.g., <link rel=next>) to assemble these resources into a convenient reading experience. It also acts as a content filter, trying to identify the main content to read... and tends to exclude ads.
* Offline reading, however, is not a feature that browser user agents can do well--the offline experience of a web page/application (and its complications and maintenance) is the responsibility of the resource itself (e.g., to acquire, install, and cache its resources). Simple "Save As" browser actions rarely do the right thing, as what it means to be "offline" is very site-specific.

Introduction
* The introduction is too editorial and even sounds adversarial to stakeholders of the open web platform. It should probably be altered to be more grounded in reality. For example:
  * Life on the web is mostly 'doom and gloom' for resources. (Neither web resources nor printed material have feelings.)
  * Web pages are painted as 'lacking cohesion'. (This is only a function of the web publisher's design, not a de facto flaw of the web as a publishing system.)

* What's the real problem here? Is it that there isn't a machine understandable way for web browsers to understand the structure of a long-form document and provide uniform, content-agnostic navigation on top of that? Or something else? And what are the primary scenarios? Scholarly publications? Free books on the web? What about pay walls? Without a clear motivation, it's difficult to make more detailed technical decisions.
* It is assumed that the user is the only entity that can interpret the cohesion between webpages, but that's not really true. The web was originally modeled after paper and documents, and as such has some *old*--and rarely used anymore, but available--machine-readable bindings between pages--namely the <link> element or 'Link' HTTP header (not for hyperlinks, but for relationships between pages), indicating things like the published work's license and the next and previous pages in the series <https://html.spec.whatwg.org/multipage/links.html#linkTypes>. If those bindings don't work today, why not?
* The text implies that tables of contents and indexes are a feature that differentiates 'the traditional publishing model' from the web. That's clearly not true in the absolute, as this and other specs have both a TOC and index-like sections. What problem does that not solve?
* Implying that single-page publications must be slow isn't correct. It may be true that seeking to the middle of a long and complex document may be slow, but basic rendering of less complex content isn't necessarily slow, even if it's long. Again, the spec should be more specific if there are use cases that really don't work on the web today.
* 'multi-page publications cannot be easily taken offline because their common thread cannot be established' is not really true, either historically with things like the <link> tag or using service workers today.
* Why not evolve the existing Reading view features already available?
* In general, there is a lack of recognition of real prior art. The spec should mention at least the obvious cases, like link tags and reading view, and explain why they're not adequate to the task.
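
To make the <link>-based bindings mentioned above concrete, here is a minimal sketch of how a tool or user agent might read them from a page. It uses only Python's standard html.parser module. The class name, function name, and sample markup are hypothetical illustrations, not anything defined by the Web Publications draft or by any browser's Reading view.

```python
from html.parser import HTMLParser


class LinkRelCollector(HTMLParser):
    """Collects href values of <link> elements, keyed by rel token."""

    def __init__(self):
        super().__init__()
        self.relations = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens; compare case-insensitively
        for rel in (attr.get("rel") or "").lower().split():
            self.relations.setdefault(rel, []).append(href)


def page_relations(html):
    """Return a dict mapping each rel token to the list of hrefs found."""
    collector = LinkRelCollector()
    collector.feed(html)
    return collector.relations


# Hypothetical page in the middle of a multi-page work
sample = """<html><head>
<link rel="prev" href="chapter1.html">
<link rel="next" href="chapter3.html">
<link rel="license" href="https://example.org/license">
</head><body><p>Chapter 2</p></body></html>"""

relations = page_relations(sample)
print(relations.get("next"))
```

Following the "next" relation page by page would recover a linear reading order without a separate manifest. Whether that is sufficient for the use cases the spec has in mind is exactly the open question raised above.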

What is Web Publication?
* The manifest-based approach could make sense for very static content, but it seems that web publications also want to make use of the dynamic nature of the web. How does that work? If I take a web publication offline, can it be updated? Do I need to re-download the whole thing again? How does versioning work?
* Figure 1 shows how the parts are related. HTML documents (which are viewed as resources listed in the infoset) may include other resources not listed in the infoset. Are these things not considered part of the publication? What is to be done about them when using the manifest to take the publication offline?
* Is an offline WP a PWP? Or is that a separate concept?
* Who is responsible for taking the WP offline? The book author or the browser? Or both?

1.3 Scope
* The spec seems expressly interested in having the web publication stay "live" rather than referencing dated publications--so changes are expected. For publications taken offline, is there an expectation that they take updates?

1.4 Relationship to App Manifest
* We've had some mixed feedback internally on adopting the WAM. If it makes sense for web publications to act like progressive web apps (installed side by side, responsible for their own offline experience, etc.), then adopting the WAM clearly makes sense. If a core tenet of WP is that publications only contain data and not code, then this just doesn't work: there doesn't seem to be much overlap between web publication requirements and WAM, and adopting things like link rel=manifest may cause more trouble than it is worth if user agents can be things like extensions.

3.2 Infoset requirements
* Accessibility compliance information: OK, but what does it really mean? How are user agents expected to use it? Can we trust what the web publication says? What's the format? Is there free-form text?
* Address / Canonical identifier: Can we have concrete examples? If there really is need for a canonical identifier that must map to a URL, it should just be a URL.
* How does versioning work?
* Privacy policies: Why would a web publication need a privacy policy when a regular web page does not (web sites usually have privacy policies, not individual pages)? What if I copy a web publication, do I copy the privacy policy with it?
* Either the title should be required or we should specify a "bad" fallback to the URL. We should avoid giving user agents flexibility to create a good experience out of bad data, as it then encourages tool authors to allow bad data to be created that may not work across different user-agents.
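
As an illustration of the "bad" fallback suggested for the title, the rule could be as small as the sketch below. The "title" key and the function name are assumptions made for this example only; the draft does not define them in this form.

```python
def display_title(manifest, publication_url):
    """Return the declared title, or the publication URL if none is usable.

    Assumes a hypothetical manifest dict with an optional "title" key.
    The fallback is deliberately plain: every user agent would show the
    same thing for the same (bad) data instead of inventing a nicer title.
    """
    title = manifest.get("title")
    if isinstance(title, str) and title.strip():
        return title.strip()
    return publication_url


print(display_title({"title": "Moby-Dick"}, "https://example.org/moby-dick/"))
print(display_title({}, "https://example.org/moby-dick/"))
```

A single specified fallback like this leaves no room for user agents to diverge, which is the point of the comment above.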

3.3.3. Canonical Identifier
* We agree with the premise of Issue 58<https://github.com/w3c/wpub/issues/58>--this doesn't seem necessary and is handled by multiple other features of the web--e.g., HTTP redirects for one.

3.3.7 Dates...
* What format?

3.4 Structural properties

3.4.1 Default Reading Order

* We need to simplify what we expect user agents to do and reduce the flexibility and complexity of web publications. From our perspective, everything required to render the web publication should be included in a manifest. It should be the role of tools to be able to extract data from HTML, not reading systems.

3.4.2 Resource list

* If taking the web publication offline... Any resource in the default reading order must also be included in the resource list. User agents should ignore the resources not listed in the resource list.
* Issue 59: Our preference is the third proposal from Matt: requiring the publisher to duplicate the resources across lists in the manifest. Again, keep the user agents simple and consistent, which means specifying independent lists if they serve different purposes.
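
The two rules proposed above (every reading-order entry must also appear in the resource list, and user agents ignore anything not in the resource list) are simple to check mechanically. The sketch below assumes both lists are plain lists of URL strings. The function names and sample data are hypothetical, and the draft's actual manifest shape may differ.

```python
def missing_from_resource_list(reading_order, resources):
    """Return reading-order entries that the resource list fails to include."""
    listed = set(resources)
    return [url for url in reading_order if url not in listed]


def offline_candidates(requested, resources):
    """Keep only requested URLs that the resource list declares; ignore the rest."""
    listed = set(resources)
    return [url for url in requested if url in listed]


# Hypothetical manifest data
reading_order = ["cover.html", "chapter1.html", "chapter2.html"]
resources = ["cover.html", "chapter1.html", "style.css"]

# An authoring tool could reject the manifest if this list is non-empty
print(missing_from_resource_list(reading_order, resources))

# A user agent taking the publication offline would skip undeclared resources
print(offline_candidates(["chapter1.html", "ads.js", "style.css"], resources))
```

Keeping the lists independent and fully specified means the check sits in authoring tools, and the user agent only ever does a set lookup.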

3.4.3 Table of Contents

* Same as above. Why enforce a specific subset of HTML? Let tools do that and separate the concepts of inline human-readable and formatted table of contents and the user-agent-readable table of contents.

Received on Thursday, 12 April 2018 12:09:43 UTC