Re: Identifying a book on the Web today from Baldur Bjarnason on 2017-08-03 (public-publ-wg@w3.org from August 2017)

From: Baldur Bjarnason <baldur@rebus.foundation>
Date: Thu, 3 Aug 2017 11:21:14 -0400
To: Ivan Herman <ivan@w3.org>
Cc: Benjamin Young <byoung@bigbluehat.com>, David Wood <david.wood@ephox.com>, MURATA Makoto <eb2m-mrt@asahi-net.or.jp>, W3C Publishing Working Group <public-publ-wg@w3.org>
Message-Id: <8D3C51CD-6442-4A93-A811-42F81A803849@rebus.foundation>
> On 3 Aug 2017, at 05:54, Ivan Herman <ivan@w3.org> wrote:
> 
> [Admin comment: if we think this issue is a genuine discussion to have, we should open a github issue, in this case probably at the global level, ie, on publ-wg. Referring back to discussion threads on a mailing list is WAY more difficult later (say, in 2 years) than it is on github, and we will make our life much easier if we do it that way. Baldur, if you agree, adding a new issue would be a good idea, with a cut-and-paste of your text.]

I’m not clear on what the division of labour between GitHub issues and the mailing list is supposed to be. The issues are pretty darn inaccessible, often with dozens upon dozens of replies, most of them pretty repetitive. It has honestly become quite intimidating.

The email I sent is a high level overview of the topic (with references, because we aren’t working in a vacuum) and specifically avoids taking sides on implementation issues, so I’m not sure if it fits in with the current style of discussion in the GitHub issues. If you think this belongs on GitHub, point me at the right place and I’ll copy paste my email either as a new issue or as a reply to a pre-existing issue. No problem.

Though, some general guidance on when something should go to GitHub and when it belongs on the mailing list would be very useful.

***

I’m not sure that what I’m arguing in my earlier email is a single issue, specifically. It’s more that it was an attempt to argue that we have four separate issues that we really should be discussing separately and not bundled together. And that one of the issues in this bundle is positively derailing and is a topic we should be avoiding.

This harkens back to Hadrien’s earlier point about how working from high level and vague definitions is more vulnerable to disagreements and derailing than working from a bullet point list of requirements or issues.

When it comes to identifiers:

- We are going to have URLs that function as locators and (if an identifier is omitted) serve as identifiers.

- We are going to need a discovery mechanism (almost certainly a link of some sort).

- We are going to need to support other identifiers, IRIs generally and URN namespaces specifically for ISBN support (without requiring their presence, as URLs can function in their place if the author chooses to omit one).

- We really, really, _really_ shouldn’t be minting new URL schemes, protocols, or registry/redirection mechanisms.
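
To make the first three points concrete, here is a rough Python sketch, not a proposal. It assumes discovery happens through an HTML link tag with a format-specific rel value, and that a publication falls back to its URL when no separate identifier is declared. The rel value `publication`, the function names, and the example URLs are all hypothetical; nothing here has been defined by the working group.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical rel value, used only for illustration.
MANIFEST_REL = "publication"


class LinkFinder(HTMLParser):
    """Collect href values of <link> elements that carry a given rel token."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if self.rel in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def discover_manifest(page_url, html):
    """Return the absolute URL of the first matching link, or None."""
    finder = LinkFinder(MANIFEST_REL)
    finder.feed(html)
    return urljoin(page_url, finder.hrefs[0]) if finder.hrefs else None


def effective_identifier(locator_url, declared_identifier=None):
    """Use the declared identifier (e.g. a urn:isbn: URN) if present, else the URL."""
    return declared_identifier or locator_url
```

With this, a page at `https://example.org/book/index.html` containing `<link rel="publication" href="manifest.json">` would resolve to `https://example.org/book/manifest.json`, and a publication that declares an illustrative identifier such as `urn:isbn:0451450523` would be identified by that URN rather than by its URL.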

The first three are all separate topics that should be given the attention they need on their own (i.e. they are a bullet point in our list of things to figure out).  They are nuanced—possibly tricky—subjects that we need to clarify to reach consensus. Even my presentation of each issue is up for debate.

But we keep bundling these important topics with a dead-end “let’s reinvent identifiers/build a redirector/create a new addressing scheme” discussion. That is where this thread was going and where most of our discussions on identifiers seem to be heading, both here and in the GitHub issues (AFAICT from my attempts to read through the issues). It muddies the waters when we try to discuss the other identifier issues, which are much more manageable on their own.

Hypothetical new variations on PURLs, DOIs, and URI schemes are, I’m pretty confident to state, beyond the scope of what this working group can and should be working on. Even if it were in scope, I’d still think we shouldn’t be doing it and should instead leave it to other organisations who are better at this sort of thing (like the IETF).

The only _important_ issue in the email, though, is that I would like us to stop pursuing ideas that are extremely unlikely to happen or, in the unlikely event they do happen, are better handled by other organisations such as the IETF.

Issues like how to locate, dereference, and identify resources in the context of the burgeoning packaging spec are issues for the packaging spec, not for this WG. And that spec, quite rightfully, seems to be heading to the IETF. Which, as you might have guessed from the plurality of IETF documents I referenced in my email, is exactly the group that I think should be working on a format that archives and extends the HTTP protocol. They do HTTP. They do identifiers. They do protocols. They are the proper venue for this sort of invention and innovation.

- best
- Baldur Bjarnason
  baldur@rebus.foundation

> 
> Hey Baldur,
> 
> 
> I am not sure how the discussion got to these high level points, to be honest. I do not think (or at least I hope) anybody seriously considered defining our own identifier scheme, alternative protocols, etc; I think we should definitely keep away from those issues. We work with what is on the Web and, I believe, our mantra is to minimize any specification we do and definitely avoid touching the fundamentals. 
> 
> Ie, I basically agree with what is below, just let me add some non-fundamental comments. 
> 
> 
>> On 2 Aug 2017, at 22:04, Baldur Bjarnason <baldur@rebus.foundation> wrote:
> 
>> 
>> * A URL as both a locator and identifier is a given—if it’s on the web, that’s how it’s going to work—but we can’t change how a URL functions or behaves.
> 
> I believe, as we emphasized in the PWP document in the DPUB IG[1], we have to be very clear that these two notions/roles are separate and they may or may not coincide. We have to accept that there are communities that do use identifiers that are not a URL (ISBN is the typical case, with all its flaws).
> 
> I think what _is_ a given is that we have a URL that acts as a locator on the Web, because it _is_ the Web. And, of course, we have to accept how URL-s are defined, and we have to accept (and possibly exploit!) how URL-s and HTTP behave. But let us not decide in general that this URL is the identifier or not (see also my comment below).
> 
>> * Using a URL that doesn’t identify the publication (e.g. an external HTML page) to help people indirectly locate a publication should be a feature that we provide by specifying some form of discovery mechanism (some form of link—HTTP header or link tag—with a format-specific rel value is the usual way of doing this).
> 
> I am not sure I 100% understand what you mean here. I guess you refer to the (still undecided) issue of locating the WP's manifest (whatever it ends up looking like) using a URL. If so then yes, I completely agree that we have to provide a discovery mechanism.
> 
> But… alas! it is not easy to set up, at least for an average user, a proper HTTP based mechanism like, eg, content negotiation or control over the response headers. This is also a constraint we will have to work with: content negotiation should probably be one but not _the_ mechanism to achieve that.
> 
> (The difficulty of controlling those things is one of the reasons that the Web developers community often seems, these days, to reject any HTTP based mechanisms…)
> 
>> * A secondary globally unique identifier that is separate from the identifying and locating URL is useful for a variety of reasons but requiring one has as many downsides as it has upsides—the biggest downside being that most developers won’t provide one even if that makes the web publication invalid. I’m sure we will debate this but given that the functional advantages are largely in the area of distribution and portability I don’t see why this should be a requirement for non-portable web publications.
> 
> I would not refer to this as "secondary". As I said above, I believe we should separate the notion of a (globally unique) identifier from a locator and, hopefully, on Monday we could agree on some minimal level of requirements that we consider as fundamental in using something as an "identifier". We should be agnostic on whether a specific URL can be considered as an identifier or not, we should just recognize that these two notions are different and, in our manifest, we should provide a slot to add both.
> 
> As for the requirement (or not) of having it: I guess, in spec parlance, what you say is that having at least one global identifier assigned to a WP is a SHOULD but not a MUST. And, for the reasons you cite, I agree with this. 
> 
> That being said, there may be communities (either via explicit profiles that we may define later or just through some social agreement) that would have that as an absolute requirement, ie, a MUST. Scholarly journals are a typical case: having a globally unique identifier assigned to an article (which should be a WP) is an absolute must in that community and, furthermore, URL-s are not necessarily accepted as such (DOI-s are used for it these days)[2]. I would expect legal documents to have a similarly strong requirement. Maybe the usage of MUST instead of SHOULD would be part of specific profiles, or could be a requirement for a PWP or an EPUB4. This is for later.
> 
>> * We absolutely should not venture into the territory of extending existing protocols, minting new identifying schemes, or specifying a locator mechanism that mandates the implementation and maintenance of what are likely to be non-trivial server systems.
> 
> Absolutely and completely true. Actually, the charter puts the definition of new identification schemes explicitly out of scope, but what you say here is even a bit more general. 
> 
> Thanks!
> 
> Ivan
> 
> [1] https://www.w3.org/TR/pwp/#identification
> [2] I must note that there are serious debates about this in the scholarly publishing community, with some asking for the abolition of the predominance of DOI-s.
> 
> ----
> Ivan Herman, W3C 
> Publishing@W3C Technical Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
>
Received on Thursday, 3 August 2017 15:21:39 UTC