Re: Identifying a book on the Web today

Hi all,

I am concerned that we risk going beyond what this working group can specify within the timeline that the charter has laid out for us. Generally speaking, identifiers that have an actual function (i.e. are not just a GUID) are specified either as a layer on top of an existing protocol (as PURL is built on top of URLs and HTTP) or as new parameters of a protocol.

The first case is incredibly fragile (redirecting registries require funding, maintenance and organisation and shouldn't be a feature of a specification intended for general use). Maintaining a separate system on top of URLs has proven in the past to be pretty unreliable and hard to scale and comes with a host of security, privacy, and stability issues (e.g. think of the breakage caused by URL shorteners).

The second case is _firmly_ in the IETF/IANA wheelhouse. The IETF specifies the protocols and which parameters they support and the IANA maintains registries for those parameters for the IETF. A lot of the suggestions I’ve seen on this list regarding identifiers are in the territory of extending HTTP as a protocol (or adjacent to that territory, looking over the fence) which is a very problematic path for us to take.

In functional terms there’s very little a _web_ specification can mandate in terms of identifiers. URLs are an implicit part of the web as a system so we get those for free—the only downside is that we can’t change how they work (e.g. can’t mandate permanence or immutability of any sort). 

We could require that each publication has a globally unique identifier (GUID) of some sort (there are many sorts) without specifying which type of identifier. People could then choose to reuse the URL, mint a tag URI[1], or use an officially registered URN namespace of some sort (which includes ISBNs).
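
As a rough illustration of those three options, here is a minimal Python sketch. The example.org URL, the tag authority and date, and the helper names are all placeholders made up for this example; only the tag: and urn: syntaxes themselves come from the relevant RFCs.

```python
import uuid
from datetime import date


def tag_uri(authority: str, minted_on: date, specific: str) -> str:
    """Build a tag URI of the form tag:<authority>,<date>:<specific> (RFC 4151)."""
    return f"tag:{authority},{minted_on.isoformat()}:{specific}"


def isbn_urn(isbn: str) -> str:
    """Wrap an ISBN in the registered urn:isbn namespace, dropping hyphens and spaces."""
    digits = isbn.replace("-", "").replace(" ", "")
    return f"urn:isbn:{digits}"


# Option 1: reuse the publication's URL as-is (hypothetical URL).
url_id = "https://example.org/books/my-book/"

# Option 2: mint a tag URI from a domain the publisher controlled on that date.
tag_id = tag_uri("example.org", date(2017, 8, 2), "books/my-book")

# Option 3: use a registered URN namespace, such as an ISBN or a random UUID.
isbn_id = isbn_urn("978-0-596-15589-6")
uuid_id = uuid.uuid4().urn
```

None of these requires a new registry or protocol: the URL already exists, the tag URI only needs a domain (or email address) and a date, and the URN namespaces are already registered.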

Atom has this requirement, and it has been a source of much confusion, but because Atom is a syndication/distribution format the benefits were seen to outweigh the downsides.

This does not apply that much to web publications but might be relevant to portable web publications. A good outline of why two separate identifiers (URL and GUID) are useful in the Atom context can be found in one of Mark Pilgrim’s old blog posts (now archived on archive.org[2]). (Short version: URLs can change, which can cause confusion and repetition when you’re distributing a document widely beyond its origin.)

RSS has a guid element but it’s optional. The idea there is that in its absence the feed consumer should treat the URL (i.e. the link element) as the GUID. When an item lacks both, its treatment is unspecified. (There are many reasons why RSS is so tricky to implement but its tendency to leave large parts of how it should work unspecified is a big part of it.)
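
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the guid-then-link rule; the function name and the sample feed are made up, and returning None for the unspecified case is an assumption rather than anything RSS mandates.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def rss_item_identifier(item: ET.Element) -> Optional[str]:
    """Return the item's guid if present, else its link, else None."""
    for tag in ("guid", "link"):
        text = item.findtext(tag)
        if text and text.strip():
            return text.strip()
    # Neither element is present: the treatment is unspecified, so give up.
    return None


feed = ET.fromstring("""
<rss version="2.0"><channel>
  <item><guid>tag:example.org,2017:post-1</guid><link>https://example.org/1</link></item>
  <item><link>https://example.org/2</link></item>
  <item><title>No guid, no link</title></item>
</channel></rss>
""")

identifiers = [rss_item_identifier(item) for item in feed.iter("item")]
```

The third item is where consumers diverge in practice: each has to invent its own policy because none is specified.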

The recent JSON Feed effort[3], interestingly, switches RSS’s requirements around. The id is a required property (of the entry only, not the feed as a whole), needs only to be unique in the context of a single feed, and, in the absence of a url property, should be considered to be the item’s URL.

I think there is value in making a GUID for both the publication and individual primary resources an optional ‘best practices’ sort of thing for regular web publications (UAs and distributors would then treat the URL as a GUID in the absence of a separate GUID). Web publications won’t see much practical benefit (actual GUIDs separate from URLs are unlikely to be supported widely enough to be of much help for archiving or categorisation) but it could be really helpful in portable publications. A useful GUID, coupled with a modification date, would give many implementors the tools to at least attempt to tackle a whole host of important distribution problems. I think requiring them for either portable or non-portable web publications is pretty risky given how badly similar requirements have fared in earlier specs (e.g. Atom).
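
To show the kind of distribution problem a GUID plus a modification date helps with, here is a hedged Python sketch of de-duplicating copies of a publication that arrive from different locations. The Copy type, its field names, and the sample URLs are all hypothetical; nothing here is taken from an existing spec.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass
class Copy:
    url: str                      # where this copy was fetched from
    modified: datetime            # modification date declared by the copy
    guid: Optional[str] = None    # optional identifier separate from the URL

    @property
    def identity(self) -> str:
        # Fall back to the URL when no separate GUID is supplied.
        return self.guid or self.url


def latest_copies(copies: Iterable[Copy]) -> Dict[str, Copy]:
    """Keep only the most recently modified copy for each identity."""
    latest: Dict[str, Copy] = {}
    for copy in copies:
        current = latest.get(copy.identity)
        if current is None or copy.modified > current.modified:
            latest[copy.identity] = copy
    return latest


copies = [
    Copy("https://a.example/book/", datetime(2017, 7, 1), "urn:isbn:9780596155896"),
    Copy("https://mirror.example/book/", datetime(2017, 8, 1), "urn:isbn:9780596155896"),
    Copy("https://b.example/other/", datetime(2017, 6, 1)),
]
result = latest_copies(copies)
```

The two copies that share a GUID collapse into one entry (the newer mirror), while the copy without a GUID is keyed by its URL. Without the GUID, the mirror would look like a different publication.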

As I’ve said before (at least in a GitHub issue somewhere) the links people distribute as entry or acquisition points for a web publication don’t necessarily have to be the identifiers or locators for the publication itself. One of the things we’re going to have to specify is some sort of discovery process where UAs can automatically discover a web publication’s actual location based on the links present in an HTML file. This lessens the location and discovery requirements we make of the web publication URL itself.
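
A discovery step of this kind could look roughly like the Python sketch below. The rel value "publication" is purely a placeholder (no such value has been defined), and the sample page and manifest path are invented for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

PUBLICATION_REL = "publication"  # hypothetical rel value, not defined anywhere


class PublicationLinkFinder(HTMLParser):
    """Collect the targets of links whose rel includes PUBLICATION_REL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.found: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if PUBLICATION_REL in rels and href:
            # Resolve relative references against the page that linked to us.
            self.found.append(urljoin(self.base_url, href))


def discover_publication(html: str, page_url: str) -> List[str]:
    finder = PublicationLinkFinder(page_url)
    finder.feed(html)
    return finder.found


entry_page = '<html><head><link rel="publication" href="/editions/1/en/manifest.json"></head></html>'
locations = discover_publication(entry_page, "https://guide.example/")
```

The entry page's own URL never has to identify or locate the publication; it only has to carry the link.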

Fortunately, most of the prototypes and prior work people have been making in this area for web publications support using the URL as a primary identifier while letting people add a secondary identifier that helps with disambiguation, categorisation, and distribution. 

- Schema.org generally relies upon each individual format’s support for identifiers, which in JSON-LD’s case is the same as what Readium 2’s manifest format uses (see below), but it also provides a url property for URLs specifically (as opposed to IRIs)[4]
- Readium 2’s manifest requires both a ‘self’ link to identify the publication’s canonical URL (see below) and an identifier property (which, IIRC, functions as a JSON-LD @id/linked data IRI[5] and [6])
- HTML metadata supports both listing a canonical URL identifier and locator using link rel[7] and using Dublin Core in a meta tag to record other secondary identifiers[8] and [9]. In the absence of rel=“canonical” the current URL is considered to be the canonical one.
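
For the HTML case in the last item, a minimal Python sketch of reading both kinds of identifier might look like this. It assumes the secondary identifiers are recorded in meta elements named DC.identifier or DCTERMS.identifier (matched case-insensitively here as a leniency), and the sample markup and URLs are invented.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urljoin

# Assumed meta names, following the "DC." prefix convention for Dublin Core in HTML.
IDENTIFIER_NAMES = {"dc.identifier", "dcterms.identifier"}


class IdentifierExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None
        self.secondary: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link":
            rels = (attributes.get("rel") or "").lower().split()
            href = attributes.get("href")
            # Keep only the first canonical link encountered.
            if "canonical" in rels and href and self.canonical is None:
                self.canonical = href
        elif tag == "meta":
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content")
            if name in IDENTIFIER_NAMES and content:
                self.secondary.append(content)


def identify(html: str, page_url: str) -> Dict[str, object]:
    extractor = IdentifierExtractor()
    extractor.feed(html)
    # Without rel="canonical", the current URL is treated as the canonical one.
    canonical = urljoin(page_url, extractor.canonical) if extractor.canonical else page_url
    return {"canonical": canonical, "identifiers": extractor.secondary}


page = """
<head>
  <link rel="canonical" href="/editions/1/en/">
  <meta name="DC.identifier" content="urn:isbn:9780596155896">
</head>
"""
page_url = "https://guide.example/editions/1/en/index.html"
with_canonical = identify(page, page_url)
without_canonical = identify("<head></head>", page_url)
```

The URL stays the primary identifier and locator, and the Dublin Core value is carried alongside it as the secondary identifier.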

You’ll note that the majority of the references here either are IETF documents, provide information about IETF protocols (like Pilgrim’s post), or delegate the definition of the identifier to an IETF document (like JSON-LD does with IRIs). This is not a coincidence. Traditionally, this is a problem area where the IETF specifies the solutions, the IANA maintains the registries, and the W3C is just a user.

We can easily specify certain types of identifiers (e.g. ‘this needs to specifically be a URL and this other thing needs to be an IRI’).

But anything that requires building and maintaining new systems, new registries, new parameters to existing protocols, or (heaven forbid) a new protocol entirely has a very strong likelihood of requiring the involvement of other standards organisations and derailing this entire specification effort completely. I cannot stress enough how bad an idea this would be given the time constraints the working group’s charter sets us.

To summarise:

* A URL as both a locator and identifier is a given—if it’s on the web, that’s how it’s going to work—but we can’t change how a URL functions or behaves.
* Using a URL that doesn’t identify the publication (e.g. an external HTML page) to help people indirectly locate a publication should be a feature that we provide by specifying some form of discovery mechanism (some form of link—HTTP header or link tag—with a format-specific rel value is the usual way of doing this).
* A secondary globally unique identifier that is separate from the identifying and locating URL is useful for a variety of reasons but requiring one has as many downsides as it has upsides—the biggest downside being that most developers won’t provide one even if that makes the web publication invalid. I’m sure we will debate this but given that the functional advantages are largely in the area of distribution and portability I don’t see why this should be a requirement for non-portable web publications.
* We absolutely should not venture into the territory of extending existing protocols, minting new identifying schemes, or specifying a locator mechanism that mandates the implementation and maintenance of what are likely to be non-trivial server systems.

[1]: https://tools.ietf.org/html/rfc4151 (Personally, I think the tag URI scheme is pretty cool but I know there have been disagreements on it.)
[2]: http://web.archive.org/web/20110514113830/http://diveintomark.org/archives/2004/05/28/howto-atom-id
[3]: https://jsonfeed.org/version/1
[4]: http://schema.org/url
[5]: https://www.w3.org/TR/json-ld/#node-identifiers
[6]: https://www.ietf.org/rfc/rfc3987.txt (The relationship and mapping between URLs, URNs, URIs, and IRIs is well beyond the scope of this email and not really important at this stage, IMO.)
[7]: https://tools.ietf.org/html/rfc6596
[8]: http://dublincore.org/documents/dcq-html/
[9]: http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements#terms-identifier

- best
- Baldur Bjarnason
  baldur@rebus.foundation



> On 2 Aug 2017, at 13:56, Benjamin Young <byoung@bigbluehat.com> wrote:
> 
> Thanks for mentioning these, Makoto. David, I’d be curious to hear more about this approach “talking [your] language.” :)
>  
> Essentially, these (PURL and w3id.org at least) are alternate registries—in the same way ORCID, RRIDs, etc are centralized naming + locating registries.
>  
> I’m not keen on our *specification* selecting a singular, centralized registry for lookup/dereferencing as I feel it’s rather unwebby…
>  
> However, there’s certainly value in folks minting a longer lasting URL (essentially) for their Web Publication where the identifier (at least) is maintained by an (ideally) long lived entity (archive.org, etc) who will pay the Domain Name Tax every year (or already has…) to keep the identifiers online and (again, ideally) dereference-able.
>  
> That said, something that can exist “on top of” or “outside of” the domain name “rental economy” would scale better.
>  
> Or…perhaps…ignoring permanence altogether or narrowing the vector where “permanent” matters—like “my computer calls this publication http://...blah...”
>  
> Anyhow…needs more thinking…scoping…and probably coffee. :)
>  
> Cheers!
> Benjamin
>  
> From: David Wood [mailto:david.wood@ephox.com] 
> Sent: Tuesday, August 1, 2017 10:22 PM
> To: MURATA Makoto <eb2m-mrt@asahi-net.or.jp>; public-publ-wg@w3.org
> Subject: Re: Identifying a book on the Web today
>  
> Makoto is talking my language!
>  
> On Wed, 2 Aug 2017 at 09:58, MURATA Makoto <eb2m-mrt@asahi-net.or.jp> wrote:
> Benjamin,
>  
> Thank you for posting this.  I am wondering if permalink, PURL, or  w3id.org
> is useful.  See https://en.wikipedia.org/wiki/Persistent_uniform_resource_locator
>  
> Regards,
> Makoto
>  
> 2017-08-01 3:15 GMT+09:00 Benjamin Young <byoung@bigbluehat.com>:
> Hi all,
> 
>  
> 
> I've mentioned in other threads that I'm currently exploring the existing state of book(s) on the Web and hunting for openly licensed and easily fork-able/edit-able books to iterate from.
> 
>  
> 
> Right now, the one I look at most is CouchDB: The Definitive Guide, and below is a list of things that can be used to identify that publication and/or locate an instantiation of that publication (on my shelf, via my browser; or your shelf or your browser). Here goes...
> 
>  
> 
> CouchDB: The Definitive Guide
> 
>  - identifier
> 
>  - *not* a locator...but useful for searching a shelf or the Web to find the location of an instantiation
> 
>  - does not include clarification of rendition, language, or format
> 
>  
> 
> guide.couchdb.org
> 
>  - identifier (similar to the above)
> 
>  - *not* a (direct) locator of the publication
> 
>  - useful to humans ("visit guide dot couchdb dot org to get the book") and browsers (type it in, search for it)
> 
>  - lands at a promo page dedicated to routing to distinct instantiations of the book...but is as close to a "webby" non-rendition-specific identifier+locator combo
> 
>  
> 
> http://guide.couchdb.org/editions/1/de/index.html
> 
>  - identifier and locator
> 
>  - specific instantiation--exact rendition, language, and format
> 
>  - locates the Table of Contents of this rendition (etc) on the Web
> 
>  
> 
> http://guide.couchdb.org/editions/1/de/
> 
>  - identical instantiation returned (ToC) as with the above...but completely different identifier (i.e. they're not programmatically equivalent)
> 
>  
> 
> http://shop.oreilly.com/product/9780596155902.do
> 
> (also http://shop.oreilly.com/product/9780596155902)
> 
>  - identifier
> 
>  - *not* a locator
> 
>  - similar to guide.couchdb.org (but less useful to humans)
> 
>  - provided routing to distinct instantiations ("buy now")
> 
>  
> 
> https://www.safaribooksonline.com/library/view/couchdb-the-definitive/9780596158156/
> 
>  - identifier and locator
> 
>  - specific instantiation--exact rendition, language, and format
> 
>  - (currently the English version for me...might be language-negotiated for you...not tested)
> 
>  
> 
> http://www.powells.com/book/couchdb-the-definitive-guide-9780596155896/61-1
> 
>  - identifier
> 
>  - *not* a locator
> 
>  
> 
> ISBN13: 9780596155896
> ISBN10: 0596155891
> OCLC: 935422678
>  - identifier
>  - *not* a locator
>  
> 9780596155902
>  - identifier -- presumably (given the URLs above) an O'Reilly specific product identifier
>  - *not* a locator
>  
> urn:x-pdf:b65cf356c8b20307000445bd151b8017000000
>  - identifier
>  - *not* a locator
>  - used by Hypothes.is for finding annotations about the publication
>  - generated by PDF.js using this code: https://github.com/mozilla/pdf.js/blob/58c3ea08202becf007c304512c44726719acb508/src/core/core.js#L513
>  
> https://www.dropbox.com/home/Apps/O'Reilly%20Media/CouchDB_%20The%20Definitive%20Guide
>  - identifier
>  - *not* a locator
>  - folder that contains my personal, digital instantiations (epub, mobi, apk, pdf) of the publication
>  
> https://www.dropbox.com/home/Apps/O'Reilly%20Media/CouchDB_%20The%20Definitive%20Guide?preview=CouchDB_+The+Definitive+Guide.epub
>  - identifier and locator
>  - returns the English, First Edition, EPUB format rendition of the publication
>  
> C:\Users\byoung\Dropbox\Apps\O'Reilly Media\CouchDB_ The Definitive Guide\CouchDB_ The Definitive Guide.epub
>  - identifier and locator
>  - but it only works for me...offline
>  
> Obviously...there are many other identifiers and locators for both the publication and its instantiations.
>  
> Some of these identifiers are used inside of locators. Other locators make no mention of the other identifications (ISBN, for instance, isn't referenced from the guide.couchdb.org site...strangely).
>  
> So. Which one of these should my Web Browser use when identifying the publication and/or the instantiation? It has (now) seen all of these in some fashion (location bar, human-readable page contents, machine readable content). Is there one that can be considered "canonical"? Given that domains are *rented* can any URL be considered "permanent"? Does that even matter?
>  
> From a user perspective, I have activities I want to accomplish (search, annotate, read, discover), and some amount of identification (a URL, an ISBN, a PDF fingerprint) to start the process. In the end, I only care about accomplishing my activity--and I want all the technology to which I have access to come to my aid to help me accomplish it.
>  
> Defining the technical details and requirements to facilitate end user experiences is what we're here to accomplish.
>  
> For those interested in exploring what options we collectively have to solve these issues, please feel free to contribute your own personal explorations of identifiers and locators.
>  
> There will be Pros and Cons to all of the things we pick among, and my hope is that we can collectively begin defining them when implementation ideas are put forward.
>  
> Thanks for reading all this (if you did ;) ). I look forward to many more such explorations as we move forward.
>  
> Thanks, all,
> Benjamin
> --
> Information Architect
> 
> John Wiley & Sons, Inc.
> 
> --
> 
> http://bigbluehat.com/
> 
> http://linkedin.com/in/benjaminyoung
> 
> 
> 
>  
> -- 
> 
> Praying for the victims of the Japan Tohoku earthquake
> 
> Makoto
> -- 
> Regards,
> Dave

Received on Wednesday, 2 August 2017 20:05:30 UTC