Excellent. The STM and standards communities are obvious beneficiaries of the Annotations work and would likely be early adopters compared to others within the broader publishing space.

And standards-developing and publishing bodies are collaborating with each other and CrossRef as I write, Bill. I'm hopeful that this will facilitate earlier adoption of this group's deliverables!

Subsequent discussion on the list (much of which has gone beyond my technical knowledge) has focused on the ability of server-side processing via APIs to resolve some of these ambiguities, which I was delighted to find. I was worried that the fact that a DOI could not be relied upon to resolve directly to one and only one version and instance of a document, and that what it resolves to can change over time, would make it impossible to use the way the Open Annotations WG needs annotations to work. Not only is that not necessarily the case, Paolo and others are actually using DOIs now for this purpose, with mechanisms to deal with those potential ambiguities.

Your clarifications are all helpful and probably quite relevant in that context, so I'm copying the WG.

One other point: although I realized that there are CrossRef DOIs for datasets (thanks for the details, and reminding me how much that is in fact done), my main point was that _different metadata_ is associated with a CrossRef DOI and a DataCite DOI. Is that correct? Or when there is a CrossRef DOI associated with a dataset, is the metadata the same as if it had a DataCite DOI? (BTW I also knew CrossRef and DataCite are collaborating: kudos for that, of course! Ditto for ORCID and ISNI, though "talking" rather than "collaborating" may more accurately reflect the status of that. I think ISNI is going to be essential for organizational identification, as a complement to ORCID for contributors.)


Yep, you got it right--just a few notes to elaborate below:

I don't think you two get the W3C OAWG e-mails, and I wanted you to see what I just sent. You both may have comments or corrections to what I wrote.


I just want to reinforce the importance of this issue. In fact from a use case POV I think there are two issues:

--The same document referenced by multiple URIs.
This CAN be, but is not always, handled by CrossRef with Multiple Resolution, and only when the documents with the different URIs are the same version, typically the version of record. In this case, one DOI has more than one URI associated with it. The service provides a user-choice popup. A specific URI can be accessed with a CrossRef DOI plus a parameter to bypass the multiple resolution interface.

--Synchronizing annotations to multiple formats of the same document (that is the same _version_ of the same document . . . which implies what I would consider a third use case, versions, which we are already addressing).

And I also want to highlight this comment from Paolo:

> When tools like Domeo and Annotopia see a document, the first thing they do is capture available IDs. Domeo looks up for DOIs, PMIDs, PMCIDs, PIIs and so on. When sending the annotation to Annotopia, the bibliographic data are sent as description of the target document. This is done by reusing existing vocabularies/ontologies.

This is really essential for people to understand. I know many in this community are skeptical of IDs like the DOI that require implementation of support systems around them. But in the real world ;-) this is how this works.

The way to think of it is this: identifiers are proxies for metadata.

The systems associated with these IDs provide documented specifications for what _their_ metadata includes. And they usually also provide APIs for the retrieval of their stored metadata based on the identifier. So btw when Paolo refers to DOIs for a scientific or scholarly paper, he really means a "CrossRef DOI." A data set associated with that paper would have a different DOI (a "DataCite DOI") which would have entirely different metadata associated with it.

So this is very close to being true; just to be precise, data sets associated with papers can have "CrossRef DOIs." The difference for CrossRef is really the community. If the publisher is hosting or maintaining the data, it may be easier for them to add dataset DOIs at CrossRef. And several significant databases have been assigning data set DOIs through CrossRef for years. An example is the Protein Data Bank. Another is the Organisation for Economic Co-operation and Development (OECD). In fact there are almost a million data sets from 1100 databases with CrossRef DOIs. There are about 5 million DOIs assigned to data sets at DataCite.

CrossRef and DataCite have made a commitment to collaborate--for example, CrossRef's content negotiation APIs were extended to help with interoperability between the two registration agencies, and we have plans to work closely together going forward.
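For list readers who have not used the content negotiation route, here is a minimal Python sketch. It assumes the doi.org resolver honors an Accept header of application/vnd.citationstyles.csl+json for both CrossRef and DataCite DOIs; the function names are illustrative only, and nothing here is taken from either agency's own code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: this media type is accepted for CrossRef and DataCite DOIs alike.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi.strip(), safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi: str) -> dict:
    """Send the request and parse the JSON metadata the agency returns."""
    with urllib.request.urlopen(build_metadata_request(doi), timeout=30) as resp:
        return json.load(resp)
```

The point of the sketch is that the caller does not need to know which registration agency holds the DOI: the same request shape is used, and only the metadata that comes back differs.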

And the entertainment industry, which now also uses DOIs, obviously has entirely different metadata associated with those DOIs.

A sidenote: because the CrossRef DOI is so ubiquitous in STM, people tend to think it has _all possible metadata_. Nope! ;-) They think they can get an e-mail of a contributor from CrossRef, but that's not in the CrossRef metadata. But guess what? It's probably available via the ORCID ID that should be available in the CrossRef metadata, which would send a system to a different server to retrieve information about that specific contributor (and a scientific paper can have scores of contributors).

Yes, this is right. Though right now there are not a ton of ORCIDs in the CrossRef metadata, their number is growing and is expected to grow faster as publishers figure out how to get the right data from the right systems to CrossRef.
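To make the "follow the ORCID to a different server" step concrete, here is a small Python sketch. It assumes a metadata record shaped like a list of authors, each optionally carrying an "ORCID" field holding a URI; that shape and the sample record are assumptions for illustration, not a statement of what any given CrossRef record contains. The check-character rule is the ISO 7064 MOD 11-2 scheme that ORCID documents for its iDs.

```python
def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    chars = orcid.replace("-", "")
    if len(chars) != 16 or not chars[:15].isdigit():
        return False
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return chars[15].upper() == ("X" if check == 10 else str(check))


def contributor_orcids(record: dict) -> list[str]:
    """Return bare, checksum-valid ORCID iDs for contributors that carry one."""
    found = []
    for author in record.get("author", []):
        raw = author.get("ORCID")
        if not raw:
            continue  # most contributors may have no ORCID deposited yet
        bare = raw.rstrip("/").rsplit("/", 1)[-1]
        if orcid_checksum_ok(bare):
            found.append(bare)
    return found


# Illustrative record only; 0000-0002-1825-0097 is ORCID's own sample iD.
sample = {
    "author": [
        {"given": "Josiah", "family": "Carberry",
         "ORCID": "http://orcid.org/0000-0002-1825-0097"},
        {"given": "A.", "family": "Nother"},
    ]
}
orcids = contributor_orcids(sample)
```

Each iD returned would then be looked up at ORCID, not at CrossRef, for contributor details such as an e-mail address.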

Where I'm going with this is that it is WAY better to have these centralized, authoritative, ideally continually maintained repositories of _particular kinds of metadata with IDs associated with the metadata records_ than to try to ship boatloads of metadata all over the place with individual documents. Thus: Why We Need Identifiers, and Why Identifiers Need Support Systems.

Another example we've been looking at is institutional identifiers--candidates include Ringgold and the ISNI's new organizational ID. We have a taxonomy of some funding institutions (and they have a Funder ID) as part of our FundRef funding data service.

For a given community of users (scientists, librarians, scholars, data curators), getting a known ID like a CrossRef DOI or a DataCite DOI or an ORCID is just amazingly efficient. The metadata thus available may not be useful or relevant to users outside that sector, but for the users for whom that identifier and its support system were created, it saves the day.

I realize that you may be thinking "well this is all very interesting but what does this mean for OA?" I guess my point is that these purpose-built identifiers and the systems associated with them will not go away. Lacking a canonical and ubiquitous "work identifier," this is the ecosystem that we are working with now.

The demo is very interesting. Tangentially, it may be of interest that we have worked with Ubiquity on a few projects and they have become a sponsoring entity that agrees to fulfill CrossRef membership obligations (depositing DOIs and creating outbound reference links and paying the bills) on behalf of small publishers who may not have the resources to do so themselves.

--Bill Kasdorf

comments in line

Thanks for providing a use case on the wiki -

I think what you are saying is that the same document can be provided in different formats (e.g. HTML or PDF) at different portals (e.g. PubMed Central vs. the author's personal web site etc.) - I guess different portals could also offer the same format with different URLs as well.

Correct. This is a very common scenario for scientific papers, one of the main resources I annotate.

The use case also says that sometimes these various targets should be treated as the same despite having different URLs and sometimes should be treated as different, depending on user choice.

Correct. For instance if I annotate with Domeo an HTML version, I want to see the same annotations on my PDF version through the Utopia client. This is in fact already implemented through the Annotopia server:

Thus I have questions

- how can a system know that two documents are different representations of the same document when they have different URLs?

When tools like Domeo and Annotopia see a document, the first thing they do is capture available IDs. Domeo looks up for DOIs, PMIDs, PMCIDs, PIIs and so on. When sending the annotation to Annotopia, the bibliographic data are sent as description of the target document. This is done by reusing existing vocabularies/ontologies.
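As a rough illustration of the "capture available IDs" step, here is a Python sketch that scans a page's text for DOIs, PMIDs and PMCIDs. This is not how Domeo or Annotopia actually do it; the regular expressions are simplified heuristics and the function name is made up for the example.

```python
import re

# Heuristic patterns, assumed for illustration; real-world IDs have more edge cases.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
PMCID_RE = re.compile(r"\bPMC\d+\b")
PMID_RE = re.compile(r"\bPMID:?\s*(\d+)\b", re.IGNORECASE)


def capture_ids(text: str) -> dict[str, list[str]]:
    """Collect candidate identifiers found in the text of a document."""
    # Trailing punctuation is usually sentence punctuation, not part of the DOI.
    dois = [match.rstrip(".,;)") for match in DOI_RE.findall(text)]
    return {
        "doi": dois,
        "pmid": PMID_RE.findall(text),
        "pmcid": PMCID_RE.findall(text),
    }


page_text = "doi:10.1234/abc.def-5, PMID: 12345678; PMCID: PMC7654321."
ids = capture_ids(page_text)
```

The identifiers in `page_text` are placeholders. Whatever is captured this way would travel with the annotation as the description of its target.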

- why would an end-user want only to provide annotations for a specific representation of the same target and not have it apply to all versions?

It depends on the task. If the task is to compare output formats you might want to do that. Also, different formats might differ in layout, and the annotation might relate to that.
In general, it is important to know exactly which variant motivated the annotation so that the process can be fully understood.

- should we simplify the use case to how to share annotations for a target that has multiple instances with different URLs.

I guess so. Keeping in mind that one URL can refer to HTML and one to PDF?

It seems the big issue here is that different URLs might refer to the same target, and how to handle that.

Yup. In my case I incorporate bibliographic data in the annotation. Alternatively, something else needs to do the job of finding that out.
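A toy Python sketch of that idea, assuming annotations are stored with the identifiers captured for their target: annotations are indexed by identifier instead of by URL, so one made on the HTML copy is found when the PDF copy is opened. The class, its methods and the example.org URLs are hypothetical; this is not Annotopia's data model.

```python
from collections import defaultdict


class AnnotationIndex:
    """Index annotations by document identifier (DOI, PMID, ...), not by URL."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, body: str, target_url: str, ids: dict[str, str]) -> None:
        # Keep the exact URL so we still know which variant motivated the note.
        record = {"body": body, "target": target_url}
        for scheme, value in ids.items():
            self._by_key[(scheme, value.lower())].append(record)

    def for_document(self, ids: dict[str, str]) -> list[dict]:
        """Return each annotation sharing at least one identifier, once."""
        seen, out = set(), []
        for scheme, value in ids.items():
            for record in self._by_key.get((scheme, value.lower()), []):
                if id(record) not in seen:
                    seen.add(id(record))
                    out.append(record)
        return out


index = AnnotationIndex()
index.add("Check this claim", "https://example.org/paper.html",
          {"doi": "10.1234/ABC", "pmid": "12345678"})
# The PDF copy lives at a different URL but exposes the same DOI.
shared = index.for_document({"doi": "10.1234/abc"})
```

Because each record keeps its original target URL, a client can still choose to show only the annotations made on a specific representation.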

I know I’m jumping ahead, but thought I’d ask now.

Good you asked :)

Dr. Paolo Ciccarese
Assistant Professor of Neurology, Harvard Medical School
Assistant in Neuroscience, Massachusetts General Hospital
Senior Information Scientist, MGH Biomedical Informatics Core

