Re: article versions with JSON-LD from Stian Soiland-Reyes on 2015-10-21 (public-linked-json@w3.org from October 2015)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Wed, 21 Oct 2015 12:05:49 +0100
To: public-linked-json <public-linked-json@w3.org>
Message-Id: <1445419973-sup-2449@biggie>
Excerpts from Nico Schlömer's message of 2015-10-19 17:22:02 +0100:
> A scientific article is typically published in several revisions,
> e.g., a bunch of revisions on a preprint server like arXiv [1] plus
> possibly a version somewhere on a publisher's website [2]. The
> versions will generally be perceived as "the same article", but differ
> a little bit here and there. I'd hesitate to refer to them as the same
> document.

"The same" is almost never actually the same. It's generally a matter of
different abstractions and different representations
and descriptions, which might have varying timespans.


As pointed out earlier there are various vocabularies that deal with this in
different forms. I'll add some more to the list.. sorry!


## FaBIO

For publications I would look at the SPAR ontologies <http://www.sparontologies.net/> 
which include FaBIO <http://www.sparontologies.net/ontologies/fabio>, 
a mapping of the well-known FRBR model (Work/Expression/Manifestation/Item) as
used by libraries.

A particular Work can have many Expressions, 
e.g. a presentation, a journal article, a poster paper.
Identifying the work is usually hard as this is a higher-level abstraction that
is often not tracked on its own.

My take on using FaBIO is with an anonymous top-level fabio:ResearchPaper:

{ "@context": { "frbr": "http://purl.org/vocab/frbr/core#",
                "prism": "http://prismstandard.org/namespaces/basic/2.0/",
                "fabio": "http://purl.org/spar/fabio/" },
  "@type": "fabio:ResearchPaper", 
  "frbr:realization": [ 
    {  "@id": "http://arxiv.org/abs/123.45",
       "@type": "fabio:Preprint"
    },
    {
       "@id": "http://dx.doi.org/10.999/1.2.3.4",
       "@type": "fabio:JournalArticle",
       "prism:doi": "10.999/1.2.3.4",
        "frbr:embodiment": [
            { "@id": "http://journal.example.com/article15.pdf",
              "@type": "fabio:DigitalManifestation"
            },
            { "@id": "http://journal.example.com/article15.epub",
              "@type": "fabio:DigitalManifestation"
            },
            { "@type": "fabio:PrintObject",
              "frbr:hasExemplar": {
                  "@id": "http://library.example.org/physical/152"
              }
            }
        ]
    }
  ]
}


FaBIO allows you to make fine distinctions such as preprints vs. postprints,
definitive versions, ordered author lists, etc., and its background from
libraries means it's also easy to relate to physical objects like a
printed journal exemplar in a particular library.
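
As a rough sketch of how a consumer might read the nesting in the FaBIO example, the following Python (standard library only) walks a trimmed copy of that document and labels each node with its FRBR level. The helper names walk_frbr and as_list are made up for illustration, and the property names are simply the ones used in the example above.

```python
import json

# Trimmed copy of the FaBIO example from this message.
DOC = json.loads("""
{ "@type": "fabio:ResearchPaper",
  "frbr:realization": [
    { "@id": "http://arxiv.org/abs/123.45", "@type": "fabio:Preprint" },
    { "@id": "http://dx.doi.org/10.999/1.2.3.4",
      "@type": "fabio:JournalArticle",
      "frbr:embodiment": [
        { "@id": "http://journal.example.com/article15.pdf",
          "@type": "fabio:DigitalManifestation" },
        { "@type": "fabio:PrintObject",
          "frbr:hasExemplar": { "@id": "http://library.example.org/physical/152" } }
      ] }
  ] }
""")

# FRBR levels in order, each with the property leading one level down
# (property names as used in the example, not checked against the vocabulary).
LEVELS = [("Work", "frbr:realization"),
          ("Expression", "frbr:embodiment"),
          ("Manifestation", "frbr:hasExemplar"),
          ("Item", None)]

def as_list(value):
    """JSON-LD allows a single value or an array; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def walk_frbr(node, depth=0):
    """Yield (level, identifier, type) for a node and everything below it."""
    level, child_prop = LEVELS[depth]
    yield level, node.get("@id", "(anonymous)"), node.get("@type")
    if child_prop:
        for child in as_list(node.get(child_prop)):
            yield from walk_frbr(child, depth + 1)

rows = list(walk_frbr(DOC))
for level, ident, rdf_type in rows:
    print(f"{level:14} {ident}  [{rdf_type}]")
```

The anonymous Work comes out first, followed by its two Expressions, and the library copy ends up at the Item level.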


## Dublin Core Terms

I would add a warning about what might first seem like the obvious choice,
Dublin Core Terms <http://purl.org/dc/terms/>. This kind of relation is what
dcterms:isVersionOf and dcterms:isFormatOf (and their inverses
dcterms:hasVersion / dcterms:hasFormat) were meant for. Sadly these
properties have been misused by users who don't read the descriptions, in that
"hasVersion" has been misunderstood to only point to a versioned snapshot
of the same resource (i.e. not a variation-version).

Similarly dcterms:hasFormat ("A related resource that is substantially the same as the pre-existing
described resource, but in another format.") has been misunderstood to point to
a format definition (which is the job of dcterms:format) rather than to the
resource in a particular format.


I would use DC Terms somewhat like this:

{ "@context": { "dcterms": "http://purl.org/dc/terms/" },
  "@id": "http://dx.doi.org/10.999/1.2.3.4",
  "dcterms:hasVersion": [
      { "@id": "http://journal.example.com/article15",
        "dcterms:hasFormat": [
            { "@id": "http://journal.example.com/article15.pdf" },
            { "@id": "http://journal.example.com/article15.html" },
            { "@id": "http://journal.example.com/article15.epub" }
        ],
        "dcterms:isVersionOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
      },
      { "@id": "http://arxiv.org/abs/123.45",
        "dcterms:hasFormat": { "@id": "http://arxiv.org/pdf/123.45.pdf" },
        "dcterms:isVersionOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
      }
   ]
}
         

Obviously you can structure this top-down or bottom-up to your liking
(perhaps better reflecting the HTTP JSON-LD resource that was requested), and
use dcterms:is*Of or dcterms:has* depending on the direction.
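
To illustrate the two directions, here is a minimal Python sketch that takes a top-down document using dcterms:has* and lists the equivalent bottom-up dcterms:is*Of statements. The function name bottom_up is hypothetical, and the sketch assumes every node carries an "@id".

```python
# DC Terms properties and their inverses, as named in this message.
INVERSE = {"dcterms:hasVersion": "dcterms:isVersionOf",
           "dcterms:hasFormat": "dcterms:isFormatOf"}

def as_list(value):
    """JSON-LD allows a single value or an array; normalise to a list."""
    return value if isinstance(value, list) else [value]

def bottom_up(node, out=None):
    """Collect (child, inverse property, parent) from a top-down document."""
    out = [] if out is None else out
    for prop, inverse in INVERSE.items():
        for child in as_list(node.get(prop, [])):
            out.append((child["@id"], inverse, node["@id"]))
            bottom_up(child, out)  # descend into nested versions/formats
    return out

doc = {
    "@id": "http://dx.doi.org/10.999/1.2.3.4",
    "dcterms:hasVersion": [
        {"@id": "http://journal.example.com/article15",
         "dcterms:hasFormat": [
             {"@id": "http://journal.example.com/article15.pdf"},
             {"@id": "http://journal.example.com/article15.html"}]},
        {"@id": "http://arxiv.org/abs/123.45",
         "dcterms:hasFormat": {"@id": "http://arxiv.org/pdf/123.45.pdf"}},
    ],
}

statements = bottom_up(doc)
for subject, prop, obj in statements:
    print(subject, prop, obj)
```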

But you see here, using DC Terms in the (in my view) intended way of
hierarchical arrangement, there is no direct link between the arXiv preprint
and the published journal paper.

Sometimes the representation and its more abstract resource have the same
identifier (e.g. there is no ".html" extension), or there is no URI for the
published article in any format. This is a variant of the httpRange-14
problem which I would not delve too much into; instead simply drop "@id" on those
that do not have a URI.

The dcterms properties are still a bit too loose for my liking, as they don't really
express the relationship much beyond a "kind of sameness", and have unclear provenance
directionality.


## PROV-O


The PROV-O ontology <http://www.w3.org/TR/prov-o/> is obviously relevant for
the provenance aspect, e.g. the version on the publisher website can be said to be
prov:wasDerivedFrom the arXiv preprint, or even prov:wasRevisionOf it.  The statements
here become a bit murky provenance-wise if you probe hard enough, because the publisher
version is not truly based on the arXiv preprint - but I guess you generally don't want
to involve the third (evolving) copy of the article, say a .docx file on someone's laptop.

The detail level to use here depends on what provenance you have and what is relevant, 
e.g. if you are detailing the article at different submission stages you might 
have quite a bit more information than if you only have preprint and published
article. I'll assume the second:

{ "@context": { "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://journal.example.com/article15",
  "prov:wasRevisionOf": { "@id": "http://arxiv.org/abs/123.45" } 
}

One issue with prov:wasRevisionOf and prov:wasDerivedFrom is that they are
not required to point to the "strictly previous version" - they just point at
"some" older version.  Thus you might find other cases where you have
more intermediates:

{ "@context": { "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://journal.example.com/article15",
  "prov:wasRevisionOf": { "@id": "http://journal.example.com/submitted-for-peer-review/15" }
}
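
Because a revision link may skip intermediates, a consumer that wants the full lineage has to collect everything reachable rather than assume a single chain. A minimal Python sketch follows; the function name earlier_versions is hypothetical, and the link from the peer-review submission back to the arXiv preprint is invented for illustration. PROV does not declare these properties transitive, so this only gathers candidates for "some older version".

```python
from collections import deque

# subject -> set of resources it is stated to be a prov:wasRevisionOf.
# The published article points both at the intermediate and at the preprint.
REVISION_OF = {
    "http://journal.example.com/article15": {
        "http://journal.example.com/submitted-for-peer-review/15",
        "http://arxiv.org/abs/123.45",
    },
    # Assumed link, not stated in this message:
    "http://journal.example.com/submitted-for-peer-review/15": {
        "http://arxiv.org/abs/123.45",
    },
}

def earlier_versions(resource, revision_of):
    """Breadth-first walk over revision links; returns each older version once."""
    found, queue = [], deque([resource])
    while queue:
        for older in sorted(revision_of.get(queue.popleft(), ())):
            if older != resource and older not in found:
                found.append(older)
                queue.append(older)
    return found

lineage = earlier_versions("http://journal.example.com/article15", REVISION_OF)
print(lineage)
```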


PROV adds another very important aspect with prov:specializationOf and
prov:alternateOf - this means you can relate a generic resource to a more
specific one, e.g. you can express the relationship between the DOI and the published
article, or between the article and its PDF representation.  In PROV specialisations,
what is said about the more general resource should also be true
about the more specific resource, e.g. the authors and title should be the same.

A prov:alternateOf statement points to a resource which shares the same general
resource as this one.


{ "@context": { "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://journal.example.com/article15",
  "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" },
  "prov:alternateOf": { "@id": "http://arxiv.org/abs/123.45" }
}
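
The "same general resource" reading of prov:alternateOf can be sketched in Python: group the specializations by their general resource, and every pair of siblings is a candidate for an alternateOf statement. The function name implied_alternates is hypothetical, and the statement list is an illustrative set built from example URIs in this message.

```python
from collections import defaultdict
from itertools import combinations

# (specific, general) pairs, i.e. "specific prov:specializationOf general".
SPECIALIZATION_OF = [
    ("http://journal.example.com/article15", "http://dx.doi.org/10.999/1.2.3.4"),
    ("http://www.ncbi.nlm.nih.gov/pubmed/1234", "http://dx.doi.org/10.999/1.2.3.4"),
    ("http://journal.example.com/article15.pdf", "http://journal.example.com/article15"),
]

def implied_alternates(specializations):
    """Return unordered pairs of resources that share a general resource."""
    by_general = defaultdict(set)
    for specific, general in specializations:
        by_general[general].add(specific)
    pairs = set()
    for siblings in by_general.values():
        # Sort so each pair is emitted once, in a stable order.
        for a, b in combinations(sorted(siblings), 2):
            pairs.add((a, b))
    return pairs

alternates = implied_alternates(SPECIALIZATION_OF)
for a, b in sorted(alternates):
    print(a, "prov:alternateOf", b)
```

Here the journal article and the PubMed record both specialise the DOI, so they come out as alternates of each other; the PDF has no sibling and yields nothing.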

The inverse, prov:generalizationOf, can be used for the opposite direction. Here we
use it to present the representation formats:


{ "@context": { "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://journal.example.com/article15",
  "prov:generalizationOf": [
    { "@id": "http://journal.example.com/article15.pdf" },
    { "@id": "http://journal.example.com/article15.html" },
    { "@id": "http://journal.example.com/article15.epub" }
  ]
}


## PAV

In the PAV ontology <http://purl.org/pav/> we try to make common 
bibliographical provenance patterns for web resources
more easily expressed. PAV is mapped to PROV, so you can make the statements
above more specific using PAV.



{ "@context": { "pav": "http://purl.org/pav/" },
  "@id": "http://journal.example.com/article15",
  "pav:previousVersion": { "@id": "http://arxiv.org/abs/123.45" }
}


PAV provides retrieval/import statements, which are very useful when an article has
re-appeared in a different system with a different URI, and possibly in a different format.

So you could add a PubMed record as:

{ "@context": { "pav": "http://purl.org/pav/", "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://www.ncbi.nlm.nih.gov/pubmed/1234",
  "pav:importedFrom": { "@id": "http://journal.example.com/article15" },
  "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
}

Or a redistribution of the publisher's Open Access PDF as:

{ "@context": { "pav": "http://purl.org/pav/", "prov": "http://www.w3.org/ns/prov#" },
  "@id": "http://home.example.com/~alice/mypaper.pdf",
  "pav:retrievedFrom": { "@id": "http://journal.example.com/article15.pdf" },
  "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
}

(For added fun in getting the provenance lineage straight, redistribute the
publisher PDF on arXiv! :))

### PAV versions

To counter the DC Terms confusion, PAV provides a more specific hierarchical
pav:hasVersion that is only used to relate "version-versions", e.g. a v2.1.2
release. pav:hasCurrentVersion is a way to point to the authoritative
current version (at time of writing).  So if the publication system has URIs
for different stages of the article, this can be used to provide a perma-link to
whatever version you are currently returning.

On arXiv every version of the upload is available, with different URIs for each
version of both the entry (abstract) and representation (PDF).

Here combining with dcterms:hasFormat is easy, although detailing every version
of course gets verbose:

{ "@context": { "pav": "http://purl.org/pav/",
                "prov": "http://www.w3.org/ns/prov#",
                "dcterms": "http://purl.org/dc/terms/" },

  "@id": "http://arxiv.org/abs/1304.7224",
  "prov:specializationOf": { "@id": "http://dx.doi.org/10.1186/2041-1480-4-37" },
  "pav:hasCurrentVersion":
      { "@id": "http://arxiv.org/abs/1304.7224v6",
        "pav:previousVersion": { "@id": "http://arxiv.org/abs/1304.7224v5" },
        "dcterms:hasFormat": { "@id": "http://arxiv.org/pdf/1304.7224v6.pdf" }
      },
  "dcterms:hasFormat":
     { "@id": "http://arxiv.org/pdf/1304.7224",
       "pav:hasCurrentVersion": { "@id": "http://arxiv.org/pdf/1304.7224v6.pdf" }
     }
}
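
Since spelling out every version by hand gets verbose, the listing can be generated. The Python sketch below builds such a document for an arXiv identifier and a known latest version number, using the abs/pdf URI patterns shown above. The function name arxiv_versions is hypothetical, and it assumes versions are numbered v1..vN without gaps.

```python
import json

def arxiv_versions(arxiv_id, latest):
    """Build a JSON-LD document listing versions v1..latest of an arXiv entry."""
    base = f"http://arxiv.org/abs/{arxiv_id}"
    versions = []
    for n in range(1, latest + 1):
        node = {
            "@id": f"{base}v{n}",
            # Each versioned entry has its own versioned PDF representation.
            "dcterms:hasFormat": {"@id": f"http://arxiv.org/pdf/{arxiv_id}v{n}.pdf"},
        }
        if n > 1:
            node["pav:previousVersion"] = {"@id": f"{base}v{n - 1}"}
        versions.append(node)
    return {
        "@context": {"pav": "http://purl.org/pav/",
                     "dcterms": "http://purl.org/dc/terms/"},
        "@id": base,
        "pav:hasVersion": versions,
        "pav:hasCurrentVersion": {"@id": f"{base}v{latest}"},
    }

doc = arxiv_versions("1304.7224", 6)
print(json.dumps(doc, indent=2))
```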

The PAV version properties are subproperties
of PROV properties prov:generalizationOf, prov:alternateOf and prov:wasRevisionOf
so those can be implied.
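
What "implied" means here can be sketched in Python by walking up a subproperty table and adding the more general statements. The table below is an assumption based on the sentence above and should be checked against the PAV ontology itself; the function name implied is hypothetical.

```python
# Assumed subproperty links (child -> parent); verify against the PAV ontology.
SUBPROPERTY_OF = {
    "pav:hasCurrentVersion": "pav:hasVersion",
    "pav:hasVersion": "prov:generalizationOf",
    "pav:previousVersion": "prov:wasRevisionOf",
}

def implied(statements):
    """Return the statements plus everything implied via the subproperty chain."""
    result = set(statements)
    for subject, prop, obj in statements:
        while prop in SUBPROPERTY_OF:
            prop = SUBPROPERTY_OF[prop]
            result.add((subject, prop, obj))
    return result

stated = [
    ("http://arxiv.org/abs/1304.7224", "pav:hasCurrentVersion",
     "http://arxiv.org/abs/1304.7224v6"),
    ("http://arxiv.org/abs/1304.7224v6", "pav:previousVersion",
     "http://arxiv.org/abs/1304.7224v5"),
]

expanded = implied(stated)
for triple in sorted(expanded):
    print(*triple)
```

A PROV-only consumer could then work from the prov:* statements without knowing PAV.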


## Summary

So exactly which properties are best for you to use depends a bit on what
you mean by "version" :)


-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Wednesday, 21 October 2015 11:06:27 UTC