Re: article versions with JSON-LD

Thanks everyone for the comments!

It seems that there are quite a few ways to express the semantics I
described. If I understand correctly, some of the mentioned ontologies are
actively being developed, some others are already quite mature, and there's
no absolute best choice for us.
In the end, our interest for https://paperhive.org is to make the site as
useful as possible for our users, especially from the scientific community.
I will discuss the matter with my colleagues and get back to you guys if we
need any more guidance.

Thanks again, cheers,
Nico



On Wed, Oct 21, 2015 at 1:07 PM Stian Soiland-Reyes <
soiland-reyes@cs.manchester.ac.uk> wrote:

> Excerpts from Nico Schlömer's message of 2015-10-19 17:22:02 +0100:
> > A scientific article is typically published in several revisions,
> > e.g., a bunch of revisions on a preprint server like arXiv [1] plus
> > possibly a version somewhere on a publisher's website [2]. The
> > versions will generally be perceived as "the same article", but differ
> > a little bit here and there. I'd hesitate to refer to them as the same
> > document.
>
> "The same" is almost never actually the same. It's generally a matter of
> different abstractions and different representations and descriptions,
> which might have varying timespans.
>
>
> As pointed out earlier, there are various vocabularies that deal with this
> in different forms. I'll add some more to the list... sorry!
>
>
> ## FaBIO
>
> For publications I would look at the SPAR ontologies
> <http://www.sparontologies.net/>, which include FaBIO
> <http://www.sparontologies.net/ontologies/fabio>, a mapping of the
> well-known FRBR model (Work/Expression/Manifestation/Item) as used by
> libraries.
>
> A particular Work can have many Expressions, e.g. a presentation, a journal
> article, a poster paper. Identifying the Work is usually hard, as this is a
> higher-level abstraction that is often not tracked on its own.
>
> My take on using FaBIO is with an anonymous top-level fabio:ResearchPaper:
>
> { "@context": { "frbr": "http://purl.org/vocab/frbr/core#",
>                 "prism": "http://prismstandard.org/namespaces/basic/2.0/",
>                 "fabio": "http://purl.org/spar/fabio/" },
>   "@type": "fabio:ResearchPaper",
>   "frbr:realization": [
>     {  "@id": "http://arxiv.org/abs/123.45",
>        "@type": "fabio:Preprint"
>     },
>     {
>        "@id": "http://dx.doi.org/10.999/1.2.3.4",
>        "@type": "fabio:JournalArticle",
>        "prism:doi": "10.999/1.2.3.4",
>         "frbr:embodiment": [
>             { "@id": "http://journal.example.com/article15.pdf",
>               "@type": "fabio:DigitalManifestation"
>             },
>             { "@id": "http://journal.example.com/article15.epub",
>               "@type": "fabio:DigitalManifestation"
>             },
>             { "@type": "fabio:PrintObject",
>               "frbr:exemplar": {
>                   "@id": "http://library.example.org/physical/152"
>               }
>             }
>         ]
>     }
>   ]
> }
>
>
> FaBIO allows you to draw fine distinctions such as preprints vs.
> postprints, definitive versions, ordered author lists, etc., and its
> background from libraries means it's also easy to relate to physical
> objects like a printed journal exemplar in a particular library.
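
The Work/Expression/Manifestation nesting described above can be walked with
plain dictionary access. What follows is only a minimal sketch: it assumes the
document is in compacted form and uses exactly the prefixed keys shown in the
example, and the helper names (as_list, manifestations_by_expression) are
made up for illustration.

```python
import json

# Hypothetical record shaped like the FaBIO example above
# (Work -> Expression -> Manifestation), trimmed for brevity.
doc = json.loads("""
{ "@type": "fabio:ResearchPaper",
  "frbr:realization": [
    { "@id": "http://arxiv.org/abs/123.45", "@type": "fabio:Preprint" },
    { "@id": "http://dx.doi.org/10.999/1.2.3.4",
      "@type": "fabio:JournalArticle",
      "frbr:embodiment": [
        { "@id": "http://journal.example.com/article15.pdf",
          "@type": "fabio:DigitalManifestation" }
      ]
    }
  ]
}
""")


def as_list(value):
    """JSON-LD allows a single node or a list of nodes; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def manifestations_by_expression(work):
    """Map each Expression @id to the @ids of its Manifestations."""
    result = {}
    for expression in as_list(work.get("frbr:realization")):
        result[expression.get("@id")] = [
            manifestation.get("@id")
            for manifestation in as_list(expression.get("frbr:embodiment"))
        ]
    return result


levels = manifestations_by_expression(doc)
```

A consumer that cannot rely on the compacted key names would more likely
expand the document with a JSON-LD processor first and match on full IRIs.
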
>
>
> ## Dublin Core Terms
>
> I would warn about what might at first seem like the obvious choice, the
> Dublin Core Terms <http://purl.org/dc/terms/>. This is what
> dcterms:isVersionOf and dcterms:isFormatOf (and their inverses
> dcterms:hasVersion / dcterms:hasFormat) were meant for. Sadly, these
> properties have been misused by users who don't read the descriptions:
> "hasVersion" has been misunderstood to point only to a versioned snapshot
> of the same resource (i.e. not a variation-version).
>
> Similarly, dcterms:hasFormat ("A related resource that is substantially
> the same as the pre-existing described resource, but in another format.")
> has been misunderstood to point to a format definition (which is what
> dcterms:format is for) rather than to the resource in a particular format.
>
>
> I would use DC Terms somewhat like this:
>
> { "@context": { "dcterms": "http://purl.org/dc/terms/" },
>   "@id": "http://dx.doi.org/10.999/1.2.3.4",
>   "dcterms:hasVersion": [
>       { "@id": "http://journal.example.com/article15",
>         "dcterms:hasFormat": [
>             { "@id": "http://journal.example.com/article15.pdf" },
>             { "@id": "http://journal.example.com/article15.html" },
>             { "@id": "http://journal.example.com/article15.epub" }
>         ],
>         "dcterms:isVersionOf":
>             { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
>       },
>       { "@id": "http://arxiv.org/abs/123.45",
>         "dcterms:hasFormat": { "@id": "http://arxiv.org/pdf/123.45.pdf" },
>         "dcterms:isVersionOf":
>             { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
>       }
>    ]
> }
>
>
> Obviously you can structure this top-down or bottom-up to your liking
> (perhaps better reflecting the HTTP JSON-LD resource that was requested),
> and use dcterms:is*Of or dcterms:has* depending on the direction.
>
> But you see here that, using DC Terms in the (in my view) intended way of
> hierarchical arrangement, there is no direct link between the arXiv
> preprint and the published journal paper.
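
Since the hierarchy only links each version to the shared parent, a consumer
has to derive the sideways relation itself. A minimal sketch, assuming a flat
list of nodes that each carry dcterms:isVersionOf as in the example; the
function name sibling_versions is invented for illustration:

```python
from collections import defaultdict

IS_VERSION_OF = "dcterms:isVersionOf"

# Flat list of version nodes, using the example URIs from above.
nodes = [
    {"@id": "http://journal.example.com/article15",
     IS_VERSION_OF: {"@id": "http://dx.doi.org/10.999/1.2.3.4"}},
    {"@id": "http://arxiv.org/abs/123.45",
     IS_VERSION_OF: {"@id": "http://dx.doi.org/10.999/1.2.3.4"}},
]


def sibling_versions(nodes):
    """For each version, list the other versions sharing the same parent."""
    by_parent = defaultdict(list)
    for node in nodes:
        parent = node.get(IS_VERSION_OF)
        if parent is not None:
            by_parent[parent["@id"]].append(node["@id"])
    return {
        version: [other for other in group if other != version]
        for group in by_parent.values()
        for version in group
    }


siblings = sibling_versions(nodes)
```

Note that this only says the two are versions of the same thing; it says
nothing about which came first.
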
>
> Sometimes the representation and its more abstract resource have the same
> identifier (e.g. there is no ".html" extension), or there is no URI for
> the published article in any format. This is a variant of the httpRange-14
> problem, which I would not delve too much into; instead, simply drop "@id"
> on those that do not have a URI.
>
> The dcterms properties are still a bit too loose for my liking, as they
> don't really express the relationship much beyond a "kind of sameness",
> and have unclear provenance directionality.
>
>
> ## PROV-O
>
>
> The PROV-O ontology <http://www.w3.org/TR/prov-o/> is obviously relevant
> to the provenance aspect, e.g. the version on the publisher website can be
> seen as prov:wasDerivedFrom the arXiv preprint, or even prov:wasRevisionOf
> it. The statements here become a bit murky provenance-wise if you probe
> hard enough, because the publisher version is not truly based on the arXiv
> preprint - but I guess you generally don't want to involve the third
> (evolving) copy of the article, say a .docx file on someone's laptop.
>
> The level of detail to use here depends on what provenance you have and
> what is relevant, e.g. if you are detailing the article at different
> submission stages you might have quite a bit more information than if you
> only have the preprint and the published article. I'll assume the second:
>
> { "@context": { "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://journal.example.com/article15",
>   "prov:wasRevisionOf": { "@id": "http://arxiv.org/abs/123.45" }
> }
>
> One issue with prov:wasRevisionOf or prov:wasDerivedFrom is that it is
> not required to be the "strictly previous version" - it's just pointing at
> "some" older version in a way. Thus you might find other cases where you
> have more intermediates:
>
> { "@context": { "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://journal.example.com/article15",
>   "prov:wasRevisionOf":
>       { "@id": "http://journal.example.com/submitted-for-peer-review/15" }
> }
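
Because prov:wasRevisionOf may skip intermediates, a lineage can only be
reconstructed from whatever links happen to be stated. A minimal sketch of
following such links, with a guard against accidental cycles. The second
statement in the table (the submitted version being a revision of the arXiv
preprint) is made up for illustration, as is the function name lineage:

```python
# Each key is stated to be prov:wasRevisionOf its value.
# The first entry comes from the example above; the second is hypothetical.
revision_of = {
    "http://journal.example.com/article15":
        "http://journal.example.com/submitted-for-peer-review/15",
    "http://journal.example.com/submitted-for-peer-review/15":
        "http://arxiv.org/abs/123.45",
}


def lineage(start, revision_of):
    """Follow revision links from `start` back to the oldest known version."""
    chain = [start]
    seen = {start}
    current = start
    while current in revision_of:
        current = revision_of[current]
        if current in seen:  # malformed data: stop instead of looping forever
            break
        seen.add(current)
        chain.append(current)
    return chain


history = lineage("http://journal.example.com/article15", revision_of)
```

The result is ordered newest first, and it is only as complete as the stated
links: a missing intermediate simply does not show up.
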
>
>
> PROV adds another very important aspect with prov:specializationOf and
> prov:alternateOf: you can relate a generic resource to a more specific
> one, e.g. you can express the relationship between the DOI and the
> published article, or between the article and its PDF representation. In
> PROV specialisations, what is said about the more general resource should
> also be true of the more specific resource, e.g. the authors and title
> should be the same.
>
> A prov:alternateOf points to another resource that shares the same general
> resource as this one.
>
>
> { "@context": { "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://journal.example.com/article15",
>   "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" },
>   "prov:alternateOf": { "@id": "http://arxiv.org/abs/123.45" }
> }
>
> The inverse, prov:generalizationOf, can be used for the opposite
> direction. Here we use this to present the representation formats:
>
>
> { "@context": { "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://journal.example.com/article15",
>   "prov:generalizationOf": [
>     { "@id": "http://journal.example.com/article15.pdf" },
>     { "@id": "http://journal.example.com/article15.html" },
>     { "@id": "http://journal.example.com/article15.epub" }
>   ]
> }
>
>
> ## PAV
>
> In the PAV ontology <http://purl.org/pav/> we try to make common
> bibliographical provenance patterns for web resources more easily
> expressed. PAV is mapped to PROV, so you can make the statements above
> more specific using PAV.
>
>
>
> { "@context": { "pav": "http://purl.org/pav/" },
>   "@id": "http://journal.example.com/article15",
>   "pav:previousVersion": { "@id": "http://arxiv.org/abs/123.45" }
> }
>
>
> PAV provides retrieval/import statements, which are very useful when an
> article has re-appeared in a different system with a different URI, and
> possibly in a different format.
>
> So you could add a PubMed record as:
>
> { "@context": { "pav": "http://purl.org/pav/",
>                 "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://www.ncbi.nlm.nih.gov/pubmed/1234",
>   "pav:importedFrom": { "@id": "http://journal.example.com/article15" },
>   "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
> }
>
> Or a redistribution of the publisher's Open Access PDF as:
>
> { "@context": { "pav": "http://purl.org/pav/",
>                 "prov": "http://www.w3.org/ns/prov#" },
>   "@id": "http://home.example.com/~alice/mypaper.pdf",
>   "pav:retrievedFrom": { "@id": "http://journal.example.com/article15.pdf" },
>   "prov:specializationOf": { "@id": "http://dx.doi.org/10.999/1.2.3.4" }
> }
>
> (For added fun in getting the provenance lineage straight, redistribute the
> publisher PDF on arxiv! :))
>
> ### PAV versions
>
> To counter the DC Terms confusion, PAV provides a more specific
> hierarchical pav:hasVersion that is only used to relate
> "version-versions", e.g. a v2.1.2 version. pav:hasCurrentVersion is a way
> to point to the authoritative current version (at time of writing). So if
> the publication system has URIs for different stages of the article, this
> can be used to provide a perma-link to whatever version you are currently
> returning.
>
> On arXiv every version of the upload is available, with different URIs for
> each version of both the entry (abstract) and the representation (PDF).
>
> Here combining with dcterms:hasFormat is easy, although detailing every
> version of course gets verbose:
>
> { "@context": { "pav": "http://purl.org/pav/",
>                 "prov": "http://www.w3.org/ns/prov#",
>                 "dcterms": "http://purl.org/dc/terms/" },
>
>   "@id": "http://arxiv.org/abs/1304.7224",
>   "prov:specializationOf":
>       { "@id": "http://dx.doi.org/10.1186/2041-1480-4-37" },
>   "pav:hasCurrentVersion":
>       { "@id": "http://arxiv.org/abs/1304.7224v6",
>         "pav:previousVersion": { "@id": "http://arxiv.org/abs/1304.7224v5" },
>         "dcterms:hasFormat": { "@id": "http://arxiv.org/pdf/1304.7224v6.pdf" }
>       },
>   "dcterms:hasFormat":
>      { "@id": "http://arxiv.org/pdf/1304.7224",
>        "pav:hasCurrentVersion":
>            { "@id": "http://arxiv.org/pdf/1304.7224v6.pdf" }
>      }
> }
>
> The PAV version properties are subproperties of the PROV properties
> prov:generalizationOf, prov:alternateOf and prov:wasRevisionOf, so those
> can be implied.
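
What "can be implied" means in practice is a subproperty closure over the
stated triples. A minimal sketch without any RDF library follows. The
SUPER_PROPERTY table is an assumption for illustration only (it guesses which
PAV property maps to which PROV property and should be checked against the
PAV ontology itself), and the function name with_implied is made up:

```python
# ASSUMED subproperty links, for illustration; verify against the PAV ontology.
SUPER_PROPERTY = {
    "pav:hasCurrentVersion": "pav:hasVersion",
    "pav:hasVersion": "prov:generalizationOf",
    "pav:previousVersion": "prov:wasRevisionOf",
}

# Stated (subject, property, object) triples, using the arXiv URIs from above.
stated = {
    ("http://arxiv.org/abs/1304.7224", "pav:hasCurrentVersion",
     "http://arxiv.org/abs/1304.7224v6"),
    ("http://arxiv.org/abs/1304.7224v6", "pav:previousVersion",
     "http://arxiv.org/abs/1304.7224v5"),
}


def with_implied(triples):
    """Return the triples plus everything implied via SUPER_PROPERTY."""
    result = set(triples)
    frontier = list(triples)
    while frontier:
        subject, prop, obj = frontier.pop()
        parent = SUPER_PROPERTY.get(prop)
        if parent is None:
            continue
        implied = (subject, parent, obj)
        if implied not in result:
            result.add(implied)
            frontier.append(implied)  # parents may have parents of their own
    return result


inferred = with_implied(stated)
```

A real RDFS reasoner would read the rdfs:subPropertyOf statements from the
ontology rather than from a hand-written table like this one.
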
>
>
> ## Summary
>
> So exactly which properties are best for you to use depends a bit on what
> you mean by "version" :)
>
>
> --
> Stian Soiland-Reyes, eScience Lab
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/
> http://orcid.org/0000-0001-9842-9718
>
>

Received on Friday, 23 October 2015 18:29:07 UTC