Re: JATS (was: Early draft is up)

Hi Gareth,

On 20/03/2016 18:43, Gareth Oakes wrote:
>> A couple of weeks ago I released "dejats" 
>> (https://github.com/scienceai/dejats), a JS tool that converts JATS
>> to HTML.
> 
> Looks like a sensible tool. Sorry this might be getting off topic,
> but for this application I’m interested in the technology choice of
> Javascript over XSLT, if you are able to elaborate.
> 
> (We find XSLT quite productive for transformations involving XML
> inputs)

I have no problem with XSLT; I wrote quite a lot of it in a previous
life. But I would like our tooling to work both in Node and in the
browser. XSLT support in the browser can be tricky (at best) and may get
yanked out at some point. When you do get support it's v1-only anyway,
which makes reusing existing XSLT harder, since much of what's out there
is at least v2.

XSLT support in Node is, if anything, worse. There are options, but all
of those I've tried either segfault easily or require a fair bit of
manual tweaking to set up (which doesn't play well with deployment).

It ended up being simpler to just mimic the subset of the XSLT
template-matching algorithm that I needed. The heart of XSLT is pretty
simple; it's all the additional niceties that are hard to implement. The
core processing model for v1 fits into a single paragraph :) (cf.
https://www.w3.org/TR/xslt/#section-Processing-Model)
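
To give a sense of what I mean, here's the flavour of it. This is an
illustrative sketch, not the actual dejats code (in Node you'd need a
DOM implementation such as jsdom; in the browser it works as-is):

  // Illustrative sketch of XSLT-ish template matching in JS. The rules
  // below are made up for the example, not the dejats internals.
  const templates = [
    // most specific rules first, roughly like XSLT priorities
    { match: el => el.localName === 'article-title',
      render: (el, apply) => `<h1>${apply(el)}</h1>` },
    { match: el => el.localName === 'p',
      render: (el, apply) => `<p>${apply(el)}</p>` },
    // default rule: copy text and keep recursing, like xsl:apply-templates
    { match: () => true,
      render: (el, apply) => apply(el) },
  ];

  function applyTemplates (node) {
    if (node.nodeType === 3) return node.textContent; // text node
    if (node.nodeType !== 1) return '';               // ignore the rest
    const apply = el => Array.from(el.childNodes)
                             .map(applyTemplates)
                             .join('');
    return templates.find(t => t.match(node)).render(node, apply);
  }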

I know that Saxonica recently announced a new XSLT to JS compiler. I'll
certainly be looking at it.

>> There can be a lot of variability in JATS. There's a reason for
>> that: it's meant to be a target format, and as such has to adapt to
>> a fair amount of variability in input. This is great to get things
>> into, but it can make it hard to transform out of. In a way, the
>> essential difference between JATS and SH is that SH is also a
>> target format but is meant to be the *final*-step format (such that
>> transformation out of it ought not be necessary) and to have its
>> metadata extractable through tooling that is largely insensitive to
>> structure (RDFa).
> 
> I still think there will be a variability in the amount of richness
> that SH articles will be able to provide. Publishers may or may not
> have content with a complete or consistent set of semantic
> information. Silly things like whether addresses are marked up
> properly, surnames/given-names correctly identified, <mixed-citation>
> vs <element-citation> use, whether author-supplied references are
> checked & corrected, citation styles, etc.

Absolutely, and those silly things add up quickly.

I think that the way to approach this is:

  1) If you have the structured data, then you must encode it like this.
(No need for variability when you do have the information.)

  2) If you only have the text, well, give us the text.

One of the neat things with RDFa is that at the processing level we can
have interoperability without having to care too much about structure.
The body of the article needs to be relatively regular in terms of
sections+hunks, but the rest can be pretty creative and you can still
get a nice JSON-LD tree out of it.
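
To make that concrete, here's the kind of framed tree you might get
out, whatever the underlying HTML looked like. The property names below
are just schema.org-style examples, not a normative SH shape:

  // Illustrative only: a framed JSON-LD tree rooted at the article
  // resource; the values and property names are made-up examples.
  const metadata = {
    '@context': 'http://schema.org',
    '@type':    'ScholarlyArticle',
    '@id':      'http://example.org/article/1',
    name:       'On the Variability of Input',
    author: [
      { '@type': 'Person', givenName: 'Ada', familyName: 'Lovelace' }
    ],
    citation: [
      { '@type': 'ScholarlyArticle', name: 'Some Cited Paper' }
    ]
  };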

> Obviously you can force standardisation and a minimal level of
> compliance, but that works against acceptance by potential users of
> SH. Could/should SH provide one standard that all publishers meet? Is
> multi-level compliance like JATS green/blue/orange a consideration?
> Or an extra level of conformance like JATS4R?

My experience with standards is that multiple levels of conformance
create more problems than they solve. (JATS4R is somewhat different in
that regard, in that it is trying to solve that problem rather than add
to it, but that's the general idea at least.)

My current thinking is that when you process an SH document, you
actually get two trees. One is the article tree, which is basically
little more than title/hunks/sections, with sections containing the
same structure recursively. The other is a metadata tree, which is
basically the JSON-LD tree rooted at the article resource (in JSON-LD
terms, the graph of the article is framed into the article resource).

Both trees have identifiers that make it possible (even relatively
easy) to merge them back together. What we do is store both separately
(largely so that the metadata can be edited and enriched on its own)
and then we have a React component that just merges the two.
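
Very roughly, the merge looks something like this. It's a sketch of the
idea, not our actual component, and the helper names are made up:

  // Sketch of the two-tree merge: resource identifiers are the join
  // key. Made-up helpers, not the component we actually ship.
  function indexMetadata (node, map) {
    map = map || new Map();
    if (node['@id']) map.set(node['@id'], node);
    for (const value of Object.values(node)) {
      if (Array.isArray(value)) {
        value.forEach(v => {
          if (v && typeof v === 'object') indexMetadata(v, map);
        });
      }
      else if (value && typeof value === 'object') indexMetadata(value, map);
    }
    return map;
  }

  function mergeTrees (articleNode, metadataIndex) {
    return Object.assign({}, articleNode, {
      // attach the matching metadata resource, if there is one
      meta: metadataIndex.get(articleNode.id),
      children: (articleNode.children || [])
        .map(child => mergeTrees(child, metadataIndex)),
    });
  }

  // usage: mergeTrees(articleTree, indexMetadata(metadataTree));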

Obviously, if instead of a tree you want an RDF graph, nothing prevents
that. Personally I've found it easier to work with two trees (even
though they are not isomorphic), and in fact I never actually work at
the RDF level, but YMMV! The core idea is that RDFa allows for a lot of
variability without impacting processing, and makes it possible to work
with whatever toolchain you like.

-- 
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ — intelligent science publishing
•

Received on Monday, 21 March 2016 14:16:04 UTC