- From: Robin Berjon <robin@berjon.com>
- Date: Fri, 18 Mar 2016 10:42:26 -0400
- To: Peter Murray-Rust <pm286@cam.ac.uk>
- Cc: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>, all@contentmine.org
On 17/03/2016 20:57, Peter Murray-Rust wrote:
> For the record I am currently working on converting JATS-XML into
> ScholarlyHTML. This is the primary resource that we use for mining
> science.

Ah, then we should probably talk :) (But maybe offline? I'm not sure how
much the rest of the group is interested in JATS extraction.)

A couple of weeks ago I released "dejats"
(https://github.com/scienceai/dejats), a JS tool that converts JATS to
HTML. It is *not* a converter to Scholarly HTML, but it is meant to
enable one.

The motivation behind dejats (and the coming dedocx) is that existing
conversion tooling and pipelines for intricate formats such as JATS (or
docx, LaTeX, etc.) tend to be very inflexible. They make assumptions
about what should be extracted and drop information on the floor.
Changing their behaviour typically requires reaching into the code
directly, or in the best cases making use of an API as intricate as the
format.

The approach I'm taking instead is to have a first step that carries out
a conversion from the original format into HTML in the *dumbest and
stupidest* way possible (something which I believe I've done quite
excellently, if I do say so myself). Once you've produced a very dumb
HTML DOM from the source, you pass it successively through a sequence of
small and very simple tools, each of which gets the same DOM in turn and
modifies it in a straightforward and well-contained manner — essentially
the Unix philosophy of pipes of small tools applied to an HTML DOM. One
tool might make the title markup right, another will extract the journal
metadata. When you want to carry out a conversion, you pipe together the
steps you need. It's easy to share code and reuse the work of others.
This works quite well for formats that have a high degree of variability
— like JATS. (A minimal sketch of such a pipeline appears at the end of
this message.)

I haven't yet released tools that work with dejats, but internally I
already have four: managing the title, handling journal metadata,
handling article metadata, and transforming the sections of an article
in a (hopefully) sane way. It won't be today, but I'd be happy to give
them the clean-up they need and start releasing them next week.

> At present it's just XHTML which is well formed and with some
> degree of normalization, but there is some variation in the publishers'
> markup.

There can be a lot of variability in JATS. There's a reason for that:
it's meant to be a target format, and as such has to adapt to a fair
amount of variability in its input. That makes it great to get things
into, but it can make it hard to transform out of. In a way, the
essential difference between JATS and SH is that SH is also a target
format but is meant to be the *final*-step format (such that
transformation out of it ought not be necessary) and to have its
metadata extractable through tooling that is largely insensitive to
structure (RDFa).

> Do we plan to have a validator for ScholarlyHTML?

Yes, though I haven't had time to make this a priority. I want to go
through the spec and describe the validation much more strictly, which
ought to help anyone who wants to build one. (I'll get around to
implementing a validator if no one beats me to it, but I'm certainly
happy to have less work :)

> Anyway I'll let you know how we get on and feed early drafts of
> converted SHTML - which will almost certainly have serious errors...

I look forward to it!

> There are over 1 million documents to practice on!

It sounds like you're converting PubMed :)

--
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ — intelligent science publishing
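A minimal sketch of the kind of pipeline described above, assuming jsdom
for the DOM and two hypothetical steps (fixTitle, extractJournalMeta); it
illustrates the approach rather than reproducing code from dejats:

  // Each step is a plain function that receives the document and mutates
  // it in one well-contained way; steps can be shared and recombined.
  const { JSDOM } = require('jsdom');

  function fixTitle(document) {
    // Hypothetical rule: promote a non-<h1> article title to an <h1>.
    const title = document.querySelector('.article-title');
    if (title && title.tagName !== 'H1') {
      const h1 = document.createElement('h1');
      h1.innerHTML = title.innerHTML;
      title.replaceWith(h1);
    }
  }

  function extractJournalMeta(document) {
    // Hypothetical rule: lift journal metadata into a <meta> in <head>.
    const journal = document.querySelector('.journal-title');
    if (journal) {
      const meta = document.createElement('meta');
      meta.setAttribute('name', 'journal');
      meta.setAttribute('content', journal.textContent.trim());
      document.head.appendChild(meta);
      journal.remove();
    }
  }

  // Pipe together only the steps needed for a given conversion.
  function convert(dumbHtml, steps) {
    const dom = new JSDOM(dumbHtml);
    for (const step of steps) step(dom.window.document);
    return dom.serialize();
  }

  const html = convert(
    '<p class="article-title">A Title</p><p class="journal-title">A Journal</p>',
    [fixTitle, extractJournalMeta]
  );
  console.log(html);

Because each step only sees a DOM and makes one small change, steps can be
reordered, dropped, or swapped per source without touching a monolithic
converter.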