Re: Early draft is up from Alf Eaton on 2016-03-18 (public-scholarlyhtml@w3.org from March 2016)

From: Alf Eaton <eaton.alf@gmail.com>
Date: Fri, 18 Mar 2016 15:09:03 +0000
To: Robin Berjon <robin@berjon.com>
Cc: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
Message-ID: <CAJVrAaT3BvJo=T76iMy_6o96NPmwFhTY=-HwSh-OM73Bhoi_rA@mail.gmail.com>

On 18 March 2016 at 14:42, Robin Berjon <robin@berjon.com> wrote:
>
> A couple of weeks ago I released "dejats"
> (https://github.com/scienceai/dejats), a JS tool that converts JATS to HTML.
>
> It is *not* a converter to Scholarly HTML, but it is meant to enable one.
>
> The motivation behind dejats (and the coming dedocx) is that existing
> conversion tooling and pipelines for intricate formats such as JATS (or
> docx, LaTeX, etc.) tend to be very inflexible. They make assumptions
> about what should be extracted and drop information on the floor.
> Changing their behaviour typically requires reaching into the code
> directly, or in the best cases making use of an API as intricate as the
> format.
>
> The theory to replace that is to have a first step that carries out a
> conversion from the original format into HTML in the *dumbest and
> stupidest* way possible (something which I believe I've done quite
> excellently, if I do say so myself). Once you've produced a very dumb
> HTML DOM from the source, you pass it successively to a sequence of
> small and very simple tools that each get the same DOM in turn and that
> each modify it in a straightforward and well-contained manner —
> essentially the Unix philosophy of pipes of small tools applied to an
> HTML DOM. One tool might make the title markup right, another will
> extract the journal metadata.
>
> When you want to carry out a conversion, you pipe together the steps you
> need. It's easy to share code and reuse the work of others. This works
> quite well for formats that have a high degree of variability — like JATS.
>
> I haven't yet released tools that work with dejats but internally I
> already have four: managing the title, handling journal metadata,
> handling article metadata, and transforming the sections of an article
> in a (hopefully) sane way. It won't be today, but I'd be happy to give
> them the clean up they need to start releasing them next week.
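
As a rough sketch of the pipeline idea quoted above: this is not dejats
itself (which is JS) nor any of the unreleased tools. Python's standard
library is used purely for illustration, and the step names, class names
and sample markup are all made up. Each step takes the DOM, changes one
thing, and hands the same DOM on.

```python
import xml.etree.ElementTree as ET


def fix_title(dom):
    """Hypothetical step: turn the 'dumb' title div into an <h1>."""
    for el in dom.iter("div"):
        if el.get("class") == "article-title":
            el.tag = "h1"
            del el.attrib["class"]
    return dom


def mark_journal_meta(dom):
    """Hypothetical step: turn the journal metadata div into a <header>."""
    for el in dom.iter("div"):
        if el.get("class") == "journal-meta":
            el.tag = "header"
    return dom


def pipe(dom, *steps):
    """Pass the same DOM through each step in order, like a Unix pipe."""
    for step in steps:
        dom = step(dom)
    return dom


# Assumed output of a 'dumb' first pass: every source element became a
# div or span carrying the original element name as its class.
DUMB_HTML = (
    '<div class="article">'
    '<div class="journal-meta">'
    '<span class="journal-title">Example Journal</span>'
    '</div>'
    '<div class="article-title">An Example Title</div>'
    '</div>'
)

dom = pipe(ET.fromstring(DUMB_HTML), fix_title, mark_journal_meta)
result = ET.tostring(dom, encoding="unicode")
print(result)
```

Adding or dropping a step is then just a matter of changing the argument
list passed to pipe(), which is the part that makes sharing steps easy.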

I like this idea a lot.

At PeerJ we use a single transformation[1] to convert JATS to HTML. It
starts similarly, from a default div/span whose class is the original
element name, but we have gradually added special cases where more
appropriate elements exist, followed by some later transformations on
the parsed DOM document.

[1] https://github.com/PeerJ/jats-conversion/blob/master/src/data/xsl/jats-to-html.xsl
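
For illustration only, the "default div/span plus special cases" approach
could look something like the following. This is not the PeerJ stylesheet
(that is XSLT); it is a minimal Python sketch in which the special-case
table, the inline list and the sample input are all assumptions, the
input is a simplified fragment rather than schema-valid JATS, and source
attributes are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical special cases: source elements with an obvious HTML counterpart.
SPECIAL_CASES = {"p": "p", "italic": "i", "bold": "b"}

# Hypothetical list of inline elements that default to <span> instead of <div>.
INLINE = {"surname", "given-names"}


def to_html(src):
    """Recursively map a source element to HTML.

    Default: a div (or span for inline elements) whose class is the
    original element name. Special cases get a more appropriate element.
    """
    name = src.tag
    if name in SPECIAL_CASES:
        out = ET.Element(SPECIAL_CASES[name])
    else:
        out = ET.Element("span" if name in INLINE else "div", {"class": name})
    # Carry over text content; attributes are deliberately not copied here.
    out.text = src.text
    out.tail = src.tail
    for child in src:
        out.append(to_html(child))
    return out


# Simplified, made-up input fragment.
SOURCE = (
    "<article><front>"
    "<article-title>On <italic>Things</italic></article-title>"
    "</front><body><p>Hello.</p></body></article>"
)

html_out = ET.tostring(to_html(ET.fromstring(SOURCE)), encoding="unicode")
print(html_out)
```

New special cases are added by extending the table, and anything not
listed still comes through with its original name as the class, so
nothing is dropped silently apart from the attributes noted above.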

Received on Friday, 18 March 2016 15:11:21 UTC