Re: JATS (was: Early draft is up) from Peter Murray-Rust on 2016-03-21 (public-scholarlyhtml@w3.org from March 2016)

From: Peter Murray-Rust <pm286@cam.ac.uk>
Date: Mon, 21 Mar 2016 17:41:27 +0000
To: Robin Berjon <robin@berjon.com>
Cc: Gareth Oakes <goakes@gpsl.co>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
Message-ID: <CAD2k14OUonBYKzbv7Z+3xBjCt_jmCVzf6O0NSv4+znf6LLB8Ew@mail.gmail.com>

I think this is a very useful analysis both strategically and technically.

What I am doing will certainly stress the JATS model. The intention is to
consume varied JATS from EuropePMC (over a million documents) and turn them
into computable documents. SH will be critical in narrowing the semantics.

So far I have found ca 215 element tags in probably about a thousand
documents. I'm actually working in Java, but I think the code is simple
enough to be ported easily. SH is used as the primary substrate, not least
because it can be displayed and annotated (we are working very closely with
Hypothes.is and, through them, the W3C annotation spec). I expect that
this will make searches rather fuzzy because authors' semantics are. (We
have "Materials", "Materials and Methods", "methodology", "experimental"
etc.) At this stage I am concentrating on precision rather than recall -
we may miss some sections because their labels are unclear. (And I doubt
that we want to come up with a standard mapping of section headings - it
wouldn't be used anyway.)
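
To illustrate the precision-over-recall point, here is a minimal sketch,
not the code described above. It assumes headings are normalised and then
matched exactly against a small local lookup. The class name, the KNOWN
table and the "methods" category are hypothetical, and the table is only
an illustration, not a proposed standard mapping.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class SectionLabels {
    // Hypothetical lookup built from the example headings quoted above.
    private static final Map<String, String> KNOWN = Map.of(
            "materials", "methods",
            "materials and methods", "methods",
            "methodology", "methods",
            "experimental", "methods");

    /** Returns a category only on an exact match; unclear labels give empty. */
    public static Optional<String> classify(String heading) {
        if (heading == null) {
            return Optional.empty();
        }
        // Lower-case, spell out "&", and collapse numbering, punctuation
        // and whitespace into single spaces.
        String key = heading.toLowerCase(Locale.ROOT)
                .replace("&", " and ")
                .replaceAll("[^a-z]+", " ")
                .trim();
        return Optional.ofNullable(KNOWN.get(key));
    }

    public static void main(String[] args) {
        System.out.println(classify("2. Materials & Methods")); // matched
        System.out.println(classify("Methods and Results"));    // unclear, skipped
    }
}
```

Because only exact matches are accepted, a heading such as "Methods and
Results" is skipped rather than guessed at, which trades recall for
precision.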

One early output should be a list of which JATS tags are actually most
commonly used and what linguistic labels are given to them.
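
A tag-frequency list of this kind could be produced roughly as follows.
This is a minimal sketch using the standard StAX API, not the code
described above. The class name JatsTagTally is hypothetical, and it
assumes the JATS files are available locally and that DTD processing can
be switched off for counting purposes.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class JatsTagTally {
    /** Adds one count per start tag found in the stream to the running tally. */
    public static void tally(InputStream xml, Map<String, Integer> counts)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: DTDs are not needed just to count element names.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    counts.merge(reader.getLocalName(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
    }

    /** Tallies every file named on the command line, most frequent tag first. */
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();
        for (String arg : args) {
            try (InputStream in = Files.newInputStream(Paths.get(arg))) {
                tally(in, counts);
            }
        }
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

A streaming reader is used so that each document is scanned once without
building a tree, which matters when the corpus runs to a million files.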

On Mon, Mar 21, 2016 at 2:15 PM, Robin Berjon <robin@berjon.com> wrote:

>
> [analysis snipped]

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Received on Monday, 21 March 2016 17:41:56 UTC