- From: Peter Murray-Rust <pm286@cam.ac.uk>
- Date: Mon, 21 Mar 2016 17:41:27 +0000
- To: Robin Berjon <robin@berjon.com>
- Cc: Gareth Oakes <goakes@gpsl.co>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
Received on Monday, 21 March 2016 17:41:56 UTC
I think this is a very useful analysis both strategically and technically. What I am doing will certainly stress the JATS model. The intention is to consume varied JATS from EuropePMC (over a million documents) and turn them into computable documents. SH will be critical in narrowing the semantics. So far I have found ca. 215 element tags in probably about a thousand documents. I'm actually working in Java, but I think the code is simple enough to be easily ported.

The SH is used as the primary substrate, not least because it can be displayed and annotated (we are working very closely with Hypothes.is and, through them, with the W3C annotation spec).

I expect that this will make searches rather fuzzy because authors' semantics are. (We have "Materials", "Materials and Methods", "methodology", "experimental", etc.) At this stage I am concentrating on precision rather than recall, so we may miss some sections because their labels are unclear. (And I doubt that we want to come up with a standard mapping of section headings; it wouldn't be used anyway.)

One early output should be a list of which JATS tags are actually most commonly used and what linguistic labels are given to them.

On Mon, Mar 21, 2016 at 2:15 PM, Robin Berjon <robin@berjon.com> wrote:
>
> [analysis snipped]

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069