- From: Gareth Oakes <goakes@gpsl.co>
- Date: Mon, 21 Mar 2016 22:34:21 +0000
- To: Peter Murray-Rust <pm286@cam.ac.uk>
- CC: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
- Message-ID: <65E179D3-21A3-4E4B-965A-BAB5CDFF7222@gpsl.co>
Hi Peter,

> What I am doing certainly stresses the JATS model. The intention is to
> consume varied JATS from EuropePMC - over a million - and turn them into
> computable documents. SH will be critical in narrowing the semantics.

Very cool. I also think SH has a role to play in expanding the
machine-accessible knowledge base beyond full-text articles and into
supplementary materials, research results, etc.

> I expect that this will make searches rather fuzzy because authors'
> semantics are. (We have "Materials", "Materials and Methods",
> "methodology", "experimental", etc.)

Yes, you have to draw the line at the level of content "intelligence" you
wish to serve up. For example, you can deliver content that is identified
down to the section level, but if your retrieval API is developed cleverly
enough, it could drive a machine learning system trained to recognise and
return results relevant to a particular user or their query.

I guess what I'm trying to say is that SH clearly can't be used to model a
complete, cohesive, semantic database of scholarly content. However, the
promise is that we will be able to get much closer than we are today.

(Side note: are there analogies between SH and the goals of standards like
DITA/S1000D, which aim to deliver the notion of an interoperable "content
supply chain"?)

> One early output should be a list of which JATS tags are actually most
> commonly used and what linguistic labels are given to them.

Possibly venturing into NLP and machine learning territory? (A quick
sketch of that kind of survey is below.)

// Gareth Oakes
// Chief Architect, GPSL
// www.gpsl.co
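A minimal sketch of the tag survey discussed above, assuming a local
directory jats/ of JATS XML files (the directory name and layout are
illustrative, not an established EuropePMC pipeline). It counts element
frequencies and the linguistic labels (section titles) given to <sec>
blocks:

    import collections
    import glob
    import xml.etree.ElementTree as ET

    tag_counts = collections.Counter()   # frequency of each JATS element
    sec_titles = collections.Counter()   # linguistic labels on <sec> blocks

    # Assumed layout: a local folder of JATS XML files, e.g. fetched
    # from EuropePMC ahead of time.
    for path in glob.glob("jats/*.xml"):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip malformed files rather than abort the survey
        for elem in root.iter():
            tag_counts[elem.tag] += 1
        # Normalise section titles to lower case so "Materials and Methods"
        # and "MATERIALS AND METHODS" are counted together.
        for sec in root.iter("sec"):
            title = sec.find("title")
            if title is not None and title.text:
                sec_titles[title.text.strip().lower()] += 1

    print("Most common JATS tags:")
    for tag, n in tag_counts.most_common(20):
        print(" ", tag, n)

    print()
    print("Most common section titles:")
    for title, n in sec_titles.most_common(20):
        print(" ", repr(title), n)

The normalised title counts would give an immediate picture of how fuzzy
the author-supplied labels are, and the same counters could seed a mapping
table (or training data) from author labels to SH section semantics.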
Received on Monday, 21 March 2016 22:34:52 UTC