- From: Robin Berjon <robin@berjon.com>
- Date: Fri, 18 Mar 2016 10:42:26 -0400
- To: Peter Murray-Rust <pm286@cam.ac.uk>
- Cc: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>, all@contentmine.org
On 17/03/2016 20:57, Peter Murray-Rust wrote:
> For the record I am currently working on converting JATS-XML into
> ScholarlyHTML. This is the primary resource that we use for mining
> science.

Ah, then we should probably talk :) (But maybe offline? I'm not sure how
much the rest of the group is interested in JATS extraction.)

A couple of weeks ago I released "dejats"
(https://github.com/scienceai/dejats), a JS tool that converts JATS to
HTML. It is *not* a converter to Scholarly HTML, but it is meant to
enable one.

The motivation behind dejats (and the coming dedocx) is that existing
conversion tooling and pipelines for intricate formats such as JATS (or
docx, LaTeX, etc.) tend to be very inflexible. They make assumptions
about what should be extracted and drop information on the floor.
Changing their behaviour typically requires reaching into the code
directly, or in the best cases making use of an API as intricate as the
format.

The approach I'm taking instead is to have a first step that carries out
a conversion from the original format into HTML in the *dumbest and
stupidest* way possible (something which I believe I've done quite
excellently, if I do say so myself). Once you've produced a very dumb
HTML DOM from the source, you pass it successively through a sequence of
small and very simple tools, each of which gets the same DOM in turn and
modifies it in a straightforward and well-contained manner — essentially
the Unix philosophy of pipes of small tools applied to an HTML DOM. One
tool might make the title markup right, another will extract the journal
metadata. When you want to carry out a conversion, you pipe together the
steps you need. It's easy to share code and reuse the work of others.
This works quite well for formats that have a high degree of variability
— like JATS. (A minimal sketch of such a pipeline appears at the end of
this message.)

I haven't yet released tools that work with dejats, but internally I
already have four: managing the title, handling journal metadata,
handling article metadata, and transforming the sections of an article
in a (hopefully) sane way. It won't be today, but I'd be happy to give
them the clean-up they need and start releasing them next week.

> At present it's just XHTML which is well formed and with some
> degree of normalization, but there is some variation in the publishers'
> markup.

There can be a lot of variability in JATS. There's a reason for that:
it's meant to be a target format, and as such has to adapt to a fair
amount of variability in its input. That makes it great to get things
into, but it can make it hard to transform out of. In a way, the
essential difference between JATS and SH is that SH is also a target
format but is meant to be the *final*-step format (such that
transformation out of it ought not be necessary) and to have its
metadata extractable through tooling that is largely insensitive to
structure (RDFa).

> Do we plan to have a validator for ScholarlyHTML?

Yes, though I haven't had time to make this a priority. I want to go
through the spec and describe the validation much more strictly, which
ought to help anyone who wants to build one. (I'll get around to
implementing a validator if no one beats me to it, but I'm certainly
happy to have less work :)

> Anyway I'll let you know how we get on and feed early drafts of
> converted SHTML - which will almost certainly have serious errors...

I look forward to it!

> There are over 1 million documents to practice on!

It sounds like you're converting PubMed :)

--
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ — intelligent science publishing
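A minimal sketch of the kind of pipeline described above, assuming jsdom
for the DOM and two hypothetical steps (fixTitle, extractJournalMeta); it
illustrates the approach rather than reproducing code from dejats:

  // Each step is a plain function that receives the document and mutates
  // it in one well-contained way; steps can be shared and recombined.
  const { JSDOM } = require('jsdom');

  function fixTitle(document) {
    // Hypothetical rule: promote a non-<h1> article title to an <h1>.
    const title = document.querySelector('.article-title');
    if (title && title.tagName !== 'H1') {
      const h1 = document.createElement('h1');
      h1.innerHTML = title.innerHTML;
      title.replaceWith(h1);
    }
  }

  function extractJournalMeta(document) {
    // Hypothetical rule: lift journal metadata into a <meta> in <head>.
    const journal = document.querySelector('.journal-title');
    if (journal) {
      const meta = document.createElement('meta');
      meta.setAttribute('name', 'journal');
      meta.setAttribute('content', journal.textContent.trim());
      document.head.appendChild(meta);
      journal.remove();
    }
  }

  // Pipe together only the steps needed for a given conversion.
  function convert(dumbHtml, steps) {
    const dom = new JSDOM(dumbHtml);
    for (const step of steps) step(dom.window.document);
    return dom.serialize();
  }

  const html = convert(
    '<p class="article-title">A Title</p><p class="journal-title">A Journal</p>',
    [fixTitle, extractJournalMeta]
  );
  console.log(html);

Because each step only sees a DOM and makes one small change, steps can be
reordered, dropped, or swapped per source without touching a monolithic
converter.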