
Re: html for scholarly communication: RASH, Scholarly HTML or Dokieli?

From: Rzepa, Henry S <h.rzepa@imperial.ac.uk>
Date: Sun, 10 Sep 2017 07:35:26 +0000
To: Peter Murray-Rust <pm286@cam.ac.uk>
CC: Johannes Wilm <mail@johanneswilm.org>, Ivan Herman <ivan@w3.org>, "Benjamin Young" <byoung@bigbluehat.com>, Silvio Peroni <silvio.peroni@unibo.it>, "Robin Berjon" <robin@berjon.com>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>, "Peter (pt) Sefton" <ptsefton@gmail.com>, Brian McMahon <bm@iucr.org>, Martin Fenner <mf@martinfenner.org>
Message-ID: <D99FAE07-3C0B-4274-A9DF-8CF11A932806@imperial.ac.uk>
From my own perspective of writing articles in chemistry, I have done this in HTML for 20+ years now, converting to Word only at the last stage so that my “co-authors” can then contribute. I have put at least as much effort into the CSS as I have into the HTML. Please excuse me if CSS is discussed in this forum as much as HTML, but I did not see it mentioned below. So we need a software authoring environment that supports not only HTML, but also the rich CSS set and any “small extensions” as noted below. I have used just Web browsers to achieve this throughout.

Peter also throws in "So I just want standard HTML, with a small set of scholarly tags…". Again, apologies for not having seen earlier discussions, but what is the expected support for this small set? As Martin Fenner describes in his review, my most important extension is largely nicely implemented via the WordPress extension Kcite, which takes a DOI, expands it to a full conventional journal citation via a Crossref query and then auto-numbers it at the end of the article. But this only works within WordPress. DOI handling is (IMHO) becoming ever more important, since I view PIDs for data (and other research objects) as at least as important as DOIs for articles (both representing valuable carriers of metadata). As it happens, Kcite for data has now been broken for about a year, and I much regret that.
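
To make the "small extension" concrete, here is a minimal sketch of the numbering step only, written outside WordPress. It is not Kcite's code: the `[cite]…[/cite]` marker syntax, the function names and the example DOIs are assumptions for illustration, and the metadata lookup is a stub so the sketch runs offline. A real resolver would query Crossref or DataCite for each DOI instead.

```python
import re

# Assumed marker syntax for this sketch: [cite]DOI[/cite]
CITE = re.compile(r"\[cite\](10\.\d{4,9}/[^\s\[\]]+)\[/cite\]")


def number_citations(text, resolve):
    """Replace each DOI marker with [n]; return (body, reference list)."""
    order = {}  # DOI -> citation number, in order of first appearance

    def repl(match):
        doi = match.group(1)
        if doi not in order:
            order[doi] = len(order) + 1
        return f"[{order[doi]}]"

    body = CITE.sub(repl, text)
    refs = [f"{n}. {resolve(doi)} https://doi.org/{doi}"
            for doi, n in order.items()]
    return body, refs


def stub_resolve(doi):
    # Hypothetical stand-in for a metadata lookup; no network access here.
    return f"(citation for {doi})"


# Made-up DOIs, used only to exercise the numbering.
draft = ("Bonding [cite]10.5555/abc[/cite] and data [cite]10.5555/xyz[/cite]; "
         "see again [cite]10.5555/abc[/cite].")
body, refs = number_citations(draft, stub_resolve)
print(body)
print("\n".join(refs))
```

Because the resolver is passed in as a function, the same numbering would serve article DOIs and data PIDs alike; only the lookup behind `resolve` would differ.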

I finally add that, again for 20+ years, I have authored all my presentations in HTML, and nowadays I sprinkle PIDs liberally throughout, including one for the presentation itself. Whilst presentations are probably a lower quality source of data for mining than articles themselves, they might have short-term value (if not long-term persistence). I do this perhaps in the hope that e.g. Event Data might help make connections.

PS Martin Fenner: are there two, or just one (now with DataCite)? Hence the use of ORCID!

Can you take one more PS? I am in the middle of a communal science project about chemical bonds, involving about 15 co-authors, using not WordPress but MediaWiki to author the article. Perhaps easier than persuading them all to learn HTML+CSS: https://bondslam.dipc.org

Henry Rzepa, http://orcid.org/0000-0002-8635-8390

> On 8 Sep 2017, at 13:13, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
> (copying some of my erstwhile collaborators and originators of "ScholarlyHTML" 6 years ago.)
> For starters I am happy to lend my efforts to whatever emerges from this.
> When proposing ScholarlyHTML in 2011 (in the Panton Arms pub in Cambridge) we were conscious that HTML should be used as a simple, powerful, well-supported basis for scholarly publications (not specifically STEM, but probably with a tendency towards it). NLM-DTD (JATS XML as it now is) is too complex for the average scientist and is aimed at publishers, repository managers, archivers, etc. The group included:
>  * Brian McMahon from IUCr (who have developed the best semantic scientific publishing engine in the world for data-rich science - CIF). CIF works, is used and is even loved by many. CIF is fairly isomorphous with HTML - i.e. there can be lossless machine interchange.
>  * Peter (PT) Sefton, who at the University of Southern Queensland (USQ) developed ICE - the best scholarly authoring tool which was actually used and loved by practising academics.
>  * Henry Rzepa and PMR, who have developed XML for chemistry (Chemical Markup Language) and also ran the development mailing list (XML-DEV) which the W3C process used for 3 years.
>  * Martin Fenner who popularised new approaches to scientific publications.
> A history (by Martin) can be found here: http://blogs.plos.org/mfenner/2011/03/19/a-very-brief-history-of-scholarly-html/
> Our vision in 2011 was that properly used HTML was sufficient and valuable for scholarly publications. SHTML2011 was a first pass at that. It now seems we have a larger critical mass. 
> Many positive things have happened in the last few years:
> * browsers are more conformant
> * CSS and SVG are accepted 
> * RDFa, and other forms of semantic documents are better supported (e.g. SPARQL).
> * there are many more open source tools.
> So technically we can do more or less whatever we want with relatively little effort and we can show useful demos with the latest tools in JS, etc. We just need to do it.
> The key points for me are:
> * do not try to be all-inclusive (JATS has 250 tags - a problem that all provider-centered schemas have - TEI is probably similar). 
> * allow for fluidity and evolution. Do not prescribe what cannot be done - if it's useful people will find a way of supporting it.
> * make it accessible to machines. Understanding the relations in a 250-tag set is impossible for anyone, so use as little as possible. Use standard HTML and the key sectioning tags for scholarship.
> * get it out there.
> Personally I don't mind what it's called.
> Now what I want :-)
> I want to read the whole of the scholarly literature (10,000 articles/day) in HTML and extract the facts. I want something where my machines can read an HTML file from a publisher and make sense of it. FWIW I already do this, but the publishers' "HTML" is so awful that you would scream. 90% of a downloaded HTML file is publisher cruft - "why publisher X is so wonderful", "see us on Facebook", "papers you might like to read (all published by us)", etc. Much of it is Javascript of unknown purpose.
> So in our AMI stack we throw away as much as possible - often 90%. Even then it's awful. We have div-for-everything (<div class=p>, <div class="table">), non-unicode stuff etc., and it's actually quite difficult to find offline parsers.
> So I just want standard HTML, with a small set of scholarly tags...
> ... a community of practice so I know that in a year or two my work won't be wasted.
> Hope that helps.
> P.
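
As a rough illustration of the clean-up Peter describes (discarding scripts and mapping div-for-everything back to standard tags), here is a minimal offline sketch using only Python's standard library. It is not the AMI stack: the class-to-tag mapping, the names and the choice to drop all attributes are assumptions made for brevity.

```python
from html import escape
from html.parser import HTMLParser

DROP = {"script", "style"}                       # discarded with their content
DIV_CLASS_TO_TAG = {"p": "p", "table": "table"}  # assumed mapping for this sketch


class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0        # > 0 while inside a dropped element
        self.div_stack = []  # replacement tag chosen for each open <div>

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip += 1
            return
        if self.skip:
            return
        if tag == "div":
            cls = (dict(attrs).get("class") or "").strip()
            tag = DIV_CLASS_TO_TAG.get(cls, "div")
            self.div_stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, without attributes.
        if not self.skip and tag not in DROP:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip = max(0, self.skip - 1)
            return
        if self.skip:
            return
        if tag == "div":
            tag = self.div_stack.pop() if self.div_stack else "div"
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def clean(html_text):
    parser = Cleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = ('<div class="p">Hello <b>world</b></div><script>track()</script>'
          '<div class="table"><tr><td>1</td></tr></div>')
cleaned = clean(sample)
print(cleaned)
```

A real pipeline would also need to handle navigation blocks, character encodings and publisher-specific class names, none of which this sketch attempts.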

Received on Sunday, 10 September 2017 07:36:59 UTC
