Re: html for scholarly communication: RASH, Scholarly HTML or Dokieli? from Peter Murray-Rust on 2017-09-08 (public-scholarlyhtml@w3.org from September 2017)

From: Peter Murray-Rust <pm286@cam.ac.uk>
Date: Fri, 8 Sep 2017 13:13:15 +0100
To: Johannes Wilm <mail@johanneswilm.org>
Cc: Ivan Herman <ivan@w3.org>, Benjamin Young <byoung@bigbluehat.com>, Silvio Peroni <silvio.peroni@unibo.it>, Robin Berjon <robin@berjon.com>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>, "Peter (pt) Sefton" <ptsefton@gmail.com>, Brian McMahon <bm@iucr.org>, "Rzepa, Henry" <h.rzepa@imperial.ac.uk>, Martin Fenner <mf@martinfenner.org>
Message-ID: <CAD2k14NwDTmCQ5bLRCJBkO1E1bJArhQS7KKh4okc=RQSQq0-Gg@mail.gmail.com>
(copying some of my erstwhile collaborators and originators of
"ScholarlyHTML" 6 years ago.)

For starters I am happy to lend my efforts to whatever emerges from this.

When proposing ScholarlyHTML in 2011 (in the Panton Arms pub in Cambridge)
we were conscious that HTML should be used as a simple, powerful,
well-supported basis for scholarly publications (not specifically STEM, but
probably with a tendency towards it). NLM-DTD (JATS XML as it now is) is
too complex for the average scientist and is aimed at publishers,
repository managers, archivers, etc. The group included:
 * Brian McMahon from IUCr (who have developed CIF, the best semantic
scientific publishing engine in the world for data-rich science). CIF
works, is used and is even loved by many. CIF is fairly isomorphous with
HTML - i.e. there can be lossless machine interchange.
 * Peter (PT) Sefton, who at the University of Southern Queensland (USQ)
developed ICE - the best scholarly authoring tool which was actually used
and loved by practising academics.
 * Henry Rzepa and PMR, who have developed XML for chemistry (Chemical
Markup Language) and also ran the development mailing list (XML-DEV) which
the W3C process used for 3 years.
 * Martin Fenner, who popularised new approaches to scientific publications.

A history (by Martin) can be found here:
http://blogs.plos.org/mfenner/2011/03/19/a-very-brief-history-of-scholarly-html/

Our vision in 2011 was that properly used HTML was sufficient and valuable
for scholarly publications. SHTML2011 was a first pass at that. It now
seems we have a larger critical mass.

Many positive things have happened in the last few years:
* browsers are more conformant
* CSS and SVG are accepted
* RDFa and other forms of semantic documents are better supported (e.g.
SPARQL).
* there are many more open source tools.

So technically we can do more or less whatever we want with relatively
little effort and we can show useful demos with the latest tools in JS,
etc. We just need to do it.

The key points for me are:
* do not try to be all-inclusive (JATS has 250 tags - a problem that all
provider-centered schemas have - TEI is probably similar).
* allow for fluidity and evolution. Do not prescribe what cannot be done -
if it's useful people will find a way of supporting it.
* make it accessible to machines. Understanding the relations in a 250-tag
set is impossible for anyone, so use as little as possible. Use standard
HTML and the key sectioning tags for scholarship.
* get it out there.

Personally I don't mind what it's called.

Now what I want :-)

I want to read the whole of the scholarly literature (10,000 articles/day)
in HTML and extract the facts. I want something where my machines can read
an HTML from a publisher and make sense of it. FWIW I already do this, but
the publishers' "HTML" is so awful that you would scream. 90% of a
downloaded HTML is publisher cruft - "why publisher X is so wonderful",
"see us on Facebook", "papers you might like to read (all published by
us)", etc. Much of it is Javascript of unknown purpose.

So in our AMI stack we throw away as much as possible - often 90%. Even
then it's awful. We have div-for-everything (<div class=p>,
<div class="table">), non-Unicode stuff, etc., and it's actually quite
difficult to find offline parsers.
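
(Illustrative sketch only, not the AMI code: one way the "throw away the
cruft" step could look using nothing but the Python standard library. The
tag sets, the class-to-tag mapping and the names CruftStripper and
strip_cruft are all assumptions made up for this example. Note that it
also drops every attribute, which a real tool would not want to do for
things like href.)

```python
from html import escape
from html.parser import HTMLParser

# All three collections are assumptions chosen for illustration.
DROP = {"script", "style", "nav", "aside", "footer", "iframe", "form"}
DIV_CLASS_TO_TAG = {"p": "p", "table": "table"}  # <div class=p> -> <p>
VOID = {"br", "hr", "img", "meta", "link", "input"}  # no end tag emitted


class CruftStripper(HTMLParser):
    """Drop cruft elements and attributes; rewrite div-for-everything."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # fragments of the cleaned document
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.div_names = []  # tag actually emitted for each open <div>

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        name = tag
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            name = next(
                (DIV_CLASS_TO_TAG[c] for c in classes if c in DIV_CLASS_TO_TAG),
                "div",
            )
            self.div_names.append(name)
        self.out.append(f"<{name}>")

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID:
            return
        name = tag
        if tag == "div":
            # Close with whatever tag the matching <div> was rewritten to.
            name = self.div_names.pop() if self.div_names else "div"
        self.out.append(f"</{name}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; re-escape.
            self.out.append(escape(data, quote=False))


def strip_cruft(page: str) -> str:
    parser = CruftStripper()
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


page = (
    '<div class="article"><nav>see us on Facebook</nav>'
    '<div class=p>Water boils at 373 K.</div>'
    '<script>track("x<y")</script></div>'
)
print(strip_cruft(page))  # <div><p>Water boils at 373 K.</p></div>
```

Comments and doctype declarations vanish as well, because the sketch
never overrides the handlers for them. It will miscount on badly nested
markup, which is exactly the kind of input publishers tend to produce.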

So I just want standard HTML, with a small set of scholarly tags...
AND
... a community of practice so I know that in a year or two my work won't
be wasted.

Hope that helps.

P.








On Fri, Sep 8, 2017 at 11:37 AM, Johannes Wilm <mail@johanneswilm.org>
wrote:

> Great!
>
> I am an anthropologist and a historian. I can put something together in
> those areas, if someone else takes STEM.
>
> On Fri, Sep 8, 2017 at 7:02 AM, Ivan Herman <ivan@w3.org> wrote:
>
>> That sounds like a great idea.
>>
>> I would also think doing that with two different types of papers would be
>> beneficial, namely one from a STEM field and one from, say, a historian or
>> sociologist. In my limited and anecdotical experience the habits in
>> humanities may be different than what we are used to in the technical
>> fields.
>>
>> Ivan
>>
>>
>>
>> On 7 Sep 2017, at 22:26, Benjamin Young <byoung@bigbluehat.com> wrote:
>>
>> Anyone fancy doing a comparative analysis or even mocking up the same
>> (ideally rather complex) article in ScholarlyHTML, RASH, and anything else
>> we'd care to compare/discuss?
>>
>> There are obviously overlaps from all these fabulous attempts (including
>> the internal ones at many publishers). It would be great to understand what
>> (beyond simple syntax choices) is quantifiably different in the approaches.
>>
>> One key thing provided by the W3C (as with the Apache Software
>> Foundation, etc) is clear governance and IP-related clearance.
>>
>> For something to be solidified in the marketplace, having that
>> governance and IP stuff clearly stated, organized, and operated on would be
>> most helpful.
>>
>> Thanks!
>> Benjamin
>>
>> --
>> http://bigbluehat.com/
>> http://linkedin.com/in/benjaminyoung
>> ------------------------------
>> *From:* Silvio Peroni <silvio.peroni@unibo.it>
>> *Sent:* Wednesday, September 6, 2017 5:35:55 PM
>> *To:* Johannes Wilm
>> *Cc:* Robin Berjon; Peter Murray-Rust; W3C Scholarly HTML CG
>> *Subject:* Re: html for scholarly communication: RASH, Scholarly HTML or
>> Dokieli?
>>
>> Hi Johannes,
>>
>> Just a clarification:
>>
>> I guess RASH is more tied to specific tools, and from the looks of it,
>> the format is not governed by any formal decision making process, so it's
>> basically up to the development team behind it? I mean I understand. Our
>> Fidus Writer format is also just what we decide to put into it. But I
>> wouldn't expect anyone else to adopt it either.
>>
>>
>> Well, the first version of RASH was released as the work of my
>> colleagues and me. However, we have always been open to suggestions and pull
>> requests via the GitHub repo, in particular when compliant with the
>> intended guidelines of the language – be pattern-based according to a
>> specific theory, adopt a minimum number of elements that enable the full
>> description of a scholarly paper, use one element for conveying a specific
>> structural semantics (e.g. you cannot choose between “em” and “i”, you have
>> to use “em”), avoid verbosity when possible (see how in-text reference
>> pointers to bibliographic references are handled), etc.
>>
>> In fact RASH has been modified and extended in the past thanks to several
>> contributions and suggestions by the community – e.g. single researchers,
>> as well as W3C working groups, such as DPUB-ARIA. The format has not
>> changed for a year now – we think it is pretty stable, indeed – and we
>> are focussing on the development of tools to extend the Framework right
>> now – as side projects and/or student theses. Of course
>> RASH is not a formal standard, since it is not released by any standard
>> organisation or institute. However it is a formal (i.e. there is a RelaxNG
>> grammar defining it) subset of HTML5.
>>
>> If my suspicion is correct, it sounds like the main difference is that in
>> RASH, several different ways of doing the same are allowed, whereas in
>> Scholarly HTML, just one way is allowed.
>>
>>
>> If you consider RASH as a format, then honestly it is quite strict, since
>> it allows scholarly documents to be marked up in a precise way, as defined
>> in its documentation
>> (https://rawgit.com/essepuntato/rash/master/documentation/index.html) –
>> while leaving the freedom of specifying RDF statements using any vocabulary.
>>
>> If you consider the RASH Framework (i.e. the set of tools available to
>> work with the RASH format) then yes, you can use different WYSIWYG ways
>> (OpenOffice, Word, and RAJE – the latter still in alpha testing) for
>> obtaining RASH documents, plus of course the possibility of writing a RASH
>> document by using a common text editor.
>>
>> If the tools exist for RASH but not for Scholarly HTML, could we then not
>> simply choose one of the various ways to express things in RASH and use
>> that sub format for interchange? Something like "Strict RASH". And would it
>> not be possible to continue the development of that under some kind of
>> community (if that is not the case yet), so that others can have a stake in
>> it as well?
>>
>>
>> As mentioned before, I think RASH is strict enough as HTML5 markup – you
>> do not have three different ways to express article structure, you have only
>> *one* way to do that, so as to remove ambiguities. And, as far as RASH is
>> concerned, I would love to organise or be involved in a formal community so
>> as to discuss how to extend it and its Framework, according to the needs of
>> various actors. Thus, I’m happy to talk about this, if there is interest.
>> Even, and in particular, in the Scholarly HTML Community Group, if people
>> think it is the appropriate space for such discussion.
>>
>> Have a nice day :-)
>>
>> S.
>>
>>
>>
>>
>>
>> ------------------------------------------------------------
>> ----------------
>> Silvio Peroni, Ph.D.
>> Department of Computer Science and Engineering
>> University of Bologna, Bologna (Italy)
>> Tel: +39 051 2095393
>> E-mail: silvio.peroni@unibo.it
>> Web: https://www.unibo.it/sitoweb/silvio.peroni/en
>> Twitter: essepuntato
>>
>>
>>
>> ----
>> Ivan Herman, W3C
>> Publishing@W3C Technical Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> ORCID ID: http://orcid.org/0000-0003-0782-2704
>>
>>
>
>
> --
> Johannes Wilm
> http://www.johanneswilm.org
> tel: +1 (520) 399 8880
>



-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Received on Friday, 8 September 2017 12:13:40 UTC