Re: html for scholarly communication: RASH, Scholarly HTML or Dokieli? from Peter Murray-Rust on 2017-09-10 (public-scholarlyhtml@w3.org from September 2017)

From: Peter Murray-Rust <pm286@cam.ac.uk>
Date: Sun, 10 Sep 2017 12:17:03 +0100
To: Ivan Herman <ivan@w3.org>
Cc: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>, Sarven Capadisli <info@csarven.ca>
Message-ID: <CAD2k14N1wajnrUbpscO51PMMWZV=JR8daye61-J3tFT8EJDtvA@mail.gmail.com>
On Sun, Sep 10, 2017 at 11:34 AM, Ivan Herman <ivan@w3.org> wrote:

>
>  My main point was: I think we need comparative data which we do not
> really have. (Note that I do not have any take in any of these formats,
> i.e., I do not really care which one comes out as the most appropriate
> one!).
>
> That's my position as well.

Here is a very concrete suggestion:

Europe PubMed Central (http://europepmc.org - disclaimer: I was on the Project
Board for many years) creates semantic versions (in JATS XML) of all Open
Access biomedical papers - this runs into several million. All papers are
converted into HTML, effectively independent of the publisher - there is no
lazy JS-based loading - all content is present in a single HTML file. Here
are some examples each from a different publisher:

http://europepmc.org/articles/PMC3915084
http://europepmc.org/articles/PMC3920264
http://europepmc.org/articles/PMC3953398

Note that the output is publisher-independent, clean HTML with tagsets
from JATS, DC etc. Whatever format or tagset is recommended by SH-CG, I
would expect it to be a day's or a weekend's work to write a converter.
Here, for example, are some main components:

Metadata (in HEAD)

<meta name="dc:date" content="2014/02"/><meta name="dcterms:isPartOf"
content="Mycopathologia [2014, 177(1-2)], pp29-39"/><meta
name="dc:identifier"
content="https://www.ncbi.nlm.nih.gov/pubmed/24436010"/>

<title>The Influence of Chemical Composition of Commercial Lemon
Essential Oils on the Growth of... - Europe PMC Article - Europe
PMC</title>

and optional style:

<link rel="stylesheet"
href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
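
As a rough illustration of how little code is needed to read the HEAD
metadata shown above, here is a minimal Python sketch using only the standard
library. The names MetaCollector and extract_dc_metadata are hypothetical
(they are not part of any Europe PMC tooling), and the sketch assumes the
metadata is carried in <meta name="dc:..."> or <meta name="dcterms:...">
elements as in the excerpt.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs that use a DC prefix."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags are also routed through this method.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content and name.startswith(("dc:", "dcterms:")):
            self.metadata[name] = content


def extract_dc_metadata(html_text):
    """Return a dict of Dublin Core style metadata found in html_text."""
    parser = MetaCollector()
    parser.feed(html_text)
    return parser.metadata


# The three meta elements quoted in the message above.
sample_head = (
    '<meta name="dc:date" content="2014/02"/>'
    '<meta name="dcterms:isPartOf" '
    'content="Mycopathologia [2014, 177(1-2)], pp29-39"/>'
    '<meta name="dc:identifier" '
    'content="https://www.ncbi.nlm.nih.gov/pubmed/24436010"/>'
)

for key, value in extract_dc_metadata(sample_head).items():
    print(key, "=", value)
```

A converter to whatever tagset SH-CG recommends would presumably start from
a dictionary like this one and re-emit it in the target vocabulary.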


Main body:


<span class="abs_citation_title" itemprop="name" datatype="xsd:string"
    property="dc:title">The influence of chemical composition of commercial
 lemon essential oils on the growth of Candida strains.</span>

 <div class="epmc_pageHolder articleContentPage fullPage" itemscope
    itemtype="http://schema.org/ScholarlyArticle"><span
style="display:none" property="dc:abstract" datatype="xsd:string"
    itemprop="description">Candida yeasts are saprophytes naturally
...
by C. albicans.</span><meta itemprop="datePublished"
content="2014/02"/><div id="article_body" itemprop="articleBody">...
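
The main-body markup can be read in much the same way. The following is a
minimal Python sketch, assuming the itemprop attributes shown in the excerpt
above; ItempropCollector is a hypothetical name, the sample text is a
placeholder rather than real article content, and nested itemscope handling
is deliberately ignored.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class ItempropCollector(HTMLParser):
    """Collect itemprop values: content attribute for <meta>, text otherwise."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._open = []  # stack of (tag, itemprop or None, text chunks)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        prop = attr.get("itemprop")
        if tag in VOID_TAGS:
            if prop and "content" in attr:
                self.values[prop] = attr["content"]
            return
        self._open.append((tag, prop, []))

    def handle_data(self, data):
        # Text counts towards every enclosing element that has an itemprop.
        for _tag, prop, chunks in self._open:
            if prop:
                chunks.append(data)

    def handle_endtag(self, tag):
        # Ignore void elements and stray end tags with no open partner.
        if tag in VOID_TAGS or all(t != tag for t, _, _ in self._open):
            return
        while self._open:
            open_tag, prop, chunks = self._open.pop()
            if prop:
                # Normalise whitespace in the collected text.
                self.values[prop] = " ".join("".join(chunks).split())
            if open_tag == tag:
                break


# Abbreviated stand-in for the markup quoted above (placeholder text).
sample_body = (
    '<div itemscope itemtype="http://schema.org/ScholarlyArticle">'
    '<span itemprop="description">Abstract text goes here.</span>'
    '<meta itemprop="datePublished" content="2014/02"/>'
    '<div id="article_body" itemprop="articleBody">'
    '<p>Body text goes here.</p></div>'
    '</div>'
)

collector = ItempropCollector()
collector.feed(sample_body)
print(collector.values)
```

No JavaScript execution is involved: everything is read from the single
static HTML file.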



IMO that gives us all the semantics we need for biomedical articles. It
uses standards where possible (DC, HTML) and has virtually no cruft. Many
years of work have gone into this and it is almost certainly the largest
body of standard scholarly material in HTML.

The only downside is that it's primarily for biomedical. But I don't see
anything that stops it being used for most other disciplines given that we
know how to link things together now and browsers generally honour this.

As it's portrayed here there are a few chunks that are repository cruft,
not content, but the main body will do what we need. I strongly suggest that
we create several SH CG examples from this as it represents real,
forward-looking practice deployed in the field millions of times.

P.





>
> Ivan
>
>
> ---
> Ivan Herman
> World Wide Web Consortium
> Publishing@W3C Technical Lead
> http://www.w3.org/People/Ivan/
> ORCID: 0000-0003-0782-2704
>
>
> On 10 Sep 2017, 11:36 +0200, Sarven Capadisli <info@csarven.ca>, wrote:
>
> On 2017-09-10 09:49, Ivan Herman wrote:
>
> I am afraid we are engaging in some sort of theoretical discussion here
> which will never end: do we want to use the full HTML5 or do we want
> to define a smaller structure by restricting to a subset of HTML5? I
> would think that we would be a bit ahead of this after the experiment
> Benjamin proposed: let us take a few real articles from various fields
> and see how they score with RASH and SH; it will become easier to
> have an idea.
>
> Maybe one more step would be, for each of those to also see how easy it
> would be for some of these articles to be formatted via CSS (or maybe
> CSS+JavaScript) to the formats that are in use (ACM, IEEE, etc). I am
> particularly worried about the incredible differences in article
> reference formats out there, and how could one author a paper so that
> the content could be adapted to any existing requirements (there is a
> reason why BibTeX is a separate engine to LaTeX…)
>
> Hi Ivan,
>
> have you come across dokieli?
>
> Allow me to introduce it to you in context of CSS. First, see some of
> dokieli's HTML patterns:
>
> https://dokie.li/docs#html-patterns
>
> It is used for the following different kinds of documents, all with
> different primary stylesheets (including print). Alternative stylesheets
> can be triggered from the dokieli menu or through supporting
> user-agents. There is *no* JavaScript requirement for the user-agent to
> get a hold of the "data".
>
> Articles:
> * http://csarven.ca/dokieli-rww (scholarly article with dynamic
> annotations)
> * http://csarven.ca/cooling-down-web-science (a pretty blog post)
> * https://dokie.li/ (a webpage)
> * https://linkedresearch.org/ (another webpage)
> * https://www.w3.org/TR/ldn/ (a W3C specification)
> * https://rhiaro.github.io/thesis/chapter1 (a thesis chapter)
> * http://ceur-ws.org/Vol-1549/ (a workshop proceeding)
> * http://semstats.org/2016/call-for-contributions (call for "papers")
> * https://dokie.li/acm-sigproc-sp (ACM Authoring guidelines)
> * https://dokie.li/lncs-splnproc (Springer/LNCS)
> * https://data.gov.ie/strategy (Ireland's open data strategy)
>
> Annotations:
> * As you well know, the examples in https://www.w3.org/TR/annotation-html/
> are derived from dokieli's patterns.
>
> Notifications:
> * eg.
> https://linkedresearch.org/annotation/csarven.ca/dokieli-rww/b6738766-3ce5-4054-96a9-ced7f05b439f
>
> Plenty more at:
>
> https://github.com/linkeddata/dokieli/wiki#examples-in-the-wild
>
> with different scholarly articles containing various scholarly information.
>
> Happy to report that we've covered a wider range of "scholarly
> information" than anything else on the table here. If that's an
> incorrect assumption, people can come forward with URLs to existing work.
>
> So, are the HTML patterns documented flexible enough to handle different
> cases, including scholarly information? Evidence suggests it to be the
> case. Happy to improve where necessary as always. It is not bullet
> proof. The patterns have come to a point (certainly not the end) where a
> range of things can be expressed without arbitrary or artificial
> constraints set. So, I'm having a hard time buying the argument for any
> subsetting unless one has the intention for the information to work
> *only* under certain 1) tools and 2) versions - let's face it, the
> minute we draw the line what's allowed and not allowed, that has to be
> dealt with straight on.
>
> For dokieli (in case the /docs is a boring read, nor do I expect anyone
> to read it, ... at the risk of repeating myself):
>
> * Information is human and machine-readable to the greatest extent
> possible.
> * Consuming core information does not require JavaScript and gives the
> lowest barrier for any consuming agent. Heck, try it out with links/lynx
> and compare it with whatever is brought to table in this mailing list.
> * dokieli's intention is to allow expressing HTML as accurately as
> possible (either by hand or as much as the UI allows), and put focus on
> RDF(a) for data/information exchange.
>
> What do the alternative "formats" do beyond *only* working with the
> frameworks they are capable of working within? Not human and
> machine-readable as they could be, that's for certain, in my opinion.
>
> If you are all stuck on having a "formal" "format" or whatever, I'll
> write a grammar for it and we can discuss that. How about that?
>
> PS: I sincerely apologise for the repetition (and probably the tone),
> but I feel that I'm probably not making my points clear enough. So, I
> guess I'll back off the mailing list "for a bit" :)
>
> Bon weekend,
>
> -Sarven
> http://csarven.ca/#i
>
>


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Received on Sunday, 10 September 2017 11:17:28 UTC