Re: Some Design Principles from Ivan Herman on 2015-12-01 (public-scholarlyhtml@w3.org from December 2015)

From: Ivan Herman <ivan@w3.org>
Date: Tue, 1 Dec 2015 09:59:39 +0100
To: Robin Berjon <robin@berjon.com>
Cc: W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
Message-Id: <2900FADD-52D8-4D3B-9E07-A0D0E42CAF4B@w3.org>
Robin,

these are just some additions, not disputing anything that you say. The common thread in these comments is that we may have to explore, and be aware of, some external constraints that may influence our design.

1. The sad reality is that academic publishing is very conservative (because people's livelihoods depend on playing by the rules), which also means that, at the end of the day, their publications must end up in reputable(?) journals. And, at least in 2015, most of those journals are still bound by formatting rules that are, though antiquated, nevertheless prevalent (the ACM or Springer formats are probably the best known examples in computer science or mathematics). This means that any environment based on SH can be successful only if it is possible to produce, through some clever software, HTML *as well as* PDF formats that abide by those rules. Similarly, authors still mostly use Microsoft Word to write their articles, and tools must exist to convert those into SH. These formats are indeed arbitrary and, as I said, antiquated and often motivated by the not-invented-here syndrome (a good example is the absolutely crazy proliferation of various formal reference formats in the bibliography). But I believe that we have to be pragmatic and not forget their existence.

Whether these considerations will affect the final definition of SH, I do not know. Mostly not, but maybe yes in some places (e.g., the reference format and vocabulary we use). But I think keeping this in the back of our minds all the time is important. Silvio's RASH format is a good example of such a full(er) environment, but I will let him comment on the details.

(A good example from my own experience: I am part of the steering committee for the WWW201X conferences. Our 'proceedings' have also been published by the ACM, in their digital library, for many years, although we also maintain the proceedings, free of charge for everybody, on the Web. We would like to have purely HTML-based proceedings eventually, and we are seriously considering doing that for WWW2017. But it is clear to us that having a copy of the articles in the ACM DL is important for our constituency.)

2. We should not look at SH in isolation, but should also consider the environment it would be in. Of course, SH is an interchange format but, in the long term, we would also like to see academic journals that are fully Web based using SH at their core. As an example, I am co-author of a paper at PeerJ CS[1]: as an author, the advantage is not (only) that it is in HTML, but also the whole process of getting there, which was offered (superbly, I must add) by PeerJ: the reviewing, the commenting, etc. SH should make that easy and smooth. Obviously, it is *not* our goal to come up with a standard environment like PeerJ. But, again, our design considerations should keep that in the back of our minds.

B.t.w., it would be good to have people from places like PeerJ, PLOS, arXiv, etc, in our midst…

3. The issue of archiving came up. I think we should also seriously consider, from the start, that an SH document, or more exactly the SH document plus the surrounding information, should also be storable, for offline usage *and* archiving, in some version of EPUB. Doing that means we can provide proper offline usage for the paper (and forget about PDF in that respect) as well as ensure a certain level of defense against link-rot.

EPUB 3 definitely has some restrictions on what can or cannot be included and what format can be used; we should know that. Note also that the W3C DPUB IG is working on a more general vision, a *draft* called Portable Web Publications[2] that may be the right environment to consider in the future. Again: we should keep that approach for archiving in mind...

(Full disclosure: I am co-editor of [2].)

Cheers

Ivan


[1] https://peerj.com/articles/cs-1/
[2] https://w3c.github.io/dpub-pwp/


> On 1 Dec 2015, at 05:13, Robin Berjon <robin@berjon.com> wrote:
> 
> 
> With the focus on interchange in mind, some design considerations come
> to mind.
> 
> The first is the scope of the data model for SH. I think that it should
> be the article. Don't get me wrong, I'm just as excited as the next
> person about liberating knowledge not just from antiquated formats but
> also from the shackles of the article form. The current insistence on
> narratives can be toxic, it can be very limiting, it slows down some
> areas of science, and it stymies reuse.
> 
> Having said that, the article is not going away any time soon, I would
> expect ever. Narratives are also very useful, when called for (which is
> often). And of course, there is a *lot* of existing content in article form.
> 
> This does not at all mean that projects to apply linked data to
> research, of which Linked Research is one great example, are wrong. On
> the contrary, I think SH should be designed in such a way that it can
> integrate well with them. This enables pipelines such as legacy content
> -> content mine -> LR. The result is likely not as great as if we got
> every researcher to use LR for everything from the get go, but that
> seems unlikely. With this approach, we can upgrade gradually.
> 
> Having articles as its scope, the choice of HTML as the baseline format
> should be (hopefully) obvious.
> 
> HTML can be quite a mess though, so we can't just say "HTML" and expect
> anything to work. You don't want <applet> and <marquee> of course, but
> you probably also don't want just a flat list of styled paragraphs (e.g.
> Word's data model). We need a specifically structured subset of HTML.
> 
> HTML is also limited in its semantics. It has a few things for scholarly
> content (such as paragraphs and sections) but that only takes you so
> far. Thankfully, that's where DPUB-ARIA kicks in. But we likely don't
> want all that's in DPUB-ARIA either, nor do we want it used arbitrarily.
> It is designed to also support exotic content, say books, that might be
> out of scope (or it might not — up for discussion). We probably want to
> rely on a prescriptive subset of DPUB-ARIA.
> 
> Then there are parts that DPUB-ARIA doesn't cover because it is generic
> to publishing and we are specialised to scholarly content (e.g.
> capturing sources of funding). For those parts we need to avail
> ourselves of semantic extension mechanisms like Microdata or RDFa (I
> would say more likely the latter if we prefer to use a format that isn't
> half-abandoned, though both have issues).
> 
> This then opens the question of which ontology/-ies to choose. My
> contention, which I know is not universally shared, is that semantics
> are only as useful as they are shared. Obviously, this has limits. My
> 6yo asked me the other day why we bothered having words like "house"
> when we could just as well get away with building-people-live-in, and we
> had a fun time regressing that into impossibly long words. If the most
> broadly understood vocabularies don't have a concept that *roughly*
> fits, then we can look into less used ones, and then we can invent
> something. Our SH currently makes use of an ad hoc ontology[0] but we
> consider that a bug — we plan to replace it entirely.
> 
> Semantic overlays required by the spec should also be restricted by use
> cases. Ideally there should be a common interoperable baseline that one
> can always expect to find, and then people who want to can go crazy on
> top of that. That enables interoperability and freedom at the same time.
> 
> So essentially, I propose that SH be entirely comprised of subsets of
> existing standards, with simple extensibility rules that dictate what
> can be guaranteed to interoperate, and what can be added safely but
> might not be universally understood. This is relatively easy to get right.
> 
> [0] https://github.com/scienceai/scholarly-article/
> 
> --
> • Robin Berjon - http://berjon.com/ - @robinberjon
> • http://science.ai/ — intelligent science publishing
> •
> 


----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Tuesday, 1 December 2015 08:59:50 UTC