Some Design Principles

With the focus on interchange in mind, a few design considerations
follow.

The first is the scope of the data model for SH. I think that it should
be the article. Don't get me wrong, I'm just as excited as the next
person about liberating knowledge not just from antiquated formats but
also from the shackles of the article form. The current insistence on
narratives can be toxic: it can be very limiting, it slows down some
areas of science, and it stymies reuse.

Having said that, the article is not going away any time soon, and I
would not expect it to ever. Narratives are also very useful, when called
for (which is often). And of course, there is a *lot* of existing content
in article form.

This does not at all mean that projects to apply linked data to
research, of which Linked Research is one great example, are wrong. On
the contrary, I think SH should be designed in such a way that it can
integrate well with them. This enables pipelines such as legacy content
-> content mine -> LR. The result is likely not as great as if we got
every researcher to use LR for everything from the get go, but that
seems unlikely. With this approach, we can upgrade gradually.

With articles as the scope, the choice of HTML as the baseline format
should (hopefully) be obvious.

HTML can be quite a mess though, so we can't just say "HTML" and expect
anything to work. You don't want <applet> and <marquee> of course, but
you probably also don't want just a flat list of styled paragraphs (e.g.
Word's data model). We need a specifically structured subset of HTML.
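
To make "structured subset" a little more concrete, here is a rough
sketch of a checker that reports elements falling outside an allow-list.
The allow-list below is made up purely for illustration (it is not a
proposal for what SH should permit), and the function name is my own.

```python
from html.parser import HTMLParser

# Hypothetical allow-list: the real subset would be defined by the spec.
ALLOWED_ELEMENTS = {
    "html", "head", "title", "body", "article", "section",
    "h1", "h2", "p", "a", "figure", "figcaption", "img",
}


class SubsetChecker(HTMLParser):
    """Records every start tag that is not in ALLOWED_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag not in ALLOWED_ELEMENTS:
            self.disallowed.append(tag)


def disallowed_elements(markup):
    """Return the out-of-subset element names in document order."""
    checker = SubsetChecker()
    checker.feed(markup)
    checker.close()
    return checker.disallowed
```

This only looks at element names. It says nothing about structure (for
instance, whether paragraphs sit inside sections), which a real
conformance check would also need to cover.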

HTML is also limited in its semantics. It has a few things for scholarly
content (such as paragraphs and sections) but that only takes you so
far. Thankfully, that's where DPUB-ARIA kicks in. But we likely don't
want all that's in DPUB-ARIA either, nor do we want it used arbitrarily.
It is designed to also support exotic content, say books, that might be
out of scope (or it might not — up for discussion). We probably want to
rely on a prescriptive subset of DPUB-ARIA.
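
The same idea applies to roles. A minimal sketch, assuming a permitted
set that I have picked arbitrarily from DPUB-ARIA role names (which roles
SH should actually allow is exactly what is up for discussion):

```python
from html.parser import HTMLParser

# Hypothetical choice of permitted roles, for illustration only.
PERMITTED_ROLES = {
    "doc-abstract",
    "doc-acknowledgments",
    "doc-bibliography",
    "doc-footnote",
}


class RoleChecker(HTMLParser):
    """Records doc-* roles that are not in PERMITTED_ROLES."""

    def __init__(self):
        super().__init__()
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "role" or not value:
                continue
            # The role attribute holds a space-separated list of tokens.
            for token in value.split():
                if token.startswith("doc-") and token not in PERMITTED_ROLES:
                    self.rejected.append((tag, token))


def rejected_roles(markup):
    """Return (element, role) pairs for DPUB-ARIA roles outside the subset."""
    checker = RoleChecker()
    checker.feed(markup)
    checker.close()
    return checker.rejected
```

Roles without the doc- prefix are deliberately left alone here, since
plain ARIA roles are a separate question.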

Then there are parts that DPUB-ARIA doesn't cover because it is generic
to publishing and we are specialised to scholarly content (e.g.
capturing sources of funding). For those parts we need to avail
ourselves of semantic extension mechanisms like Microdata or RDFa (I
would say more likely the latter if we prefer to use a format that isn't
half-abandoned, though both have issues).
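
As an illustration of what such an extension could look like, the
fragment below marks up a funding source with RDFa attributes. The choice
of schema.org terms (ScholarlyArticle, funder) is an assumption on my
part, as are the agency and article names. The small collector is not a
conforming RDFa processor; it only gathers property paths and their text
so the structure is visible, and it assumes every element has an explicit
end tag.

```python
from html.parser import HTMLParser

# Illustrative fragment; vocabulary and names are assumptions.
FRAGMENT = """
<article vocab="http://schema.org/" typeof="ScholarlyArticle">
  <h1 property="headline">An Example Article</h1>
  <p>Funded by
    <span property="funder" typeof="Organization">
      <span property="name">Example Funding Agency</span>
    </span>.
  </p>
</article>
"""


class PropertyCollector(HTMLParser):
    """Collects (property path, text) pairs from property attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: property name or None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(dict(attrs).get("property"))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Only record text whose immediate parent carries a property.
        if text and self.stack and self.stack[-1]:
            path = "/".join(p for p in self.stack if p)
            self.pairs.append((path, text))


def collect_properties(markup):
    collector = PropertyCollector()
    collector.feed(markup)
    collector.close()
    return collector.pairs
```

Run on the fragment, this yields the headline and a funder/name pair,
while the unannotated "Funded by" text is ignored.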

This then opens the question of which ontology/-ies to choose. My
contention, which I know is not universally shared, is that semantics
are only as useful as they are shared. Obviously, this has limits. My
6yo asked me the other day why we bothered having words like "house"
when we could just as well get away with building-people-live-in, and we
had a fun time regressing that into impossibly long words. If the most
broadly understood vocabularies don't have a concept that *roughly*
fits, then we can look into less used ones, and then we can invent
something. Our SH currently makes use of an ad hoc ontology[0] but we
consider that a bug — we plan to replace it entirely.

Semantic overlays required by the spec should also be restricted by use
cases. Ideally there should be a common interoperable baseline that one
can always expect to find, and then people who want to can go crazy on
top of that. That enables interoperability and freedom at the same time.

So essentially, I propose that SH be entirely composed of subsets of
existing standards, with simple extensibility rules that dictate what
can be guaranteed to interoperate, and what can be added safely but
might not be universally understood. This is relatively easy to get right.
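
To sketch how a consumer might treat that split: given the properties
found in a document, it can rely on the baseline ones, note which
baseline ones are absent, and pass over the rest without failing. The
baseline set below is hypothetical and only there to make the example
run.

```python
# Hypothetical interoperable baseline; the real one would come from use cases.
BASELINE_PROPERTIES = {"headline", "author", "datePublished"}


def partition_properties(found):
    """Split found property names into baseline, extensions, and missing."""
    found_set = set(found)
    return {
        # Guaranteed to be understood by every conforming consumer.
        "baseline": sorted(found_set & BASELINE_PROPERTIES),
        # Safe to include, but not every consumer will understand them.
        "extensions": sorted(found_set - BASELINE_PROPERTIES),
        # Baseline properties the document failed to provide.
        "missing": sorted(BASELINE_PROPERTIES - found_set),
    }
```

A document that adds funder on top of the baseline would simply show up
with funder under "extensions", and a baseline-only tool could still
process it.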

[0] https://github.com/scienceai/scholarly-article/

-- 
• Robin Berjon - http://berjon.com/ - @robinberjon
• http://science.ai/ — intelligent science publishing

Received on Tuesday, 1 December 2015 04:14:18 UTC