Re: "Show me the metadata!" :), was Re: Rough sketch for WP from Ivan Herman on 2016-09-26 (public-digipub-ig@w3.org from September 2016)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 26 Sep 2016 18:25:56 +0200
To: Robin Berjon <robin@berjon.com>
Cc: Tzviya Siegman <tsiegman@wiley.com>, Marcos Caceres <marcos@marcosc.com>, Baldur Bjarnason <baldur@rebus.foundation>, Dave Cramer <dave.cramer@hbgusa.com>, Michael Smith <mike@w3.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>, Peter Krautzberger <peter.krautzberger@mathjax.org>
Message-Id: <02603B89-9276-481E-BC14-56D341BCC914@w3.org>

> On 26 Sep 2016, at 18:13, Robin Berjon <robin@berjon.com> wrote:
> 
> On 26/09/2016 11:44 , Siegman, Tzviya - Hoboken wrote:
>> 3. In the scholarly publishing world, the line between content and
>> metadata is further blurred. It might be obvious to those of us in
>> the world of HTML that the title of an article should be tagged as
>> <h1>, but what about the subtitle? How do I tag author names? All of
>> this information must be displayed, not just tagged. How do I tag
>> this information in a way that makes it searchable in the NIH
>> database? This might not sound unique to Digital Publishing and look
>> a lot like issues that have plagued bloggers and those who have
>> pondered the outline algorithm for years. We welcome those solutions
>> and hope to build on them. But, I'd like to outline the kind of
>> complexities that we face and would be happy to show sample files in
>> a smaller setting. For now, most publishers work with a model that is
>> compliant with the JATS tag suite [3]. You'll notice that this is
>> XML, which is fine, but for it to work on a website, there has to be
>> a transform to something else (HTML, PDF, etc). That something else
>> has no standardization. You'll also notice that the <article-meta>
>> and the article are separate. This means that some basic information,
>> like the title, gets repeated. That is kind of annoying. Metadata in
>> this world also includes rather detailed information such as author
>> affiliations. Does this mean the affiliation of the author at the
>> time of publication? What happens if the author transfers from one
>> university to another during the peer review process? Should the
>> affiliation change in the article at the time of publication? This
>> requires more than just an element or property in HTML. I don't think
>> we should attempt to make decisions about this level of granularity,
>> but we should make it possible for publishers and authors to do so. I
>> would be happy to talk to you about how we deal with this at my
>> company (Wiley). Another issue that I suspect is near and dear to the
>> hearts of many is how to convey whether an article is open access and
>> what type of access is allowed.  Wouldn't you prefer to know about
>> that if asked to review an article for one of the evil publishers?
> 
> One thing that I have found helpful (for people like Marcos and myself)
> when trying to make sense of requirements in the scholarly world is to
> think about the manner in which it is handled for W3C standards.
> 
> The way the W3C does it nowadays would likely send many scholarly
> publishers screaming, but that is where its ancestry lies. We have
> titles and subtitles (the latter often marked up the wrong way), authors
> and affiliations with some loose conventions to handle changes of
> affiliations over geologic^Wstandards time, levels of review, alternate
> formats and translations, and an abstract (the content of which is
> rarely an abstract as the Director will typically tell you during
> transition).
> 
> If you start from that and imagine that it is a radical modernisation of
> scholarly publishing with a lot more flexibility you're pretty close to
> the mark.

That is a good comparison. The only major difference is that the career of the editor of a W3C spec does not really depend on being very precise about these things, so some fuzziness is all right. The very livelihood of scholarly authors, by contrast, depends on having ALL their publications accounted for by, say, Google Scholar or similar tools, in spite of career changes or different transcriptions of their names into Latin characters (if they are Russian or Chinese, for example), and on not being mixed up with the other John Smith who publishes in a totally different field. Hence sloppiness is much less acceptable...

> 
> To address metadata encoding more specifically, it shouldn't come as a
> surprise to some here that I would advocate for schema.org as a
> sensible, widely deployed and developer-adopted option. Maybe some of
> the work that we've done with Scholarly HTML ought to be applied more
> generally (with some scholarly specifics such as using
> `hasDigitalDocumentPermission` to mark open access)? Some of the
> modelling is a bit indirect (for instance affiliations are indirected
> precisely because they are ephemeral) but a lot of it is generic enough.
> It can be used in a manifest as JSON-LD, which could be sweet.

A very dear friend and colleague used to say: "I know the jungle and therefore I am afraid of the jungle." :-) The metadata world is incredibly messy, and I do not think we should make any decisions about which vocabularies, etc., a WP would use, except maybe for the absolutely core 3-4 terms. Otherwise let the libraries, publishers, scholars of all types, etc., fight this war. What we should have, in my view, is a clearly identified pointer (reference, URI, whatever) of some sort to the place where a publisher/author keeps his/her metadata in a preferred format. As far as I am concerned, we should stop there…
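
To make the "core terms plus a pointer" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the key names ("name", "author", "language", "metadata", "href", "type"), the example.org URLs and the helper function are hypothetical and do not come from any WP draft or existing specification. The only point is that the manifest itself carries a handful of terms, while richer records live elsewhere in whatever format the publisher prefers.

```python
import json

# Hypothetical manifest for a Web Publication (WP). None of these key names
# come from a specification; they are placeholders for illustration only.
manifest = {
    # The "absolutely core" terms, kept deliberately few.
    "name": "An Example Article",
    "author": "Jane Doe",
    "language": "en",
    # Pointers to metadata kept elsewhere, each in the publisher's preferred format.
    "metadata": [
        {"href": "https://example.org/article/meta.jsonld", "type": "application/ld+json"},
        {"href": "https://example.org/article/meta.xml", "type": "application/xml"},
    ],
}


def metadata_pointers(manifest, media_type=None):
    """Return the metadata URLs listed in the manifest, optionally filtered by media type."""
    return [
        link["href"]
        for link in manifest.get("metadata", [])
        if media_type is None or link["type"] == media_type
    ]


if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print(metadata_pointers(manifest, "application/ld+json"))
```

A consumer that understands JSON-LD could follow the first pointer and one that understands a JATS-like XML record could follow the second; the manifest does not need to know what is inside either.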

Cheers

Ivan


> 
> --
> • Robin Berjon - http://berjon.com/ - @robinberjon
> • http://science.ai/ — intelligent science publishing


----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704

Received on Monday, 26 September 2016 16:26:13 UTC