
RE: "Show me the metadata!" :), was Re: Rough sketch for WP

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Tue, 27 Sep 2016 09:08:29 +0000
To: David Wood <david.wood@ephox.com>, Ivan Herman <ivan@w3.org>
CC: Robin Berjon <robin@berjon.com>, Tzviya Siegman <tsiegman@wiley.com>, Marcos Caceres <marcos@marcosc.com>, Baldur Bjarnason <baldur@rebus.foundation>, Dave Cramer <dave.cramer@hbgusa.com>, "Michael Smith" <mike@w3.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>, Peter Krautzberger <peter.krautzberger@mathjax.org>
Message-ID: <CY1PR0601MB1422711D183DB5279F1D115CDFCC0@CY1PR0601MB1422.namprd06.prod.outlook.com>

I agree, presuming that one possible choice is to put the metadata in the publication (though by far the more common and preferred practice would be to maintain it externally to the publication). This is consistent with current actual practice regarding metadata in most (but not all) sectors of publishing.

--Book publishing metadata is overwhelmingly managed and disseminated separately from publication files, typically as ONIX feeds, as many people have already observed.

--Journal article metadata is commonly embedded in the article XML (typically JATS XML or, previously, NLM XML), in the "metadata header" that model provides. This is a very longstanding and virtually universal practice in scholarly publishing. As Tzviya observed, this does create a certain level of ambiguity or duplication between things like titles, contributor names, affiliations, etc. that are both metadata and rendered content. But technically, in the JATS XML, these things are metadata, and a rendering system needs to go fetch them from that place when rendering an article. (She also appropriately pointed out that this involves a transform to some other model; you don't actually render JATS, you render PDF or HTML.) So it is very useful for these things (titles, contributor names, affiliations, etc.) to be part of the article's XML. Nevertheless, that same metadata (or portions of it) is also extracted and used separately from the article files for purposes such as registering the article with Crossref to obtain a Crossref DOI, managing the article in the journal's hosting platform, etc.

--News is published today in an enormous volume of small chunks of content. It is essential in that world for metadata to be embedded in the content and to be machine readable because it has to travel with the content as it is duplicated and disseminated widely. (The AP, just one news organization, creates 250,000 individual publishable assets every day.) That doesn't mean it isn't also maintained in a master metadata database. So this particularly raises the issue of where the authoritative and up-to-date metadata resides. News organizations need to be very conscious of which metadata is stable, and thus embeddable, and which metadata is volatile, and thus should be maintained externally. The externally maintained master always wins; that master metadata can be embedded _from time to time in its then-current state_ when content is interchanged at specific times for specific purposes.

--Educational publications are in the midst of a transition from the book-like model to the news-like model: from big monolithic publications (which still exist, of course) to huge collections of components that can be distributed independently and recombined in various ways.

--Magazines are making the same transition, though the "publications" are smaller and less monolithic than a big textbook and the "sub-publications," if you will (I resist characterizing them as components because they are in themselves complex publications, not typically a single file), are more stable once created (less slicing and dicing and recombining than goes on in the educational publishing world).
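
To make the journal-article case above concrete, here is a minimal Python sketch of pulling the title and contributor names out of a JATS-style metadata header so they can be reused outside the article file (for example, in a registration feed). The XML is an invented, heavily simplified fragment rather than a real article, and `extract_article_meta` is a hypothetical helper name; real JATS headers carry far more detail (affiliations, identifiers, dates, and so on).

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified JATS-style fragment for illustration only.
SAMPLE_JATS = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Lee</surname><given-names>Wei</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body><p>Rendered content goes here.</p></body>
</article>
"""


def extract_article_meta(jats_xml: str) -> dict:
    """Return the title and author names found in the <article-meta> header."""
    root = ET.fromstring(jats_xml)
    meta = root.find("./front/article-meta")
    if meta is None:
        raise ValueError("no <article-meta> header found")
    title = meta.findtext("./title-group/article-title", default="").strip()
    authors = []
    # Only contributors explicitly marked as authors are collected.
    for contrib in meta.findall("./contrib-group/contrib[@contrib-type='author']"):
        given = contrib.findtext("./name/given-names", default="").strip()
        surname = contrib.findtext("./name/surname", default="").strip()
        authors.append(f"{given} {surname}".strip())
    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(extract_article_meta(SAMPLE_JATS))
```

The same header that a rendering transform reads is the one this function reads, which is the duplication Tzviya described: the title is both metadata and displayed content.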

BTW, these are front-and-center issues for the Permissions and Obligations WG (POE WG) as they expand their use cases and refine ODRL.

I hope this is interesting, and you don't think I've wasted my breath because really all I've done is confirm the importance of maintaining the metadata independently of the publication but enabling it to be embedded in it.
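
As a rough illustration of "maintained externally, embedded in its then-current state," the Python sketch below splits a master record into stable fields and a time-stamped snapshot of the volatile ones. Everything in it is an assumption made for the example: the field names, the `STABLE_FIELDS` split, and the `embed_snapshot` helper are hypothetical and do not describe any news organization's actual system.

```python
from datetime import datetime, timezone

# Hypothetical split: fields considered stable enough to travel with content.
STABLE_FIELDS = {"id", "headline", "byline", "first_published"}


def embed_snapshot(master_record: dict, as_of: datetime) -> dict:
    """Build the metadata block to embed in an outgoing copy of the content.

    Stable fields are copied as-is. Volatile fields are copied too, but only
    as a snapshot stamped with `as_of`, so a recipient can see how old the
    values are; the external master record remains authoritative.
    """
    stable = {k: v for k, v in master_record.items() if k in STABLE_FIELDS}
    volatile = {k: v for k, v in master_record.items() if k not in STABLE_FIELDS}
    return {
        "stable": stable,
        "snapshot": {"as_of": as_of.isoformat(), "values": volatile},
    }


if __name__ == "__main__":
    master = {
        "id": "asset-001",
        "headline": "Example headline",
        "byline": "A. Reporter",
        "first_published": "2016-09-27",
        "usage_rights": "embargoed",  # volatile: may change after interchange
    }
    when = datetime(2016, 9, 27, 9, 0, tzinfo=timezone.utc)
    print(embed_snapshot(master, when))
```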

Bill Kasdorf

VP and Principal Consultant | Apex CoVantage


734-904-6252  m:   734-904-6252

ISNI: http://isni.org/isni/0000000116490786

ORCiD: https://orcid.org/0000-0001-7002-4786

From: David Wood [mailto:david.wood@ephox.com]
Sent: Tuesday, September 27, 2016 4:18 AM
To: Ivan Herman
Cc: Robin Berjon; Tzviya Siegman; Marcos Caceres; Baldur Bjarnason; Dave Cramer; Michael Smith; W3C Digital Publishing IG; Peter Krautzberger
Subject: Re: "Show me the metadata!" :), was Re: Rough sketch for WP

Hi all,

On Mon, Sep 26, 2016 at 5:25 PM, Ivan Herman <ivan@w3.org> wrote:

> On 26 Sep 2016, at 18:13, Robin Berjon <robin@berjon.com> wrote:
> On 26/09/2016 11:44 , Siegman, Tzviya - Hoboken wrote:
>> 3. In the scholarly publishing world, the line between content and
>> metadata is further blurred. It might be obvious to those of us in
>> the world of HTML that the title of an article should be tagged as
>> <h1>, but what about the subtitle? How do I tag author names? All of
>> this information must be displayed, not just tagged. How do I tag
>> this information in a way that makes it searchable in the NIH
>> database? This might not sound unique to Digital Publishing and look
>> a lot like issues that have plagued bloggers and those who have
>> pondered the outline algorithm for years. We welcome those solutions
>> and hope to build on them. But, I'd like to outline the kind of
>> complexities that we face and would be happy to show sample files in
>> a smaller setting. For now, most publishers work with a model that is
>> compliant with the JATS tag suite [3]. You'll notice that this is
>> XML, which is fine, but for it to work on a website, there has to be
>> a transform to something else (HTML, PDF, etc). That something else
>> has no standardization. You'll also notice that the <article-meta>
>> and the article are separate. This means that some basic information,
>> like the title, gets repeated. That is kind of annoying. Metadata in
>> this world also includes rather detailed information such as author
>> affiliations. Does this mean the affiliation of the author at the
>> time of publication? What happens if the author transfers from one
>> university to another during the peer review process? Should the
>> affiliation change in the article at the time of publication? This
>> requires more than just an element or property in HTML. I don't think
>> we should attempt to make decisions about this level of granularity,
>> but we should make it possible for publishers and authors to do so. I
>> would be happy to talk to you about how we deal with this at my
>> company (Wiley). Another issue that I suspect is near and dear to the
>> hearts of many is how to convey whether an article is open access and
>> what type of access is allowed.  Wouldn't you prefer to know about
>> that if asked to review an article for one of the evil publishers?
> One thing that I have found helpful (for people like Marcos and myself)
> when trying to make sense of requirements in the scholarly world is to
> think about the manner in which it is handled for W3C standards.
> The way the W3C does it nowadays would likely send many scholarly
> publishers screaming, but that is where its ancestry lies. We have
> titles and subtitles (the latter often marked up the wrong way), authors
> and affiliations with some loose conventions to handle changes of
> affiliations over geologic^Wstandards time, levels of review, alternate
> formats and translations, and an abstract (the content of which is
> rarely an abstract as the Director will typically tell you during
> transition).
> If you start from that and imagine that it is a radical modernisation of
> scholarly publishing with a lot more flexibility you're pretty close to
> the mark.
That is a good comparison. The only major difference is that the career of an editor of a W3C spec does not really depend on being very precise on these things, so some fuzziness is all right, whereas the very livelihood of scholarly authors depends on having ALL their publications accounted for by, say, Google Scholar or similar tools, in spite of career changes, different transcriptions of their name into Latin characters if they are Russian or Chinese, not to be mixed up with the other John Smith who publishes in something totally different, etc. Hence sloppiness is much less acceptable...

> To address metadata encoding more specifically, it shouldn't come as a
> surprise to some here that I would advocate for schema.org as a
> sensible, widely deployed and developer-adopted option. Maybe some of
> the work that we've done with Scholarly HTML ought to be applied more
> generally (with some scholarly specifics such as using
> `hasDigitalDocumentPermission` to mark open access)? Some of the
> modelling is a bit indirect (for instance affiliations are indirected
> precisely because they are ephemeral) but a lot of it is generic enough.
> It can be used in a manifest as JSON-LD, which could be sweet.

A very dear friend and colleague used to say: "I know the jungle and therefore I am afraid of the jungle." :-) The metadata world is incredibly messy, and I do not think we should make any decisions as to which vocabularies, etc., a WP would use, except maybe for the absolutely core 3-4 terms. Otherwise let the libraries, publishers, scholars of all types, etc., fight this war. What, in my view, we should have is a clearly identified pointer (reference, URI, whatever) of some sort where a publisher/author could place his/her metadata in a preferred format. As far as I am concerned, we should stop there…

Lest it be lost in a long paragraph, Ivan is proposing to reference most metadata by URL or other identifier so that it could be held wherever a publisher wishes. Hopefully a reader, or the reader's software, could follow the reference to get the metadata if they wish.

Further, the details of the metadata and the vocabularies used within it would be considered outside the scope of this group.

+1 to that approach.
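
A minimal sketch of what that might look like, assuming a JSON manifest that carries only a handful of core terms plus a pointer to externally held metadata. No WP manifest format is defined in this thread, so every key here is hypothetical: `name`, `author`, and `inLanguage` are borrowed from schema.org, while `metadata`, `url`, and `format` are invented for the example and are not schema.org terms. The URL is a placeholder.

```python
import json
from typing import Optional, Tuple

# Hypothetical manifest: a few core terms plus a pointer to the full record.
MANIFEST_JSON = """
{
  "@context": "http://schema.org",
  "@type": "CreativeWork",
  "name": "An Example Publication",
  "author": {"@type": "Person", "name": "Jane Smith"},
  "inLanguage": "en",
  "metadata": {
    "url": "https://publisher.example/metadata/example-publication.xml",
    "format": "ONIX"
  }
}
"""


def metadata_reference(manifest: dict) -> Optional[Tuple[str, str]]:
    """Return (url, format) for the external metadata, or None if absent.

    The manifest only points at the metadata; which vocabulary the target
    uses is left entirely to the publisher.
    """
    pointer = manifest.get("metadata")
    if not isinstance(pointer, dict) or "url" not in pointer:
        return None
    return pointer["url"], pointer.get("format", "unknown")


if __name__ == "__main__":
    manifest = json.loads(MANIFEST_JSON)
    print(metadata_reference(manifest))
```

A reader's software could follow the returned URL if it wants the full record and ignore it otherwise; nothing in the manifest constrains what it finds there.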

David Wood



> --
> • Robin Berjon - http://berjon.com/ - @robinberjon
> • http://science.ai/ — intelligent science publishing

Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/

mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704

Received on Tuesday, 27 September 2016 09:09:04 UTC