
RE: "Show me the metadata!" :), was Re: Rough sketch for WP

From: Siegman, Tzviya - Hoboken <tsiegman@wiley.com>
Date: Mon, 26 Sep 2016 15:44:02 +0000
To: Marcos Caceres <marcos@marcosc.com>, Baldur Bjarnason <baldur@rebus.foundation>, Ivan Herman <ivan@w3.org>
CC: "Cramer, Dave" <dave.cramer@hbgusa.com>, Michael Smith <mike@w3.org>, "W3C Digital Publishing IG" <public-digipub-ig@w3.org>, Peter Krautzberger <peter.krautzberger@mathjax.org>
Message-ID: <5e6b1d5cce634f7b8c6484fd2189b9c8@AUS-WNMBP-005-n.wiley.com>

Hi Marcos,

In addition to what Baldur wrote about some metadata that might be familiar, I want to highlight a few other issues:

There are a lot of things that are called "metadata" in publishing and in things that are more traditionally Webby.

1. In the EPUB space, we have allowed for a lot of things to be included in the "package". As Baldur pointed out, that's really nice, but it doesn't do much for us. Most metadata picked up by reading systems and reading apps is provided by some other source. Much of this in the book industry is what we call supply chain metadata. For books, this is usually captured in the ONIX format [1], which has more information than we would ever want to touch in a WP (or whatever we call it). Most of it is not relevant to what we are doing here. For example, the publisher or author sets the price on a publication in their ONIX feed, and that information is available long before the book is published. That is one of the mechanisms that enables you, the buyer/reader/user, to purchase the book before it is actually published. The publisher also has the ability to change the price because the author is speaking at a conference, so the publisher releases a minor update to the ONIX file, not the publication. After the conference, the feed is updated again, and the price changes again.

2. In EPUB 3.1, we worked on distilling the "fundamental" metadata that MUST be in the actual EPUB Package. We were able to come to consensus around a surprisingly large number of fields. Many of us thought that the only information a reading system (user agent) would need is Identifier and last modified date (a method of local versioning). Based on the identifier, the system could reach into associated metadata, such as ONIX or something else, and get everything else needed. We discovered that many systems rely on a lot more for very immediate local rendering [2]. Is this ideal? Perhaps not, but it is what exists in the world today. In talking to the developers we found that this is largely because the line between content and metadata is a thin one. Is a cover content? Is it metadata? Is the title of a publication content? Users want to access the covers, titles, subject categories immediately and as part of their personal UX. Language information is perhaps the piece of metadata that I overlooked most. If I don't know that my book is in a language or dialect that I can understand, how can I sort my library? This may indeed be information supplied by an external file, but UAs grab it as part of the content.

3. In the scholarly publishing world, the line between content and metadata is further blurred. It might be obvious to those of us in the world of HTML that the title of an article should be tagged as <h1>, but what about the subtitle? How do I tag author names? All of this information must be displayed, not just tagged. How do I tag this information in a way that makes it searchable in the NIH database? This might not sound unique to Digital Publishing and look a lot like issues that have plagued bloggers and those who have pondered the outline algorithm for years. We welcome those solutions and hope to build on them. But I'd like to outline the kind of complexities that we face and would be happy to show sample files in a smaller setting. For now, most publishers work with a model that is compliant with the JATS tag suite [3]. You'll notice that this is XML, which is fine, but for it to work on a website, there has to be a transform to something else (HTML, PDF, etc.). That something else has no standardization. You'll also notice that the <article-meta> and the article are separate. This means that some basic information, like the title, gets repeated. That is kind of annoying. Metadata in this world also includes rather detailed information such as author affiliations. Does this mean the affiliation of the author at the time of publication? What happens if the author transfers from one university to another during the peer review process? Should the affiliation change in the article at the time of publication? This requires more than just an element or property in HTML. I don't think we should attempt to make decisions about this level of granularity, but we should make it possible for publishers and authors to do so. I would be happy to talk to you about how we deal with this at my company (Wiley). Another issue that I suspect is near and dear to the hearts of many is how to convey whether an article is open access and what type of access is allowed. Wouldn't you prefer to know about that if asked to review an article for one of the evil publishers?
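
To make point 2 a bit more concrete, here is a rough Python sketch of what "identifier plus last modified date, and a few fields for immediate display" could look like when read from an EPUB package document. It is only an illustration, not how any particular reading system works. The sample package, its ISBN, title, and date, and the function name read_core_metadata are all invented for the example; only the element names (dc:identifier, dc:title, dc:language, and the dcterms:modified meta property) come from the EPUB package vocabulary.

```python
import xml.etree.ElementTree as ET

# Namespaces used by EPUB 3 package documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample package document; the values are placeholders.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf"
         version="3.1" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:isbn:9780000000002</dc:identifier>
    <dc:title>An Invented Title</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2016-09-26T15:44:02Z</meta>
  </metadata>
</package>"""


def read_core_metadata(opf_xml):
    """Hypothetical helper: return the fields a UA might need up front."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)

    # The package's unique-identifier attribute points at one dc:identifier.
    uid_ref = root.get("unique-identifier")
    identifier = None
    for el in metadata.findall("dc:identifier", NS):
        if el.get("id") == uid_ref and el.text:
            identifier = el.text.strip()

    # The last-modified date is carried in a <meta property="dcterms:modified">.
    modified = None
    for el in metadata.findall("opf:meta", NS):
        if el.get("property") == "dcterms:modified" and el.text:
            modified = el.text.strip()

    def first_text(tag):
        el = metadata.find(tag, NS)
        return el.text.strip() if el is not None and el.text else None

    return {
        "identifier": identifier,        # key for looking up external metadata
        "modified": modified,            # local versioning
        "title": first_text("dc:title"),       # wanted immediately for display
        "language": first_text("dc:language"), # wanted for sorting a library
    }


core = read_core_metadata(SAMPLE_OPF)
print(core)
```

In this sketch the identifier and modified date are enough to key a lookup into an external record (ONIX or something else), while the title and language are the sort of thing UAs tend to grab locally anyway.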


[1] http://www.editeur.org/14/code-lists/

[2] http://www.idpf.org/epub/31/spec/epub-packages-20160906.html#sec-pkg-metadata

[3] http://jats.niso.org/ 

Tzviya Siegman
Information Standards Lead

-----Original Message-----
From: Marcos Caceres [mailto:marcos@marcosc.com] 
Sent: Friday, September 23, 2016 6:05 AM
To: Baldur Bjarnason; Ivan Herman
Cc: Cramer, Dave; Michael Smith; W3C Digital Publishing IG; Peter Krautzberger
Subject: Re: "Show me the metadata!" :), was Re: Rough sketch for WP

On September 23, 2016 at 3:12:33 AM, Baldur Bjarnason (baldur@rebus.foundation) wrote:
> I figured I could do a quick overview of how the existing ePub 
> ecosystem handles metadata, with screenshots. This focuses on trade publishing so YMMV.

This is awesome! Thank you.  

> The first thing to note is that the information below applies only to side-loaded ebooks.  
> For ebooks bought through the retail channel, retail-level metadata 
> often overrides embedded metadata.
> And it does so for a good reason:
> A large portion of the trade publishing industry generally doesn’t 
> embed any metadata to speak of, beyond just the title and author. At 
> least one of the big trade publishers as a policy doesn’t even embed the cover in the ebook.
> This, in turn, is also for a good reason:
> Embedded metadata is a huge pain to manage once you have more than a 
> few titles. This is caused by the very portable and packaged nature of 
> ePubs (packaged files are a logistical nightmare at scale). Packaged 
> files combined with a clunky retail distribution system means that an ebook, once made, is hardly ever altered or updated, no matter what happens.
> This makes managing their embedded metadata very tricky once you’ve 
> got more than a few titles.

This should be reflected in the use cases document.  

> Retail metadata is generally much easier. If your catalogue is big, 
> you basically give the retailer an XML file with all of the metadata 
> for your catalogue (usually ONIX, though other formats might be in use 
> somewhere). If your catalogue is small, the web UIs for managing 
> retail metadata are generally better designed and more easily 
> understandable than using existing ePub editors or (shudder) hacking 
> the ePub’s XML and then re-uploading the packaged file to all of the 
> various retail websites you are selling through. Having minimal embedded metadata and focusing on retail metadata makes a lot of logistical sense for publishers.
> Finally, most reading systems don’t offer the level of metadata detail 
> in their displays as Readium does, even the systems that are based on 
> Readium (such as ADE for iOS). That’s because the reading systems tend to care more about reader-oriented metadata (e.g.
> organising their books into collections) or contextual data (e.g. 
> selling you some other book to read) than about publisher supplied 
> metadata. Partly this is a side effect of how retail-oriented 
> publisher-supplied metadata is: it is of little use to a reader organising their collection.

So, depending on how we focus this effort, we should focus on metadata that is primarily useful to users to manage their collections. 

Would others agree?

> Anyway, the first two examples are from Marvin, which caters to power 
> users. The example book is The Bleeding Edge by Bob Hughes. You’ll 
> note that despite their focus on expert users, the app doesn’t expose ISBNs or URNs. It does let the user edit the metadata, though.

That's pretty neat. Something that could be done in app... but if we do find critical metadata for users, they should be able to manage it (possibly allow the app to manage it)... that's a good requirement right there. 

> The next one is from Kobo, which failed at side-loading Bob’s book, 
> despite multiple attempts, so I’m using an example from a purchased book, Cat’s Eye, by Margaret Atwood.
> Next is the only example that exposes the URN/ISBN: ADE for iOS.
> Then there’s the most metadata rich display I found in iBooks (doesn’t show much at all).  
> The best designed metadata UI I found was in Aldiko for iOS.
> Gerty, by the creator of Marvin, has a very user-oriented metadata view.
> Finally, Bluefire, which has the same level of detail as Aldiko in a 
> slightly clunkier UI.

Thanks again! These should definitely end up in the use cases doc! 

Received on Monday, 26 September 2016 15:44:38 UTC
