RE: schema.org and ONIX... from Bill Kasdorf on 2014-04-09 (public-digipub-ig@w3.org from April 2014)

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Wed, 9 Apr 2014 14:48:01 +0000
To: "Madans, Phil" <Phil.Madans@hbgusa.com>, Ivan Herman <ivan@w3.org>
CC: Luc Audrain <LAUDRAIN@hachette-livre.fr>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <81aac59516ac4636988d90b9609f7571@CO2PR06MB572.namprd06.prod.outlook.com>
Thanks, Phil, very helpful as always.

This thread has turned a lightbulb on for me (in fact a couple).

First: we are really talking about two distinctly different use cases here:
--Transmitting publication-level metadata (for which a subset of ONIX in schema.org is what we are looking at doing).
--Embedding subject metadata at all levels in a publication to make it discoverable and enable drilling down to points within a publication based on that subject metadata (for which I was suggesting Thema would be the place to start, using the schema.org mechanism).

Those are really two related but different things, and I think they are both important to do.

Thanks so much for your clarification on Thema! I have not studied it, and I had always understood it to be much simpler than BISAC. Glad to be corrected on that!

One important issue we have here is what I think of as the "comprehensive vs. concise" dilemma.

--I personally always gravitate to "comprehensive" solutions, e.g. "publishers want more precise descriptions, which require much more extensive vocabularies"; "different types of publishers use different schemes and vocabularies (most of them extensive for the above reason) and we need to let them do that"; "keywords, without a controlled vocabulary, are something many publishers want to use"; etc. Let a thousand flowers bloom! ;-) (AKA "good luck with that.")

--The problem is that from the point of view of any receiving system, this quickly becomes unworkable. Systems want things that are clear, specific, and simple so that functionality can be reliably delivered in a programmatic fashion. That's why schema.org vocabularies are typically so much more bare-bones than the vocabularies used by the various interest groups (book publishers, magazine publishers, educational publishers, news publishers, journal publishers, etc.). The receiving system says "don't tell me what I _might_ get, tell me, if you want me to do X, what I _will_ get."

A classic example for which I must assume at least part of the blame: the metadata model in EPUB 3. That model can actually _already_ express all of the above. No problem. It's already there. But guess what? No reading system that I know of actually does _anything_ with that metadata. Being Mr. Idealistic, I still hope they will. And within certain closed systems (known sources, known recipients, agreed-upon process and vocabulary) it can work just fine. But if I had held my breath for our wonderful <meta> and prefix mechanism to get any actual use in the real world I would have been dead long ago. ;-)
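
To make the <meta> and prefix point concrete, here is a minimal sketch of what a receiving system would have to do to consume such metadata: read the prefix declarations on the package element and expand each <meta property="..."> into a full IRI. The sample package document, the "ex" prefix, its IRI, and the "subjectCode" property are all hypothetical and stand in for whatever vocabulary a publisher might use. Only "dcterms" is treated as a predefined prefix here, which is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical package document; the "ex" vocabulary is invented for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0"
         unique-identifier="pub-id"
         prefix="ex: http://example.org/vocab#">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Hypothetical Title</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2014-04-09T00:00:00Z</meta>
    <meta property="ex:subjectCode">HYPOTHETICAL-CODE</meta>
  </metadata>
</package>"""


def parse_prefixes(prefix_attr):
    """Turn 'ex: http://... foo: http://...' into {'ex': 'http://...', ...}."""
    tokens = prefix_attr.split()
    return {tokens[i].rstrip(":"): tokens[i + 1]
            for i in range(0, len(tokens) - 1, 2)}


def expand_meta(opf_text):
    """Return (expanded property IRI, value) pairs for each <meta property=...>."""
    root = ET.fromstring(opf_text)
    # Assumption: only "dcterms" is treated as predefined in this sketch.
    prefixes = {"dcterms": "http://purl.org/dc/terms/"}
    prefixes.update(parse_prefixes(root.get("prefix", "")))

    pairs = []
    for meta in root.iter(f"{{{OPF_NS}}}meta"):
        prop = meta.get("property")
        if not prop:
            continue
        pfx, sep, term = prop.partition(":")
        base = prefixes.get(pfx) if sep else None
        # Leave the property untouched if its prefix is unknown or absent.
        iri = base + term if base else prop
        pairs.append((iri, (meta.text or "").strip()))
    return pairs


META_PAIRS = expand_meta(SAMPLE_OPF)
```

Even in this small form, the receiving system still has to know what to do with each expanded property, which is the "tell me what I _will_ get" problem described above.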

--Bill K

-----Original Message-----
From: Madans, Phil [mailto:Phil.Madans@hbgusa.com] 
Sent: Wednesday, April 09, 2014 10:09 AM
To: Bill Kasdorf; Ivan Herman
Cc: Luc Audrain; W3C Digital Publishing IG
Subject: RE: schema.org and ONIX...

As far as a separate meeting to discuss this goes, I am out most of next week but have some availability Tuesday and Wednesday. Otherwise I'll be back on the 22nd and free after that.

A couple of other points. ONIX is a message transmitted among trading partners, so it does mostly reside in those databases. Also, ONIX needs to be parsed. A lot of the data is transmitted using code lists, including BISAC Categories. You can send the literals if you want, of course. One of the issues with ONIX is that ONIX records vary wildly by sender. By the way, ONIX for Books is only one of the available ONIX messages. There are also ONIX for Subscription Products and ONIX for Licensing Terms and Rights. But Bill's point is spot on. There is no metadata scheme used by Publishing as a whole.
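
To illustrate what "needs to be parsed" and "transmitted using code lists" mean in practice, here is a minimal sketch that reads subject entries from a simplified ONIX-style fragment and resolves the scheme code to a label. The fragment is hypothetical and omits the namespace and most of the structure of a real ONIX message. The three scheme codes in the lookup table are quoted from memory of the ONIX subject scheme code list and should be checked against the published list before use.

```python
import xml.etree.ElementTree as ET

# Small, assumed subset of the ONIX subject scheme code list (verify before use).
SUBJECT_SCHEMES = {
    "10": "BISAC Subject Heading",
    "20": "Keywords",
    "93": "Thema subject category",
}

# Hypothetical, heavily simplified fragment (no namespace, no header).
SAMPLE_ONIX = """<Product>
  <DescriptiveDetail>
    <Subject>
      <SubjectSchemeIdentifier>10</SubjectSchemeIdentifier>
      <SubjectCode>FIC000000</SubjectCode>
    </Subject>
    <Subject>
      <SubjectSchemeIdentifier>20</SubjectSchemeIdentifier>
      <SubjectHeadingText>hypothetical keyword; another keyword</SubjectHeadingText>
    </Subject>
  </DescriptiveDetail>
</Product>"""


def extract_subjects(xml_text):
    """Return one dict per <Subject>, with the scheme code resolved to a label."""
    root = ET.fromstring(xml_text)
    subjects = []
    for subject in root.iter("Subject"):
        scheme = subject.findtext("SubjectSchemeIdentifier", default="")
        subjects.append({
            "scheme": SUBJECT_SCHEMES.get(scheme, f"unknown scheme {scheme}"),
            "code": subject.findtext("SubjectCode"),          # coded value, if sent
            "text": subject.findtext("SubjectHeadingText"),   # literal, if sent
        })
    return subjects


SUBJECTS = extract_subjects(SAMPLE_ONIX)
```

The first entry carries only a code and the second only a literal, which is one small instance of how records can differ from sender to sender.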

Bill, maybe I misstated my thoughts on Thema. Thema is not more bare bones than BISAC categories. BISAC has 3822 codes. Thema has 2497 codes plus another 2000 qualifiers for geography, etc. (Thanks, Dave Cramer, for counting:)). It is actually more complex than BISAC in that sense. There are mappings from the existing Subject Classifications to Thema, but they are necessarily high level and so even less granular. This is not a push for BISAC by any means. I don't think any of the Subject Classifications are what we are looking for. They are all good for what they do; I just don't think they are what we want. That said, when we were talking in BISG about creating a new vocabulary more geared to online search, Google was mentioned as having a very good one, which makes sense. We never went further in the conversation and decided to create a Best Practice for Keyword creation instead--which should be published in the next month or two.

Keywords should be part of our discussion.  There is going to be a lot of activity around Keywords here in the US very shortly. Book Publishers are looking at Keywords to help search and discovery.

Phil

------------------------------------------------------------
Phil Madans | Executive Director of Digital Publishing Technology | Hachette Book Group | 237 Park Avenue NY 10017 |212-364-1415 | phil.madans@hbgusa.com

-----Original Message-----
From: Bill Kasdorf [mailto:bkasdorf@apexcovantage.com]
Sent: Tuesday, April 08, 2014 5:51 PM
To: Ivan Herman
Cc: Luc Audrain; W3C Digital Publishing IG
Subject: RE: schema.org and ONIX...

I will have to comment later on the meatier parts of this message, but:

--Re "We should not underestimate the amount of work": This is why I was suggesting starting with Thema. It is actually just a vocabulary for subject classifications, so it probably just pertains to an already-existing property of schema.org. What I was hearing from several of my interviews was the need to associate subject metadata below the level of the publication, which schema.org gets us (remember not all of these publications are "on the Web" though they should still be able to use the OWP). As Phil Madans pointed out, Thema is pretty "bare bones" compared to BISAC, but I would suggest that that's a virtue in this context. BISAC is so huge and complex that publishers often don't "get it right" and recipients like Bowker and Nielsen feel they have to "fix" it (Apex has done this work for both of them for many years). Thema can't describe things at as meaningful a level of detail but on the other hand it would be easy to implement and has the big virtue of being a long-needed global subject vocabulary. And compared to ONIX: well, there's another gigantic set of metadata; Thema is just one tiny slice of what is in ONIX. It's not an either/or; Thema (that is, subject classifications in general) is one of many things that ONIX accommodates, but ONIX is not the _only_ place Thema (or BISAC, or BIC, etc.) is used. Strikes me as a good place to start. PLUS (here's a big one): ONIX (as we are normally thinking about it) is just for BOOKS!!!! (It's supply chain metadata, a messaging format for the book supply chain.) I keep pointing out that we are talking about PUBLICATIONS. Journals and magazines and newspapers and corporate publications etc. don't know from ONIX, they have their own schemes. But I think Thema subject classifications might be useful to them as well (e.g. I have gotten IPTC interested in it; their news schemes are not the same thing).

--Re timing of a call: I'm back next Tuesday and am available the rest of next week and all the following week (gone again most of the last week of the month). My main concern is that I would prefer this NOT be discussed in detail in this coming Monday's call because I will not be able to join that one.

-----Original Message-----
From: Ivan Herman [mailto:ivan@w3.org]
Sent: Tuesday, April 08, 2014 5:32 PM
To: Bill Kasdorf
Cc: Luc Audrain; W3C Digital Publishing IG
Subject: Re: schema.org and ONIX...

Wow, I see I did strike a chord here:-) which is great.

On a very practical level: yes, I believe having a separate call discussing this would be good and useful. Like Bill, I am out this week; being at the WWW2014 conference in Seoul is obviously an obstacle (as an aside, I will speak about digital publishing this afternoon as well as on Friday at another local event, so I continue doing my preaching:-). I will also have some days off around the Easter weekend. When, roughly, could we have a call? We could set up a doodle if we have some available periods: next week, the week after, both, neither?

I cannot judge the THEMA/ONIX issue, I leave this to you guys. My question is different, though. Where do ONIX data reside these days? As I said, if they are hidden in databases only, then they are invisible to Google, hence schema.org may be useless. Put another way, are there enough pages on the Web, usually crawled by Google, that do or may include ONIX data? I would certainly hope so, but we have to be sure (and you have to tell me...).

Another point worth knowing about. When schema.org came about, it was focused on HTML pages that use microdata syntax to add schema.org terms (RDFa Lite followed after a while). This is of course possible, but, for many sites, this was a bit awkward: systems may have that type of metadata in databases with the HTML pages generated automatically, and artificially adding microdata to the pages was an extra hassle. As a result, about a year ago, schema.org added the possibility to add JSON-LD into an HTML page using a special <script> tag. That made life much easier for such systems and I suspect that this is also something that this industry may take advantage of. (Schema.org has recently renewed their pages with examples in three syntaxes everywhere; e.g., scroll to the bottom of [1].)
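
As an illustration of the JSON-LD route described above, here is a minimal sketch of how a system that keeps its metadata in a database might generate the <script> block while rendering a page. The record layout, the function names, and the sample values (including the placeholder ISBN) are assumptions for illustration only. The schema.org terms used (Book, name, author, isbn, keywords) are existing ones, but which properties a publisher would actually want to emit is exactly what is under discussion in this thread.

```python
import json


def book_to_schema_org(record):
    """Map a hypothetical database record onto schema.org Book terms."""
    data = {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": record["title"],
        "author": {"@type": "Person", "name": record["author"]},
        "isbn": record["isbn"],
    }
    if record.get("keywords"):
        # schema.org "keywords" accepts text; a comma-separated string is used here.
        data["keywords"] = ", ".join(record["keywords"])
    return data


def to_script_tag(data):
    """Serialize the data as JSON-LD wrapped in the special <script> element."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a value can never close the script element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical record, as it might come out of a publisher's title database.
RECORD = {
    "title": "Hypothetical Title",
    "author": "A. N. Author",
    "isbn": "978-0-00-000000-2",  # placeholder, not a real book
    "keywords": ["sample keyword", "another keyword"],
}

SCRIPT_TAG = to_script_tag(book_to_schema_org(RECORD))
```

The resulting string can be dropped into the generated HTML page as-is, without touching the visible markup, which is the advantage over microdata for database-driven sites.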

Finally, we have to realize one more thing. The work to be done is not 'simply' to convert a mini-ONIX into schema.org. The work is to harmonize this, whenever possible, with what is already in schema.org (see [1] below) and add the missing properties and classes or modify the description of existing ones. We should not underestimate the amount of work...

Cheers

Ivan


[1] http://schema.org/Book



On 09 Apr 2014, at 24:10 , Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:

> Just going through the responses . . . and as for this one, regrettably, Luc, I will not be able to attend LBF this year. So if you've been looking for me, you can stop trying . . . ;-) but I would love to talk with you about this in any case. BTW I will have to miss the DPUB call next week.
>
> -----Original Message-----
> From: AUDRAIN LUC [mailto:LAUDRAIN@hachette-livre.fr]
> Sent: Tuesday, April 08, 2014 4:12 AM
> To: Ivan Herman
> Cc: Bill Kasdorf; W3C Digital Publishing IG
> Subject: Re: schema.org and ONIX...
>
> Hi Ivan and Bill,
>
> That's a very good exercise and I will share thoughts with Bill at London Book Fair if possible.
> I'm really interested, as I'm wondering what it will bring in terms of ebook discoverability on the Web beyond the ONIX feeds we already provide to distributors and digital bookstores.
>
> Best,
> Luc
>
>
>> Le 8 avr. 2014 à 05:24, "Ivan Herman" <ivan@w3.org> a écrit :
>>
>> Bill,
>>
>> I am currently at a Linked Data Workshop at a conference in Seoul, which had a keynote from R. Guha, who is, in some sense, the "father" of schema.org. Listening to him (combining also with my past experience), and also referring to the note I sent around earlier this morning[1], I am more and more serious in thinking that a stripped-down version of ONIX defined in schema.org might be a great idea. Of course, we have to see whether there is a business interest and business case for this: is there a use case for publishers as well as for search engines? But if the answer is yes on both, then this may be an important thing to do.
>>
>> I do know Guha personally relatively well, as well as Dan Brickley, who is the other person running schema.org's vocabulary development. I would be happy to make the links and go into the discussions but, of course, the question is whether publishers, as well as institutions like Bowker, would be interested in something like that. I think that clarifying this, i.e., setting up the use cases, would be perfectly in line with the IG's charter (although we probably would have to spawn a different group to make the specification itself, but that is all right.)
>>
>> What do you think?
>>
>> Ivan
>>
>> [1] http://www.publishersweekly.com/pw/by-topic/international/london-book-fair/article/61722-london-book-fair-2014-publishers-and-internet-standards.html
>>
>> ----
>> Ivan Herman, W3C
>> Digital Publishing Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> FOAF: http://www.ivan-herman.net/foaf
>>








Received on Wednesday, 9 April 2014 14:48:31 UTC