RE: Content semantic discussion from Bill Kasdorf on 2014-05-20 (public-digipub-ig@w3.org from May 2014)

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Tue, 20 May 2014 14:46:29 +0000
To: AUDRAIN LUC <LAUDRAIN@hachette-livre.fr>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <031d94d81df344f18cf3797914daaf84@CO2PR06MB572.namprd06.prod.outlook.com>
This is a fantastic summary and overview, Luc. I didn't think you were old enough to know all that history! ;-) I, for one, have personally lived through the entire evolution you described, and you nailed it.

My favorite line in your post is "a lossless semantic EPUB," and my favorite word in that line is "lossless." PRECISELY. I struggle with this with virtually every client I work with. It is tragic to see how much meaning is lost in most workflows--meaning that was there at one point, perhaps recreated (more work!) at a different point, and discarded in the end.

One reason you have such good perspective on this is that you deal with such diversity of content at Hachette. Within any of the disciplines or areas of interest you describe--"cooking, wine, travelling, gardening, education, law, dictionaries, and so on..."--there are vocabularies that are very meaningful and quite specific to each of them. Those are the terms that must be preserved. (Full disclosure: I stick 'em in @class attributes right now because there's no other place to put them, and I hope not to get pilloried or shot by those who consider that a no-no.)

I want to repeat a distinction I made on a call a couple of weeks ago (I think it was the DPUB call, but it might have been the EDUPUB call): we have a semantic problem with the way we use the term "semantics." It causes no end of confusion when we use that same term to refer to distinctions made about the _document_ and distinctions made about the _content_. I suggested we use these terms:

--Reserve the term "semantic enhancement" to mean characterization of the _content_. This is what microdata and RDFa are for; this goes all the way down to the phrase level for entities like events, names, places, etc. mentioned in the content.

--Use the term "structural refinement" for distinctions made about components of the _document_, quite apart from the content of those components.

I still think that's a good idea, but in defense of your continuing to stress how these things blur together, I will acknowledge that you're right about that, and give an obvious example. One of your examples was cooking. Cookbooks have chunks of content that are "ingredients." Typically they are structured as lists, and those lists have not only "ingredients" but "quantities" in them. And then there is typically a section describing the steps you would use to go about preparing that particular dish. And there are other semantically distinguished chunks as well, things like the cuisine, serving portions, level of difficulty, time it takes to prepare, etc. All that is critical information in the context of a cookbook. We shouldn't lose those semantic distinctions.

Here's where I think it is very helpful to make the distinction between semantic enhancement and structural refinement. That list of ingredients may look very much like a definition list. Semantically, it's not a definition list. But in many, many cases it would be marked up as a <dl> with the quantity in the <dt> and the ingredient in the <dd>. Do we need different names for those things? I would argue no, we just need a way to preserve what we _mean_ by them in this context. That's the semantic enhancement. (BTW this is apart from the issue of how well CSS enables us to format the damn thing.) I say no because (a) recipes are only one of many applications for this same construct (a lab manual will have many of the same structures as a cookbook, but the things aren't called by the same names), and because (b) not all recipes use the list format for ingredients (the classic example: Julia Child's _Mastering the Art of French Cooking_--an ironic example for me to give to a Frenchman ;-)--puts the ingredients as marginal notes alongside the instructions, which is a kind of "step list").

Separating the concepts of semantic enhancement and structural refinement enables the lab manual and the cookbook to use the same structure (for a list of ingredients with the quantities required for a recipe vs. a list of equipment with the capacities required for a procedure) but to call them by different names. It also enables two cookbooks to label "ingredient"s and "quantity"s while using completely different structures for them.
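
As a rough illustration of that separation, here is a minimal Python sketch in which a single <dl> structure is rendered twice and only the labels change. The function name render_pairs and all class names ("quantity", "ingredient", "capacity", "equipment", and the list classes) are hypothetical, and using @class to carry the labels is just the stopgap mentioned earlier in this message, not a recommendation.

```python
from html import escape


def render_pairs(pairs, term_class, desc_class, list_class):
    """Render (term, description) pairs as one <dl>; only the labels differ."""
    rows = []
    for term, desc in pairs:
        rows.append(f'  <dt class="{escape(term_class)}">{escape(term)}</dt>')
        rows.append(f'  <dd class="{escape(desc_class)}">{escape(desc)}</dd>')
    return f'<dl class="{escape(list_class)}">\n' + "\n".join(rows) + "\n</dl>"


# Same structure, two vocabularies (all class names are hypothetical).
recipe = render_pairs(
    [("250 g", "flour"), ("2", "eggs")],
    term_class="quantity", desc_class="ingredient", list_class="ingredients",
)
lab = render_pairs(
    [("500 mL", "beaker")],
    term_class="capacity", desc_class="equipment", list_class="equipment-list",
)

print(recipe)
print(lab)
```

The structural decision (a <dl> with <dt>/<dd> pairs) lives in one place, while the meaning is supplied by the caller, so a cookbook and a lab manual can share the former without sharing the latter.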

--Bill Kasdorf

From: AUDRAIN LUC [mailto:LAUDRAIN@hachette-livre.fr]
Sent: Tuesday, May 20, 2014 5:58 AM
To: W3C Digital Publishing IG
Subject: Content semantic discussion


Hi,



Following Ivan's request in one of the last calls, I've written the following text about my vision of "Content semantics".



Here it is for your comments.



Best,

Luc



________________________________

1)      Publishing eventually went to structured content

It took publishers years to establish structured content creation processes, and they eventually succeeded: printed books are now produced from, or in parallel with, database or XML resources.

It all started in the 80's with SGML, especially for heavily structured content (industry documentation) and, in publishing, for dictionaries and encyclopaedias, law books, ...

With the advent of XML in the late 90's, it grew to cover more and more types of content. Publishers have been writing DTDs for their book content as it became clear that composition and reusability would benefit from XML-enabled processes: composition because of efficient batch software to build hundreds of pages, reusability because of the ability to syndicate content from XML.

One example: for novels, Hachette Livre started in 2000 to ask composition suppliers to deliver, besides the print PDF, the XML file of the content using a dedicated text content DTD (similar to DocBook). No invalid XML file was accepted for archiving, and the supplier wasn't paid until the XML structure described the book content perfectly.

In parallel, the 90's saw the development of digital products, and, even before EPUB, CD-ROM applications and web sites could be built quickly through XML to HTML conversion.

Then came the smartphone and mobile apps, where developers asked publishers for the book content in XML to populate the app base.

This leads us to the present where we can say that structured content processes are mainstream, even if not completely generalized.


2)      Back to stone age in web based content?

So where is the problem?

Digital products can be built easily with the great Open Web Platform, using HTML and CSS. For novels, EPUB files are easily produced by automatic conversion from XML to HTML, where CSS class styles derive directly from the XML structure. The display result is very good as far as styling is concerned, except for typographic quality, where we have lost so many of the rules that enable a rich reading experience.

But let's say that with HTML5 and CSS3 we have a good path for achieving text presentation almost as fine as on paper. Readers can understand what they read well, as styling brings them the meaning of the different items of text, as on a printed page.

But on the semantics side of the game, we have lost almost everything!

Of course, a dynamic table of contents can be built and footnotes can be characterized easily to enable new behaviors like pop-ups.

But all the semantics we had in the XML vocabulary is lost, already in the case of novels, but insanely so in the case of all the other kinds of vocabulary we use for our content resources: cooking, wine, travelling, gardening, education, law, dictionaries, and so on...

3)      Where to go?

What I dream of for the future of EPUB and metadata can be expressed in 2 points:


-          A lossless semantic EPUB

What we need to be able to do is a one-to-one semantic conversion from any XML resource to EPUB.

With the OWP technology, it must be possible to preserve all the structure information from the highest level in the source down to the tiniest semantic inflexion. It is neither HTML tags nor CSS classes that will fulfill this goal, but rather some kind of semantic tagging like RDFa, plus all the vocabularies available in schema.org.
Here, there is no need for new standards since almost all of this is already possible.

To produce such an EPUB would be as easy as today's XML to EPUB conversion, adding only the proper semantic tagging from the XML vocabulary to the HTML5 content documents. The result would be a perfectly structured EPUB with good presentation as today with CSS, plus all its initial semantics.
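
A minimal sketch of such a conversion, written in Python under loud assumptions: the source element names (recipe, title, cuisine, ingredient, step) and the MAPPING table are invented for illustration and are not Hachette's DTD. The RDFa attributes (vocab, typeof, property) and the schema.org Recipe properties used (name, recipeCuisine, recipeIngredient, recipeInstructions) do exist; which of them a real workflow would choose is a guess.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical publisher XML; element names are invented for illustration.
SOURCE = """<recipe>
  <title>Omelette</title>
  <cuisine>French</cuisine>
  <ingredient>3 eggs</ingredient>
  <ingredient>10 g butter</ingredient>
  <step>Beat the eggs.</step>
  <step>Cook in butter.</step>
</recipe>"""

# Assumed mapping: source element -> (HTML tag, schema.org property, list wrapper).
MAPPING = {
    "title": ("h1", "name", None),
    "cuisine": ("p", "recipeCuisine", None),
    "ingredient": ("li", "recipeIngredient", "ul"),
    "step": ("li", "recipeInstructions", "ol"),
}


def to_html(xml_text):
    """Convert the source XML to an HTML5 fragment carrying RDFa attributes."""
    root = ET.fromstring(xml_text)
    out = ['<section vocab="http://schema.org/" typeof="Recipe">']
    open_list = None  # tag of the list wrapper currently open, if any
    for child in root:
        tag, prop, wrapper = MAPPING[child.tag]  # KeyError on unmapped elements
        if wrapper != open_list:
            if open_list:
                out.append(f"</{open_list}>")
            if wrapper:
                out.append(f"<{wrapper}>")
            open_list = wrapper
        text = escape(child.text or "")
        # class keeps the source name for CSS; property carries the semantics.
        out.append(f'<{tag} class="{child.tag}" property="{prop}">{text}</{tag}>')
    if open_list:
        out.append(f"</{open_list}>")
    out.append("</section>")
    return "\n".join(out)


html_fragment = to_html(SOURCE)
print(html_fragment)
```

The only addition compared with a plain XML to HTML conversion is the property attribute (and the typeof on the container), which is the sense in which the extra effort could stay small.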

Then this EPUB, as a published product, could become the universal patrimonial asset, since by construction it would be possible to export XML back from it.
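
To make the "export back" idea concrete, here is a small Python sketch that rebuilds source XML from an RDFa-tagged XHTML fragment. The fragment, the element names, and the two lookup tables are hypothetical. The sketch only works because the assumed mapping between source elements and properties is one-to-one; if two source elements shared a property, the export would no longer be lossless.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-document fragment (well-formed XHTML with RDFa).
XHTML = """<section vocab="http://schema.org/" typeof="Recipe">
<h1 property="name">Omelette</h1>
<ul>
<li property="recipeIngredient">3 eggs</li>
<li property="recipeIngredient">10 g butter</li>
</ul>
</section>"""

# Assumed inverse mappings back to the publisher's own element names.
TYPE_TO_ELEMENT = {"Recipe": "recipe"}
PROPERTY_TO_ELEMENT = {"name": "title", "recipeIngredient": "ingredient"}


def export_xml(xhtml_text):
    """Rebuild the source XML from the RDFa properties found in the XHTML."""
    html_root = ET.fromstring(xhtml_text)
    xml_root = ET.Element(TYPE_TO_ELEMENT[html_root.get("typeof")])
    for el in html_root.iter():  # document order
        prop = el.get("property")
        if prop in PROPERTY_TO_ELEMENT:
            child = ET.SubElement(xml_root, PROPERTY_TO_ELEMENT[prop])
            child.text = "".join(el.itertext())
    return ET.tostring(xml_root, encoding="unicode")


exported = export_xml(XHTML)
print(exported)
```

Presentation-only markup such as the <ul> wrapper is dropped on the way back, which is fine as long as nothing in the source depended on it.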

What a Reading System could do with that is an R&D subject, where such richness of semantics could be kept hidden in many RS and blossom in some, to the great benefit of readers...


-          EPUB with doors and windows

Beyond the reading experience, what use could this highly enriched semantic EPUB bring if it is only visible inside a reading system? Obviously, so much metadata on the content itself should also be made available for ebook search and discovery.

This raises the question of making some metadata inside the content visible to web sites.

This is difficult today, when content protection is mandatory, but why not manage doors and windows in this ZIP bunker?
Windows are already available to look from inside to external resources. The reverse is not so easy, but there CFI has a role to play.

But if we could add doors to get in and out with specific protocols, we could make most of the inside metadata usable by web sites:

-          global book metadata: Dublin Core or an included ONIX file with reviews, supporting resources such as images, audio and video

-          chunk metadata: chapter descriptions with keywords, exercises in textbooks with their characteristics, points of interest in travel guides, etc.

-          named entities: at the tiniest level, metadata on all named entities in the middle of the text
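
For the first of these levels, a "door" could be as simple as the following Python sketch, which reads the Dublin Core elements from the package document of an unprotected EPUB. The function names read_dublin_core and demo_epub and the sample book are hypothetical; the container and package namespaces are the standard EPUB ones. The sketch says nothing about content protection or the permission mechanism, and chunk-level or named-entity metadata would require parsing the content documents as well.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_dublin_core(epub_file):
    """Return {Dublin Core element name: [values]} from an EPUB's package file."""
    with zipfile.ZipFile(epub_file) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(z.read(opf_path))
    meta = {}
    for el in package.find("opf:metadata", NS):
        if el.tag.startswith("{" + NS["dc"] + "}"):
            name = el.tag.split("}", 1)[1]
            meta.setdefault(name, []).append((el.text or "").strip())
    return meta


def demo_epub():
    """Build a tiny in-memory EPUB skeleton (hypothetical sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip")
        z.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            "</container>",
        )
        z.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
            '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
            "<dc:title>Sample Cookbook</dc:title>"
            "<dc:language>fr</dc:language>"
            "</metadata></package>",
        )
    buf.seek(0)
    return buf


metadata = read_dublin_core(demo_epub())
print(metadata)
```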

Not all of these are relevant, and we have to figure out where to stop between the global and the detailed ones. But this should be decided by the publisher, depending on the ebook, through a permission mechanism.


To conclude, besides the ONIX metadata we already use for ebook distribution as separate external feeds from publishers to retailers, metadata inside the ebooks is IMO a subject of growing importance, with benefits for every use of ebook distribution: business and innovation, reading experience and new usages, content creation and reuse.


Luc AUDRAIN
_______________________________
Hachette Livre,
Direction Innovation et Technologie Numérique
Head of Digitalization
Téléphone : 01 41 23 63 70
Mobile : 06 48 38 21 41
11, rue Paul Bert 92247 Malakoff Cedex
Received on Tuesday, 20 May 2014 14:47:01 UTC