Content semantic discussion from AUDRAIN LUC on 2014-05-20 (public-digipub-ig@w3.org from May 2014)

From: AUDRAIN LUC <LAUDRAIN@hachette-livre.fr>
Date: Tue, 20 May 2014 11:58:01 +0200
To: W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <CBB672840B193740B8B735DDD60B2B6D21E10B01@HLVALCCR.intrahl.com>

Hi,

Following Ivan's request in one of the recent calls, I've written the following text about my vision of "Content semantics".

Here it is for your comments.

Best,

Luc

________________________________

1) Publishing eventually moved to structured content

It took publishers years to establish structured content creation processes, and they eventually succeeded once printed books were produced from, or in parallel with, database or XML resources.

It all started in the 80's with SGML, especially for heavily structured content (industry documentation) and, in publishing, for dictionaries and encyclopaedias, law books, ...

With the advent of XML in the late 90's, it spread to more and more types of content. Publishers have been writing DTDs for their book content as it became clear that composition and reusability would benefit from XML-enabled processes: composition because efficient batch software can build hundreds of pages, reusability because content can be syndicated from XML.

One example: for novels, Hachette Livre started in 2000 to ask composition suppliers to deliver, besides the print PDF, the XML file of the content using a dedicated text content DTD (similar to DocBook). No invalid XML file was accepted for archiving, and the supplier wasn't paid until the XML structure described the book content perfectly.

In parallel, the 90's saw the development of digital products and, even before EPUB, CD-ROM applications and web sites could be built quickly through XML to HTML conversion.

Then came the smartphone and mobile apps, where developers asked publishers for the book content in XML to populate the app.

This leads us to the present where we can say that structured content processes are mainstream, even if not completely generalized.

2) Back to the stone age in web-based content?

So where is the problem?

Digital products can be built easily with the great Open Web Platform, using HTML and CSS. For novels, EPUB files are easily produced by automatic conversion from XML to HTML, where CSS class styles derive directly from the XML structure. The displayed result is very good as far as styling is concerned, except for typographic quality, where we have lost so many of the rules that enable a rich reading experience.

But let's say that with HTML5 and CSS3 we have a good path to text presentation almost as fine as on paper. Readers can understand what they read well, because styling conveys the meaning of the different items of text, as on a printed page.

But on the semantics side of the game, we have lost almost everything!

Of course, a dynamic table of contents can be built, and footnotes can be characterized easily to enable new behaviors like pop-ups.

But all the semantics we had in the XML vocabulary are lost, already in the case of novels, and far more so in the case of all the other kinds of vocabulary we use for our content resources: cooking, wine, travelling, gardening, education, law, dictionaries, and so on...

3) Where to go?

What I dream of for the future of EPUB and metadata can be expressed in 2 points:

- A lossless semantic EPUB

What we need to be able to do is a one-to-one semantic conversion from any XML resource to EPUB.

With OWP technology, it must be possible to preserve all the structural information, from the highest level in the source down to the tiniest semantic inflexion. Neither HTML tags nor CSS classes will fulfill this goal, but some kind of semantic tagging like RDFa certainly will, together with all the vocabularies available in schema.org.
Here there is no need for new standards, as almost everything is already possible.

Producing such an EPUB would be as easy as today's XML to EPUB conversion, adding only the proper semantic tagging from the XML vocabulary to the HTML5 content documents. The result would be a perfectly structured EPUB with good presentation through CSS, as today, plus all its initial semantics.
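To make this concrete, here is a minimal sketch of such a conversion in Python, for illustration only. The source vocabulary (recipe, title, ingredients, ingredient, method) and the MAPPING table are invented for the example and are not one of our DTDs; the RDFa attributes (vocab, typeof, property) and the schema.org terms (Recipe, name, recipeIngredient, recipeInstructions) are the only parts taken from existing specifications.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a publisher XML vocabulary to HTML + RDFa.
# source element name -> (HTML tag, RDFa attribute, schema.org term)
MAPPING = {
    "recipe": ("section", "typeof", "Recipe"),
    "title": ("h1", "property", "name"),
    "ingredients": ("ul", None, None),
    "ingredient": ("li", "property", "recipeIngredient"),
    "method": ("p", "property", "recipeInstructions"),
}


def to_html(src):
    """Convert one source element and its descendants to an HTML element."""
    tag, attr, term = MAPPING.get(src.tag, ("div", None, None))
    # The class still drives the CSS, exactly as in today's conversions.
    out = ET.Element(tag, {"class": src.tag})
    if attr is not None:
        # The RDFa attribute carries the semantics that a class alone loses.
        out.set(attr, term)
    out.text = src.text
    out.tail = src.tail
    for child in src:
        out.append(to_html(child))
    return out


source = ET.fromstring(
    "<recipe><title>Ratatouille</title>"
    "<ingredients><ingredient>2 aubergines</ingredient>"
    "<ingredient>3 tomatoes</ingredient></ingredients>"
    "<method>Slice, season and bake.</method></recipe>"
)

html = to_html(source)
html.set("vocab", "http://schema.org/")  # declares the vocabulary for the terms
print(ET.tostring(html, encoding="unicode"))
```

A real conversion would of course be driven by the publisher's own DTD and would need a richer mapping than one term per element.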

Then this EPUB, as a published product, could become the universal patrimonial asset, since by construction it would be possible to export XML back from it.
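A rough sketch of that export back to XML, again in Python and for illustration only. It assumes a convention that is mine and not part of any standard: every element of the content document carries an RDFa typeof or property whose term mirrors the source element name, in a made-up publisher vocabulary (http://example.org/cookbook#).

```python
import xml.etree.ElementTree as ET

# Sample content document fragment. The vocabulary URL and its terms are
# hypothetical; they simply mirror the element names of the source XML.
xhtml = (
    '<section vocab="http://example.org/cookbook#" typeof="recipe">'
    '<h1 property="title">Ratatouille</h1>'
    '<ul property="ingredients">'
    '<li property="ingredient">2 aubergines</li>'
    '<li property="ingredient">3 tomatoes</li>'
    "</ul></section>"
)


def to_xml(node):
    """Rebuild the source element from the RDFa term carried by an HTML element."""
    name = node.get("property") or node.get("typeof")
    if name is None:
        # Without a term there is nothing to rebuild from: the round trip is
        # only lossless if every element was tagged during the conversion.
        raise ValueError("untagged element: " + node.tag)
    out = ET.Element(name)
    out.text = node.text
    out.tail = node.tail
    out.extend([to_xml(child) for child in node])
    return out


restored = to_xml(ET.fromstring(xhtml))
print(ET.tostring(restored, encoding="unicode"))
```

Attributes of the source XML are not handled here; they would need their own convention.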

What a Reading System could do with that is an R&D subject: such semantic richness could stay hidden in many RS and blossom in some, to the great benefit of readers...

- EPUB with doors and windows

Beyond the reading experience, what use would this highly enriched semantic EPUB be if it is only visible inside a reading system? Obviously, so much metadata on the content itself should also be made available for ebook search and discovery.

This raises the question of making some metadata inside the content visible to web sites.

This is difficult today, when content protection is mandatory, but why not manage doors and windows in this ZIP bunker?
Windows are already available to look from inside at external resources. The reverse is not so easy, but there CFIs have a role to play.

But if we could add doors to get in and out with specific protocols, we could make most of the inside metadata usable by web sites:

- global book metadata: Dublin Core or an included ONIX file with reviews and supporting resources such as images, audio and video

- chunk metadata: chapter descriptions with keywords, exercises in text books with their characteristics, points of interest in travel guides, etc.

- named entities: at the tiniest level, metadata on all named entities in the middle of the text

Not all of these are relevant, and we have to figure out where to stop between the global and the detailed ones. But this should be decided by the publisher for each ebook, through a permission mechanism.
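For the first level in the list above, part of the door already exists: the Dublin Core fields sit in the package document and can be read without touching the content documents. A minimal sketch in Python, for illustration only. The function name book_metadata and the tiny sample built in memory are invented for the example; the sample is not a complete valid EPUB (no mimetype entry, no manifest), and no protection or permission mechanism is modelled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def book_metadata(epub_file):
    """Return the Dublin Core fields of the package document as a dict of lists."""
    with zipfile.ZipFile(epub_file) as z:
        # container.xml points to the package document (the OPF file).
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(z.read(opf_path))
    fields = {}
    dc_prefix = "{" + NS["dc"] + "}"
    for el in package.find("opf:metadata", NS):
        if el.tag.startswith(dc_prefix):
            fields.setdefault(el.tag[len(dc_prefix):], []).append(el.text)
    return fields


# Tiny in-memory sample, just enough for the function above.
CONTAINER = (
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container"><rootfiles>'
    '<rootfile full-path="OEBPS/package.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)
OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Sample Travel Guide</dc:title><dc:language>en</dc:language>"
    "<dc:subject>travel</dc:subject><dc:subject>Paris</dc:subject>"
    "</metadata></package>"
)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("META-INF/container.xml", CONTAINER)
    z.writestr("OEBPS/package.opf", OPF)

meta = book_metadata(buf)
print(meta)
```

The chunk and named-entity levels have no such ready-made location today; they would have to be read from the semantic tagging inside the content documents, subject to the publisher's permissions.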

To conclude: besides the ONIX metadata we already use for ebook distribution, as separate external feeds from publishers to retailers, metadata inside the ebooks is IMO a subject of growing importance that benefits every aspect of ebook distribution: business and innovation, reading experience and new usages, content creation and reuse.

Luc AUDRAIN
_______________________________
Hachette Livre,
Direction Innovation et Technologie Numérique
Head of Digitalization
Téléphone : 01 41 23 63 70
Mobile : 06 48 38 21 41
11, rue Paul Bert 92247 Malakoff Cedex

Received on Tuesday, 20 May 2014 09:58:33 UTC