Content semantic discussion

Hello All,

A little delayed in posting this...

This posting is music to my ears! In 1992, I presented at the SGML conference in the Boston area. The vision was that richly marked-up content could be made accessible to persons who are blind. Charles Goldfarb came up to me and said, “finally SGML is being used to do something other than document weapons of mass destruction.”

The vision is true today, but the audience is far larger than I ever envisioned.

It is not just for blind folks, but for persons with all kinds of disabilities; the dyslexic population can benefit greatly, and there are other groups with learning disabilities and other learning differences who can benefit from alternative presentations.

I believe the market can grow by at least 10 to 20% by offering publications whose audio is synchronized with the visual presentation. (Audio synced with refreshable braille is also achievable.) This dual-modality output can be accomplished in a variety of ways using EPUB 3 today. The reason for the potential market growth is the increased comprehension offered by the dual-modality option, which makes reading joyful for a wide range of people.
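
For those who want the mechanics: EPUB 3 accomplishes this synchronization with Media Overlays, SMIL documents that pair each text fragment with an audio clip so a Reading System can highlight the text while the narration plays. A minimal sketch, with hypothetical file names and timings (the package manifest wiring is omitted):

    <smil xmlns="http://www.w3.org/ns/SMIL"
          xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
      <body>
        <!-- Each par pairs one text fragment with one audio clip; the
             Reading System highlights the fragment while it plays. -->
        <par id="p1">
          <text src="chapter1.xhtml#para1"/>
          <audio src="audio/chapter1.mp3" clipBegin="0:00:00" clipEnd="0:00:07"/>
        </par>
        <par id="p2">
          <text src="chapter1.xhtml#para2"/>
          <audio src="audio/chapter1.mp3" clipBegin="0:00:07" clipEnd="0:00:15"/>
        </par>
      </body>
    </smil>
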
Best
George



From: AUDRAIN LUC [mailto:LAUDRAIN@hachette-livre.fr] 
Sent: Tuesday, May 20, 2014 3:58 AM
To: W3C Digital Publishing IG
Subject: Content semantic discussion

Hi,

Following Ivan’s request on one of the last calls, I’ve written the following text about my vision of “Content semantics”.

Here it is for your comments.

Best,
Luc

________________________________________
1) Publishing eventually moved to structured content

It took publishers years to establish structured content creation processes, and they eventually succeeded: printed books are now produced from, or in parallel with, database or XML resources.

It all started in the 80s with SGML, especially for heavily structured content (industrial documentation) and, in publishing, dictionaries, encyclopaedias, law books, and so on.

With the advent of XML in the late 90s, this grew to cover more and more types of content. Publishers wrote DTDs for their book content as it became clear that composition and reusability would benefit from XML-enabled processes: composition, because efficient batch software can build hundreds of pages; reusability, because content can be syndicated from XML.

One example: for novels, Hachette Livre started in 2000 asking composition suppliers to deliver, besides the print PDF, an XML file of the content using a dedicated text-content DTD (similar to DocBook). No invalid XML file was accepted for archiving, and the supplier wasn’t paid until the XML structure described the book content perfectly.
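
As an illustration only (the actual Hachette DTD is not reproduced here), such a text-content DTD might look like this much-simplified, hypothetical sketch:

    <!-- Hypothetical, much-simplified novel DTD in the DocBook spirit -->
    <!ELEMENT book     (title, author, chapter+)>
    <!ELEMENT title    (#PCDATA)>
    <!ELEMENT author   (#PCDATA)>
    <!ELEMENT chapter  (title, para+)>
    <!ELEMENT para     (#PCDATA | emphasis | quote)*>
    <!ELEMENT emphasis (#PCDATA)>
    <!ELEMENT quote    (#PCDATA)>

A validating parser (for instance, xmllint --dtdvalid novel.dtd book.xml) can then enforce the “no invalid file accepted” rule automatically.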

In parallel, the 90s saw the development of digital products, and, even before EPUB, CD-ROM applications and web sites could be built quickly on top of XML-to-HTML conversion.

Then came smartphones and mobile apps, whose developers asked publishers for the book content in XML to populate the apps’ databases.

This leads us to the present, where we can say that structured content processes are mainstream, even if not yet completely generalized.

2) Back to the stone age in web-based content?

So where is the problem?

Digital products can be built easily with the great Open Web Platform, using HTML and CSS. For novels, EPUB files are easily produced by automatic conversion from XML to HTML, where the CSS class styles derive directly from the XML structure. The display result is very good as far as styling is concerned, except for typographic quality, where we have lost many of the rules that enable a rich reading experience.
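
Such a conversion is typically an XSLT transform. A minimal sketch against a hypothetical novel vocabulary, deriving each CSS class directly from the source element it came from:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical sketch: map a DocBook-like novel vocabulary to HTML,
         with CSS class names derived from the XML structure. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="chapter">
        <section class="chapter"><xsl:apply-templates/></section>
      </xsl:template>
      <xsl:template match="chapter/title">
        <h1 class="chapter-title"><xsl:apply-templates/></h1>
      </xsl:template>
      <xsl:template match="para">
        <p class="para"><xsl:apply-templates/></p>
      </xsl:template>
      <xsl:template match="emphasis">
        <em class="emphasis"><xsl:apply-templates/></em>
      </xsl:template>
    </xsl:stylesheet>
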

But let’s say that with HTML5 and CSS3 we have a good path for achieving text presentation almost as fine as on paper. Readers can understand what they read, as styling conveys to them the meaning of the different items of text, just as on a printed page.

But on the semantics side of the game, we have lost almost everything!

Of course, a dynamic table of contents can be built, and footnotes can easily be characterized to enable new behaviours like pop-ups.
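
In EPUB 3, that characterization relies on the epub:type attribute from the Structural Semantics Vocabulary. An illustrative fragment (the html element must declare xmlns:epub="http://www.idpf.org/2007/ops"):

    <!-- A Reading System that recognizes these semantics can show the
         aside as a pop-up and remove it from the normal text flow. -->
    <p>Call me Ishmael.<a epub:type="noteref" href="#fn1">1</a></p>
    <aside epub:type="footnote" id="fn1">
      <p>A famous opening line, used here purely as an example.</p>
    </aside>
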

But all the semantics we had in the XML vocabulary are lost: already in the case of novels, but far more dramatically in the case of all the other vocabularies we use for our content resources: cooking, wine, travel, gardening, education, law, dictionaries, and so on…

3) Where to go?

What I dream of for the future of EPUB and metadata can be expressed in two points:

- A lossless semantic EPUB

What we need is the ability to do a one-to-one semantic conversion from any XML resource to EPUB.

With OWP technology, it must be possible to preserve all the structural information, from the highest level in the source down to the tiniest semantic inflection. It is neither HTML tags nor CSS classes that will fulfil this goal, but certainly some kind of semantic tagging like RDFa, plus all the vocabularies available on schema.org.
Here there is no need for new standards: almost everything is already possible.

Producing such an EPUB would be as easy as today’s XML-to-EPUB conversion, adding only the proper semantic tagging, derived from the XML vocabulary, to the HTML5 content documents. The result would be a perfectly structured EPUB with good CSS-based presentation, as today, plus all of its initial semantics.
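
A sketch of what that mapping could look like, using a hypothetical cooking vocabulary and the schema.org Recipe type (RDFa attributes: vocab, typeof, property):

    <!-- Hypothetical source XML from a cookbook vocabulary -->
    <recipe>
      <name>Tarte Tatin</name>
      <ingredient>6 apples</ingredient>
    </recipe>

    <!-- Possible HTML5 content document: the same semantics, carried by
         RDFa and schema.org, plus CSS classes for presentation -->
    <section vocab="http://schema.org/" typeof="Recipe" class="recipe">
      <h2 property="name">Tarte Tatin</h2>
      <p property="recipeIngredient" class="ingredient">6 apples</p>
    </section>

Because each HTML element keeps a one-to-one trace of its source element, the original XML could in principle be regenerated from the EPUB.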

Then this EPUB, as a published product, could become the universal archival asset, since by construction it would be possible to export the XML back out of it.

What a Reading System could do with that is an R&D subject: such semantic richness could stay hidden in many Reading Systems and blossom in some, to the greatest benefit of readers…

- EPUB with doors and windows

Beyond the reading experience, what use could this highly enriched semantic EPUB bring if it is visible only inside a Reading System? Obviously, so much metadata about the content itself should also be made available for ebook search and discovery.

This raises the question of making some of the metadata inside the content visible to web sites.

This is difficult today, when content protection is mandatory, but why not manage doors and windows in this ZIP bunker?
Windows are already available for looking from the inside out at external resources. The reverse is not so easy, but there CFIs (EPUB Canonical Fragment Identifiers) have a role to play.
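
For instance, a CFI appended to a publication’s URL can point from the outside to a precise location inside it. An illustrative, hypothetical reference:

    book.epub#epubcfi(/6/4[chap01]!/4/10/3:15)

(read roughly as: the spine item with id chap01, then a path down to a character offset inside one of its paragraphs)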

But if we could add doors to get in and out, with specific protocols, we could make most of the inside metadata usable by web sites (the package-level sketch after this list shows the first of these):
- global book metadata: Dublin Core, or an included ONIX file with reviews and supporting resources such as images, audio and video
- chunk metadata: chapter descriptions with keywords, exercises in textbooks with their characteristics, points of interest in travel guides, etc.
- named entities: at the tiniest level, metadata on all the named entities within the text
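
The global level already has a natural carrier: the Dublin Core elements of the EPUB 3 package document. A minimal sketch with hypothetical values:

    <package xmlns="http://www.idpf.org/2007/opf"
             version="3.0" unique-identifier="uid">
      <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
        <!-- Global, Dublin Core book metadata -->
        <dc:identifier id="uid">urn:isbn:9780000000000</dc:identifier>
        <dc:title>An Example Novel</dc:title>
        <dc:creator>A. Author</dc:creator>
        <dc:language>fr</dc:language>
        <meta property="dcterms:modified">2014-05-20T00:00:00Z</meta>
      </metadata>
      <!-- manifest and spine omitted -->
    </package>
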

Not all of these are relevant, and we have to figure out where to stop between the global and the detailed levels. But this should be decided by the publisher, ebook by ebook, with a permission mechanism.


To conclude: besides the ONIX metadata we already use for ebook distribution, as separate external feeds from publishers to retailers, metadata inside the ebooks is IMO a growing subject that can benefit every aspect of ebook distribution and use: business and innovation, reading experience and new usages, content creation and reuse.


Luc AUDRAIN
_______________________________
Hachette Livre,
Direction Innovation et Technologie Numérique
Head of Digitalization
Telephone: 01 41 23 63 70
Mobile: 06 48 38 21 41
11, rue Paul Bert 92247 Malakoff Cedex

Received on Monday, 2 June 2014 15:24:18 UTC