Re: Content semantic discussion from Stroup, David on 2014-06-02 (public-digipub-ig@w3.org from June 2014)

From: Stroup, David <david.stroup@pearson.com>
Date: Mon, 2 Jun 2014 12:38:19 -0400
To: Bill Kasdorf <bkasdorf@apexcovantage.com>
Cc: Ivan Herman <ivan@w3.org>, Luc Audrain <LAUDRAIN@hachette-livre.fr>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <CAC2-SmAXsyGDiP+fhPgDpo3jnPJATkEU=Q-zQXUqS_Cc2qSBow@mail.gmail.com>
It's no secret, based on what we (Pearson) initially submitted to the IDPF as
an educational profile for ePUB, that we leveraged @class for semantics.
There were several reasons for this, most of which have already been stated,
but here they are:

   - Easiest to implement...minimal restrictions on how @class could be
   used...we weren't going to get validation errors.
   - Easiest to style...we style our content based on semantics so this
   works well with CSS...using another attribute complicates CSS development.
   - Easiest to author (and style)...the authoring tools currently
   available for non-technical editors understand @class and make it
   easily available to modify or to build default authoring
   patterns/templates around. RDFa and @role are great options, but fully
   realizing the value they bring would have required far more development.

I am a firm believer that we need a standard way to identify
semantics, as it will reduce the guesswork and increase implementation
overall...which will enable browsers, eReaders and accessibility tools to
leverage those semantics in a standard and consistent way.

No matter where we end up placing the semantics, if it's not a dedicated
attribute, are we going to want a prefix in the naming convention to
indicate which vocabulary the value belongs to? I foresee multiple values
from different vocabularies existing in one attribute being confusing and
hard to parse. Thinking out loud: possibly an interim solution would be to
have a naming convention with a prefix that can be applied to @role or @class
or other until a definitive (and possibly specialized) attribute is identified.
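
To make the prefix idea concrete, here is a minimal sketch (in Python, purely
for illustration) of how a consumer might sort the tokens of one attribute
value by vocabulary. The prefixes "edu" and "epub", the hyphen separator and
the function name are all assumptions of mine, not part of any submitted or
agreed convention.

```python
from collections import defaultdict


def group_by_vocabulary(attr_value, known_prefixes, separator="-"):
    """Group whitespace-separated attribute tokens by vocabulary prefix.

    Tokens whose leading segment is not a known prefix (hypothetical ones
    here) are kept whole under "unprefixed", so ordinary styling classes
    such as "left-column" are not mistaken for semantic terms.
    """
    groups = defaultdict(list)
    for token in attr_value.split():
        # Split only at the first separator: "edu-learning-objective"
        # becomes prefix "edu" and term "learning-objective".
        prefix, sep, term = token.partition(separator)
        if sep and term and prefix in known_prefixes:
            groups[prefix].append(term)
        else:
            groups["unprefixed"].append(token)
    return dict(groups)


# Example: one @class value carrying terms from two assumed vocabularies
# plus a plain styling class.
result = group_by_vocabulary(
    "edu-learning-objective epub-chapter left-column", {"edu", "epub"}
)
```

The same tokens would still work as ordinary CSS class selectors, which is
the reason a prefix convention could serve as an interim step.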

Ease of implementation and perceived value are what will drive
adoption.

Wherever the standard place to apply semantics eventually ends up
(ideally the same place for web and ePub), Pearson is committed to following it.



*David Stroup *Sr. Content Architect
Core Platforms & Enterprise Architecture

*https://neo.pearson.com/groups/content-architecture
<https://neo.pearson.com/groups/content-architecture> *
M: (585) 708-9651
E: david.stroup@pearson.com
Skype: xmlstroup
Google Chat: david.stroup@pearson.com


*Pearson *Always Learning
Learn more at www.pearson.com
Join PXE Nation <https://neo.pearson.com/groups/pxe-faqs>
Join Content Architecture
<https://neo.pearson.com/groups/content-architecture>


On Wed, May 21, 2014 at 2:45 PM, Bill Kasdorf <bkasdorf@apexcovantage.com>
wrote:

> Yes, precisely. I well know that @class is for styling, but given what we
> are dealing with right now we have not felt that there was another place
> these distinctions could be expressed, feeling that @role and @data are
> even more explicitly reserved for specific uses.
>
> Not to put them on the spot or anything ;-) but if I'm recalling
> correctly, both Pearson (Dave S, Paul?) and Hachette (Dave C?) wound up
> defaulting to the same strategy.
>
> I should point out that in such contexts, the structural semantic
> distinctions often DO have styling implications anyway.
>
> One such use case that I keep forgetting to bring up is "false-color
> proofing" for enabling editors and production people who are XML-averse to
> easily check whether tagging has been done properly. The classic example:
> say drug names are rendered normally in italic, but otherwise are just the
> text color (black). And there may be hundreds of them in a given title.
> Creating a "QC CSS" that makes a drug name that has been properly tagged as
> a drug name rendered as italic _and purple_ lets somebody like that just
> see at a glance if one has just been tagged as generically italic (and thus
> is black, not purple). A good QC CSS can reveal all the nuances of your
> coding to non-technical people, and is often a very useful complement to
> parsing and Schematron (we recommend all three, ideally).
>
> -----Original Message-----
> From: Ivan Herman [mailto:ivan@w3.org]
> Sent: Wednesday, May 21, 2014 4:58 AM
> To: Luc Audrain
> Cc: Bill Kasdorf; W3C Digital Publishing IG
> Subject: Re: Content semantic discussion
>
>
> On 20 May 2014, at 18:08 , AUDRAIN LUC <LAUDRAIN@hachette-livre.fr> wrote:
>
> > Hi Bill,
> >
> > Old, yes, enough to have celebrated the 10th anniversary of SGML at
> > Boston SGML96.
> >
> > Thank you for the semantic clarification of the term "semantics".
> > I agree with you and I am thankful to the DPUB WG for contributing to this
> distinction between "structural semantic", for which the @role attribute
> seems to be recommended, and "content semantic", for which RDFa stuff can
> be used.
> >
> > I wouldn't use @class for that. Why not the reverse: CSS styling
> depending on @role value and RDFa?
>
> Right, absolutely. I know that, in general, @class can be used for other
> things but, in fact, @class is mostly used for styling and I prefer to keep
> to that.
>
> (I guess the fact that schema.org does not use @class for what we call
> here 'semantic enhancement', but uses MD/RDFa, or possibly embedded JSON-LD
> via a <script> element, shows where the preferences...)
>
> Ivan
>
>
> >
> > Best,
> > Luc
> >
> > From: Bill Kasdorf [mailto:bkasdorf@apexcovantage.com]
> > Sent: Tuesday, 20 May 2014 16:46
> > To: AUDRAIN LUC; W3C Digital Publishing IG
> > Subject: RE: Content semantic discussion
> >
> > This is a fantastic summary and overview, Luc. I didn't think you were
> old enough to know all that history! ;-) I, for one, have personally lived
> through the entire evolution you described, and you nailed it.
> >
> > My favorite line in your post is "a lossless semantic EPUB," and my
> favorite word in that line is "lossless." PRECISELY. I struggle with this
> with virtually every client I work with. It is tragic to see how much
> meaning is lost in most workflows: meaning that was there at one point,
> perhaps recreated (more work!) at a different point, and discarded in the
> end.
> >
> > One reason you have such good perspective on this is that you deal
> > with such diversity of content at Hachette. Within any of the
> > disciplines or areas of interest you describe ("cooking, wine,
> > travelling, gardening, education, law, dictionaries, and so on") there
> > are vocabularies that are very meaningful and quite specific to each
> > of them. Those are the terms that must be preserved. (Full disclosure:
> > I stick 'em in @class attributes right now because there's no other
> > place to put them, and I hope not to get pilloried or shot by those
> > who consider that a no-no.)
> >
> > I want to repeat a distinction I made on a call a couple of weeks ago (I
> think it was the DPUB call, but it might have been the EDUPUB call): we
> have a semantic problem with the way we use the term "semantics." It causes
> no end of confusion when we use that same term to refer to distinctions
> made about the _document_ and distinctions made about the _content_. I
> suggested we use these terms:
> >
> > --Reserve the term "semantic enhancement" to mean characterization of
> the _content_. This is what microdata and RDFa are for; this goes all the
> way down to the phrase level for entities like events, names, places, etc.
> mentioned in the content.
> >
> > --Use the term "structural refinement" for distinctions made about
> components of the _document_, quite apart from the content of those
> components.
> >
> > I still think that's a good idea, but in defense of your continuing to
> stress how these things blur together, I will acknowledge that you're right
> about that, and give an obvious example. One of your examples was cooking.
> Cookbooks have chunks of content that are "ingredients." Typically they are
> structured as lists, and those lists have not only "ingredients" but
> "quantities" in them. And then there is typically a section describing the
> steps you would use to go about preparing that particular dish. And there
> are other semantically distinguished chunks as well, things like the
> cuisine, serving portions, level of difficulty, time it takes to prepare,
> etc. All that is critical information in the context of a cookbook. We
> shouldn't lose those semantic distinctions.
> >
> > Here's where I think it is very helpful to make the distinction between
> semantic enhancement and structural refinement. That list of ingredients
> may look very much like a definition list. Semantically, it's not a
> definition list. But in many many cases it would be marked up as a <dl>
> with the quantity in the <dt> and the ingredient in the <dd>. Do we need
> different names for those things? I would argue no, we just need a way to
> preserve what we _mean_ by them in this context. That's the semantic
> enhancement (as opposed to the structural
> refinement. (BTW this is apart from the issue of how well CSS enables us to
> format the damn thing.) Because (a) recipes are only one of many
> applications for this same construct (a lab manual will have many of the
> same structures as a cookbook, but the things aren't called by the same
> names), and because (b) not all recipes use the list format for ingredients
> (the classic example: Julia Child's _Mastering the Art of French
> Cooking_, an ironic example for me to give to a Frenchman ;-), puts the
> ingredients as marginal notes alongside the instructions, which is a kind
> of "step list").
> >
> > Separating the concepts of semantic enhancement from structural
> refinement enables the lab manual and the cookbook to use the same
> structure for a list of ingredients with quantities required for a recipe
> vs. a list of equipment and capacities required for a procedure but to call
> them by different names, and for two cookbooks to label "ingredient"s and
> "quantity"s while using completely different structures for them.
> >
> > --Bill Kasdorf
> >
> > From: AUDRAIN LUC [mailto:LAUDRAIN@hachette-livre.fr]
> > Sent: Tuesday, May 20, 2014 5:58 AM
> > To: W3C Digital Publishing IG
> > Subject: Content semantic discussion
> >
> > Hi,
> >
> > Following Ivan's request in one of the last calls, I've written the
> following text about my vision of "Content semantics".
> >
> > Here it is for your comments.
> >
> > Best,
> > Luc
> >
> > 1)      Publishing eventually went to structured content
> >
> > It took publishers years to establish structured content creation
> processes, and they eventually succeeded once printed books were produced
> from, or in parallel to, database or XML resources.
> >
> > It all started in the 80's with SGML, especially on heavily structured
> > content (industry documentation) and, in publishing, on dictionaries,
> > encyclopaedias, law books, etc.
> >
> > With the advent of XML in the late 90's, this spread to more and
> more types of content. Publishers have been writing DTDs for their book
> content as it became clear that composition and reusability would benefit from
> XML-enabled processes: composition because of efficient batch software to
> build hundreds of pages, reusability because of the ability to syndicate
> content from XML.
> >
> > One example: for novels, Hachette Livre started in 2000 to ask
> composition suppliers to deliver, besides the print PDF, the XML file of the
> content using a dedicated text content DTD (similar to DocBook). No invalid
> XML file was accepted for archiving, and the supplier wasn't paid until the
> XML structure perfectly described the book content.
> >
> > In parallel, the 90's saw the development of digital products and, even before
> EPUB, CD-ROM applications and web sites could be built quickly on XML to
> HTML conversion.
> >
> > Then came the smartphone and mobile apps, where developers asked
> publishers for the book content in XML to populate the app base.
> >
> > This leads us to the present where we can say that structured content
> processes are mainstream, even if not completely generalized.
> >
> > 2)      Back to stone age in web based content?
> >
> > So where is the problem?
> >
> > Digital products can be built easily with the great Open Web Platform,
> using HTML and CSS. For novels, EPUB files are easily produced by
> automatic conversion from XML to HTML, where CSS class styles derive
> directly from the XML structure. The display result is very good as far as
> styling is concerned, except for typographic quality, where we have lost so
> many rules that enable a rich reading experience.
> >
> > But let's say that with HTML5 and CSS3 we have a good path to
> achieving text presentation almost as fine as on paper. Readers can readily
> understand what they read, as styling conveys the meaning of the different
> items of text, as on a printed page.
> >
> > But on the semantics side of the game, we have lost almost everything!
> >
> > Of course, a dynamic table of contents can be built and footnotes can be
> characterized easily to enable new behaviors like pop-ups.
> >
> > But all the semantics we had in the XML vocabulary are lost: already in
> > the case of novels, but far more so for all the other kinds of vocabulary
> > we use for our content resources: cooking, wine, travelling,
> > gardening, education, law, dictionaries, and so on.
> >
> > 3)      Where to go?
> >
> > What I dream of for the future of EPUB and metadata, can be expressed in
> 2 points:
> >
> > -          A lossless semantic EPUB
> >
> > What we need to be able to do is a one-to-one semantic conversion from
> any XML resource to EPUB.
> >
> > With the OWP technology, it must be possible to preserve all the
> structure information from the highest level in the source to the tiniest
> semantic inflexion. It is neither HTML tags nor CSS classes that will fulfill
> this goal, but certainly some kind of semantic tagging like RDFa, plus all
> available vocabularies in schema.org.
> > Here there is no need for new standards, as almost all of this is already
> possible.
> >
> > To produce such an EPUB would be as easy as today's XML to EPUB
> conversion, adding only the proper semantic tagging from the XML vocabulary
> to the HTML5 content documents. The result would be a perfectly structured
> EPUB with good presentation as today with CSS, plus all its initial
> semantics.
> >
> > Then this EPUB, as a published product, could become the universal
> patrimonial asset, as it would be possible by construction to export
> XML back from it.
> >
> > What a Reading System could do with that is an R&D subject, where such
> > semantic richness could stay hidden in many RSs and blossom in
> > some, for the greatest benefit of readers.
> >
> > -          EPUB with doors and windows
> >
> > Beyond the reading experience, what use could this highly enriched semantic
> EPUB be if it is only visible inside a reading system?
> Obviously, so much metadata on the content itself should also be made
> available for ebook search and discovery.
> >
> > This raises the question of making some metadata inside the content
> visible to web sites.
> >
> > This is difficult today, where content protection is mandatory, but why not
> manage doors and windows in this ZIP bunker?
> > Windows are already available to look from inside to external resources.
> The reverse is not so easy, but there CFI has a role to play.
> >
> > But if we could add doors to get in and out with specific protocols, we
> could make most of the inside metadata usable to web sites:
> > -          global book metadata: Dublin Core or an included ONIX file
> with reviews, supporting resources such as images, audio and video
> > -          chunk metadata: chapter descriptions with keywords, exercises
> in textbooks with their characteristics, points of interest in travel
> guides, etc.
> > -          named entities: at the tiniest level, metadata on all named
> entities in the middle of the text
> >
> > Not all of these are relevant, and we have to figure out where to stop
> between the global ones and the detailed ones. But this should be decided by the
> publisher for each ebook, with a permission mechanism.
> >
> >
> > To conclude, besides the ONIX metadata we already use for ebook
> distribution as separate external feeds from publishers to retailers,
> metadata inside the ebooks is IMO a growing subject of benefit to all aspects
> of ebook distribution: business and innovation, reading experience and
> new usages, content creation and reuse.
> >
> >
> > Luc AUDRAIN
> > _______________________________
> > Hachette Livre,
> > Direction Innovation et Technologie Numérique Head of Digitalization
> > Téléphone : 01 41 23 63 70 Mobile : 06 48 38 21 41 11, rue Paul Bert
> > 92247 Malakoff Cedex
>
>
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> GPG: 0x343F1A3D
> WebID: http://www.ivan-herman.net/foaf#me
>
>
Received on Tuesday, 3 June 2014 21:26:42 UTC