Calling for a massive revamp of Paged Media and GCPM

Hi there,

Our Web world quietly changed some time ago, when Web standards
started leaving the sole domain of the Web to reach new fields. One
that's particularly interesting to me is the world of ebooks.

Working on content editors, I have always wondered why the word
processors we have on the market need to define their own proprietary
formats. Before dropping Nvu and starting BlueGriffon, I pondered
writing not a new content editor for the Web but a new word processor,
strictly HTML+CSS+MathML+SVG-based. I eventually deferred that project
but it still crosses my mind from time to time because I know I _can_
do it.

Then I recently started something I absolutely need for my EPUB editor:
an importer for *.docx files. *.docx files are basically a zip
encapsulating xml-based metadata for a Word document and an Office Open
XML version of the document (I'm simplifying but I'm sure you get the
big picture; if you want to learn more, the spec is 5000+ pages long...).

The docx zip contains xml document instances for the main content,
the multiple headers and footers, the styles, the themes, etc.

Since the work of many, really many, book authors and publishers is
still almost entirely based (for better or worse) on Word formats,
being able to
import/translate a format cleanly exported by Word into an xml
serialization of html5 becomes a key factor for the success of
electronic books. Ebooks have even started leaving the field of books
to reach press, magazines, knowledge management, and more. And
html5+CSS are now massively used for slideshows.

From that perspective, CSS and html5 have a few major holes I would like
to discuss here:

1. the Paged Media and Generated Content for Paged Media specs do not
    allow content-rich headers and footers at this time. This most
    certainly comes from the fact that our browsers print web pages by
    letting the user specify, through the UI, the contents of 6 areas
    above and below the page content: the URL, the title, the page
    number, the date, etc. But no real rich content.
    But if you take an average document coming from a word processor,
    for instance a contract proposal or an invoice sent by my company,
    the headers and footers are MUCH more complex than that; they
    contain real, complex content that conveys much more information
    than just pagination or document-wide metadata. Our current CSS specs
    erroneously consider that the headers and footers of a page are only
    Generated Content from the document, and that they therefore belong
    to presentation and not content. They do not: they really are
    content and should live in the markup side of the document, as all
    our history in word processing easily demonstrates.
    With respect to that requirement, the generated content portions of
    the Paged Media specs seem to me completely outdated.

2. the Paged Media spec divides the margin of a page into 16 (!) areas
    where CSS can generate content from the markup through CSS rules
    only. At a time when we discuss slot generation through dedicated
    at-rules and flows of content, at a time when we have flexbox to
    nicely and precisely place data wherever we want, at a time when we
    have Grid Layout to finely divide a layout into slots, this seems to
    me a suboptimal and overly complex solution. It's unmaintainable from an
    author's perspective. It's clearly not enough to import a document
    coming from a word processor without dropping a lot of data living
    in headers and footers.

3. even if we do have header and footer elements in html5, CSS is
    currently too weak to allow authors to give them a rich presentation,
    including in paged or print media, for instance by allowing them to
    persist across the pages of a given section.

4. the GCPM module allows content "creation" into headers and footers
    through the 'content' property. But the functional notation content()
    defined in the same spec only allows retrieving the textual contents
    of a given element, without capturing its richness. This is far from
    enough. A much better approach would be to define, for instance, a
    page header as the 'flow-from' destination of elements carrying a
    'flow-into' property. We could also have a specific, very simple
    property declaring that the flow should persist from one page to the
    next ones unless a later page itself sends elements into the
    header's flow.

5. footnotes in the GCPM seem to me a tortured solution. I agree
    footnotes are an extremely complex problem. But wait, adding a
    footnote counter is easy and we have both ::before and ::after.
    If we flow footnotes into a footnote area defined as above in 4,
    we could use ::before as the counter reference that will stay with
    the footnote's prose and ::after as the footnote's source that
    remains in the main prose. All we need for it is a way of specifying
    that generated content does not flow with its parent element. And if
    your footnote is a link AND the target of that link, clicking on the
    linkified ::after will even take you to the footnote in the
    footnotes' area by pure magic. We don't need all the extra stuff the
    GCPM spec specifies.

6. similarly, bookmarks are defined by GCPM as being presentational. I
    disagree with that approach. A bookmark is clearly for me an
    annotation and annotations are content. We could use for bookmarks a
    mechanism totally similar to the one I outlined for footnotes above,
    with a slot and a flow. Simple, efficient, rich, clean.

7. the main content area of a page can easily be defined as the
    subtraction of all the flow areas defining headers, footers,
    footnotes, etc. from the page area (Cf. terminology in section
    3.1 of the Paged Media spec). It means that headers, footers,
    footnotes and friends can easily be defined as Exclusions to the
    page area. Of course, it is still also possible to define a flow
    for the main content of a document and send that flow to specified
    slots/areas/grid cells/... in pages.
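
To make the contrast in points 1 and 4 a bit more concrete, here is a
minimal sketch. The first half uses the text-only mechanism of the
current Paged Media and GCPM drafts as I understand them. The second
half is purely hypothetical: using 'flow-from' inside a page margin box
and the 'flow-persist' property name are assumptions of mine for
illustration only; they are not defined in any spec or implemented
anywhere, and the flow names are arbitrary.

```css
/* Today (Paged Media + GCPM drafts): only text captured from the
   document can reach a margin box; the element's markup is lost. */
h1 { string-set: chapter-title content(); }

@page {
  @top-center   { content: string(chapter-title); }
  @bottom-right { content: counter(page); }
}

/* HYPOTHETICAL sketch of the proposal in point 4. Not valid in any
   spec: 'flow-from' in a margin box and 'flow-persist' are assumed. */
section > header {
  flow-into: page-header;    /* send the rich header markup to a named flow */
}

@page {
  @top-center {
    flow-from: page-header;  /* hypothetical: the page header acts as a region */
    flow-persist: always;    /* hypothetical: keep this flow on the following
                                pages until a page sends new elements to it */
  }
}
```

The point of the second half is only that the header stays real content
in the markup, and CSS merely says where it is flowed on each page.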

In summary, Paged Media and Generated Content for Paged Media paved the
way. But they were never really implemented, except by YesLogic's
PrinceXML, and they now show their limits given the new industries
extensively using html and CSS. I am calling for a massive revamp of
these documents based on Regions, Flows, Slots, Exclusions, Grids and
the @page rule. We're just too far behind what live paged media really
need from CSS.


Received on Sunday, 13 January 2013 11:46:37 UTC