
Re: Prioritisation

From: Leonard Rosenthol <lrosenth@adobe.com>
Date: Thu, 6 Aug 2015 18:01:54 +0000
To: Bill McCoy <whmccoy@gmail.com>
CC: Johannes Wilm <johanneswilm@vivliostyle.com>, Kaveh Bazargan <kaveh@rivervalleytechnologies.com>, Dave Cramer <dauwhe@gmail.com>, "Richard Ishida" <ishida@w3.org>, W3C Digital Publishing Discussion list <public-digipub@w3.org>
Message-ID: <1081B840-6A74-4FA9-8224-5D206E2009C7@adobe.com>
Always nice when we agree, Bill.   So I’ll start with more places that I agree with you…

I do agree that many of the things that have been referred to as “interactive PDF” have found other outlets – such as multimedia and scripting.  And I agree that having such things be converted to PDF isn’t worth the effort at this time.

However, there are many other aspects concerning PDF production for non-printing contexts that you are overlooking and that are necessary (and even mandated by law).  For example, standards such as PDF/A (‘a’ conformance level) and PDF/UA require that a PDF be tagged/structured – a feature that (by design) maps well to HTML’s semantic elements.  But only a few of the tools out there will produce this structure in the output PDF – and even those that do don’t always do it in the same way.  And as long as governments and enterprises require/mandate PDF that conforms to one of those standards, it remains something that this group (and the industry as a whole) will need to support… (even if you only think it’s good for “replicating paper” :).
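To make the tagging point concrete, here is a minimal sketch of how HTML's semantic elements could line up with PDF structure types. The right-hand names (P, H1, L, LI, Table, Figure and so on) are standard structure types from the PDF specification; the particular pairing, the `HTML_TO_PDF_TAG` table, and the `outline` helper are illustrative assumptions, not what any existing converter does. It only prints an indented outline and assumes well-formed input with explicit closing tags.

```python
from html.parser import HTMLParser

# Illustrative pairing of HTML elements with PDF standard structure types.
# The structure type names are real; this particular mapping is an assumption.
HTML_TO_PDF_TAG = {
    "h1": "H1", "h2": "H2", "h3": "H3",
    "p": "P",
    "ul": "L", "ol": "L", "li": "LI",
    "table": "Table", "tr": "TR", "th": "TH", "td": "TD",
    "blockquote": "BlockQuote",
    "img": "Figure",
    "a": "Link",
}

# Elements that never have a closing tag, so they must not change the depth.
VOID_ELEMENTS = {"img", "br", "hr"}


class StructureOutline(HTMLParser):
    """Collects an indented outline of the PDF structure types implied by the HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        pdf_tag = HTML_TO_PDF_TAG.get(tag)
        if pdf_tag is None:
            return  # unmapped elements are skipped in this sketch
        self.lines.append("  " * self.depth + pdf_tag)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in HTML_TO_PDF_TAG and tag not in VOID_ELEMENTS:
            self.depth -= 1


def outline(html):
    """Return the structure outline for a well-formed HTML fragment."""
    parser = StructureOutline()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    sample = "<h1>Title</h1><ul><li><p>First item</p></li></ul><img src='fig.png'>"
    print("\n".join(outline(sample)))
```

A real tagged PDF also needs reading order, alternate text for figures, and a role map for any non-standard types, which is where the tools that do produce structure tend to diverge.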

Leonard

From: Bill McCoy
Date: Thursday, August 6, 2015 at 12:35 PM
To: Leonard Rosenthol
Cc: Johannes Wilm, Kaveh Bazargan, Dave Cramer, Richard Ishida, W3C Digital Publishing Discussion list
Subject: Re: Prioritisation

I agree with Leonard about the deficiencies in today's browser printing pipelines but wanted to add a couple things:

- EPUB print-on-demand solutions are starting to appear, I saw one being offered for the Japanese market last month at Tokyo International Book Fair. There has been interest expressed in an initiative on this by folks from major print/POD players (HP, Toppan, Dai Nippon Printing, Ingram) as well as the accessibility community. No active project is yet under way in IDPF/Readium but I anticipate there may be something soon and would love to have this coordinated with this group's activities. Today's prevalent solutions in the CSS formatting space (AntennaHouse and Prince) are proprietary so having something open source, built on a browser engine, and helping to advance the relevant open standards (esp. CSS) would seem helpful. I don't think this necessarily needs to wait for other parts of the EPUB-WEB vision to be realized so it could be a good candidate for near-term efforts that would yield rapid useful results.

- I think it would make most sense for this group to focus on OWP->PDF that is good for high-quality printing... I believe the "interactive PDF" has passed its sell-by date, and fundamentally trying to map HTML forms into PDF forms, HTML JS APIs into Acrobat scripting APIs etc. seems both unlikely to be fruitful and unlikely to be very useful since we already have EPUB and are moving in the direction of EPUB-WEB.  There are some areas of functionality missing from EPUB that are present in PDF that could be helpful to OWP in general and thus in scope for EPUB-WEB and thus this group. For example, digital signatures (in the legal sense). Working on filling these gaps seems better to me than worrying about PDF as anything other than its design center and primary use case as a replica of paper.

--Bill



On Thu, Aug 6, 2015 at 8:52 AM, Leonard Rosenthol <lrosenth@adobe.com> wrote:
>Yes, except that the HTML-to-PDF renderer present in browsers is used by more than just us book people,
>which gives it slightly higher chances of technical survival over the next few years.
>
Unfortunately, those converters produce output that is useless for anything other than printing (and in some cases, not even that).   All sense of semantics and non-static content has been lost and pagination is arbitrary and uncontrollable (which is what started this thread, IIRC).  It’s also quite unclear when the process should proceed – when are scripts “done” and the content “ready”?   We recently undertook a detailed product/technology comparison in this area, so this is not speculation but fact.

This (OWP->PDF) is one of the areas that has brought us back into active participation, as our customers and the industry as a whole are being damaged by the lack of standardization (or even implementation!) in this area.  There are a lot of great (draft!) specs out there that could potentially resolve some of these issues, but it will require a group (such as this one) to pick the one(s) that we feel are solving the correct problems, validate that the concerns of all constituents (not just book and magazine publishers) are met, and work to see them implemented in the key UA technologies.  Not a “quick fix” problem – but the sooner we start, the sooner it will be resolved.

Leonard

From: Johannes Wilm
Date: Thursday, August 6, 2015 at 11:40 AM
To: Kaveh Bazargan
Cc: Dave Cramer, Richard Ishida, W3C Digital Publishing Discussion list
Subject: Re: Prioritisation
Resent-From: <public-digipub@w3.org>
Resent-Date: Thursday, August 6, 2015 at 11:41 AM



On Thu, Aug 6, 2015 at 4:49 PM, Kaveh Bazargan <kaveh@rivervalleytechnologies.com> wrote:


On 6 August 2015 at 14:58, Johannes Wilm <johanneswilm@vivliostyle.com> wrote:


On Thu, Aug 6, 2015 at 1:05 PM, Kaveh Bazargan <kaveh@rivervalleytechnologies.com> wrote:
Hi Johannes

I am flattered by your comprehensive reply. My comments regarding TeX are below, but I might not have explained myself well...

I am not suggesting anyone should use TeX code, or even be aware that TeX/LaTeX is involved. The point is that it is a back end automated page make up engine. So XML/HTML can be converted to PDF very fast and at very high quality with the TeX engine invisibly doing the work.

Ok, so you are proposing converting XML to LaTeX and Epub/HTML on a backend system? The main problem with that is the conversion mechanisms just about always need human intervention and that it's hard to impossible to get XML input files from authors.

But conversion of XML to a fixed layout view is the same as HTML to fixed layout, which is the aim of this group, is it not? Would that not need the same human intervention? The human intervention is needed because publishers want the same look as a journal they have had for decades. With a little modification the process can be entirely automated.

Yes, except that the HTML-to-PDF renderer present in browsers is used by more than just us book people, which gives it slightly higher chances of technical survival over the next few years.





Here are my points, distilled:

  *   I like the idea of HTML/CSS/Javascript creating fixed pages to be read on screen with all kinds of interactivity
  *   I still question trying to create footnotes, floating figures and tables, and typographic niceties, which have primarily evolved for print on paper, in the browser. To me, floating items only apply to print, so interactivity is not needed. Why not pass the info to an engine that knows how to do it well?

Is there not also a point in having footnotes and floating figures in ebooks (and have those still work when the user changes the font size level)?

Floats are a matter of opinion. I would say no, I don't want to flick to the next page and back again. I want to hover or click and fig pops up. Floats have been needed because of the obvious limitations of print. My preference for footnotes is similar, i.e. click or touch screen to get more info. But that can be a user's decision. We should have renderers that produce whatever a user prefers.

Agreed, this should be user preference. I think on scientific ebooks, for example, I would still like real footnotes at the bottom and I think the lack of good footnote support is why many still use PDFs instead of epubs for certain types of texts.


  *   The problems of floating items, complex math, large footnotes that need to break across pages, and many other complex pagination issues have already been solved in TeX. These are not trivial problems and I worry about this working group reinventing the wheel by starting to specify the basics of pagination from scratch. In my opinion, in the end the only way to solve the problem is to rewrite TeX in JavaScript!

I have also been thinking of LaTeX in JavaScript. But as far as I can tell, in 2015 that would still be too slow. TeXLive is a few GB large, and if the user has to wait for a few GB to download before the page is rendered, that likely wouldn't work. In a few years, when a few GB is nothing and processors are faster, this may be a viable alternative.

You got it wrong here, Johannes. ;-) TeXLive contains every possible style file you might want – 10,000s. The basic TeX compiler is only 500K! Remember it is 35 years old, so had to run on mainframes, which is why it is fast. Even with a few basic style files I don't think it would exceed 2 MB for an automated pagination system.

Right. But you will likely need some of those packages, if you will want to let people render their documents.

The use case I had was that I had people who were to write scientific articles and I was thinking of how I could get them to render the documents themselves, without having to go through the process of installing LaTeX and doing weird things on the command line. So in that case I couldn't know exactly what packages my users would need.

So maybe this is indeed possible if one locks everyone down to a small subset of everything. I just haven't seen that in action ever. Linux always seems to ask me to just install another 600 MB of files whenever I try to compile the file of someone else.


Forgive me, but storing in several formats is absolutely the worst thing you can do. What if there is a difference between the files? Which one is right? Who knows? No one! Already there is a problem brewing. Go to any open access journal (PeerJ, PLOS, Frontiers etc) and pick a paper. You will find the paper has a DOI – the definitive version of record. But which *format* is the version of record: the XML, the HTML or the PDF? None of the publishers have the courage to nominate one!! Of course they all know it should be the XML but only the PDF has been proofread. No one looks at XML – except me!!

I know I know. Among software developers we have similar sayings when it comes to data.

And still: If you have the same program saved in 1996 both on a CD-ROM and on a 3.5" disc, you can read both and they differ, you may have a problem of figuring out which one is "better". But if you only save to one of them, you may end up not being able to access it at all. So it ends up being wiser to make multiple copies in different formats anyway, even if it may put you in that dilemma.

I assume that is the same reason people maintain not only digital copies, but also paper copies (on special paper) that are then stored on different continents.


I think that the excitement we are all experiencing with HTML (including myself) is going to have bad consequences in future unless we set some really firm rules. It is now the 350th anniversary of the first scholarly journal. We can still read it with no ambiguity. So it has been amazingly future-proof. 350 years from now, will scholars be able to read our scientific literature without ambiguity? I doubt it!

It's not impossible. But it's not just HTML. It's everything about our times that requires fast changes, which means that none of us can be sure we will even be able to read this conversation in 5 years' time.



--
Kaveh Bazargan
Director
River Valley Technologies
@kaveh1000
+44 7771 824 111
www.rivervalleytechnologies.com
www.bazargan.org


Received on Thursday, 6 August 2015 18:02:28 UTC
