RE: Prioritisation from Bill Kasdorf on 2015-08-06 (public-digipub@w3.org from August 2015)

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Thu, 6 Aug 2015 14:53:09 +0000
To: Kaveh Bazargan <kaveh@rivervalleytechnologies.com>, Johannes Wilm <johanneswilm@vivliostyle.com>
CC: Dave Cramer <dauwhe@gmail.com>, Richard Ishida <ishida@w3.org>, "W3C Digital Publishing Discussion list" <public-digipub@w3.org>
Message-ID: <CO2PR06MB572E774F8B70458D3069122DF740@CO2PR06MB572.namprd06.prod.outlook.com>
Just pointing out that there are many sophisticated pagination engines out there, TeX is just one example. Very sophisticated, automated, complex page makeup has been done on proprietary systems since the 1990s. Several such systems are still currently in wide use. Many of the issues that we're addressing in the context of the Open Web Platform today were considered solved problems decades ago in those systems. That doesn't mean we don't want to be able to provide that kind of sophisticated page makeup natively on the Web and in Web-based technologies, _without_ requiring separate software systems (with their attendant specializations, learning curves, system implementations, maintenance, etc.)—Bill Kasdorf

From: Kaveh Bazargan [mailto:kaveh@rivervalleytechnologies.com]
Sent: Thursday, August 06, 2015 7:06 AM
To: Johannes Wilm
Cc: Dave Cramer; Richard Ishida; W3C Digital Publishing Discussion list
Subject: Re: Prioritisation

Hi Johannes

I am flattered by your comprehensive reply. My comments regarding TeX are below, but I might not have explained myself well...

I am not suggesting anyone should use TeX code, or even be aware that TeX/LaTeX is involved. The point is that it is a back end automated page make up engine. So XML/HTML can be converted to PDF very fast and at very high quality with the TeX engine invisibly doing the work.
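
To make the "invisible back end" idea concrete, here is a minimal sketch of the first half of such a pipeline: XML goes in, and LaTeX source comes out without the author ever seeing it. The element names (article, title, p, fn) and the function names are hypothetical, chosen only for illustration; this is not a description of any existing production system.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, mapped to safe replacements.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(text):
    # Escape one character at a time so replacements are never re-escaped.
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in (text or ""))


def inline(elem):
    # Render mixed content; hypothetical <fn> children become footnotes.
    parts = [escape(elem.text)]
    for child in elem:
        if child.tag == "fn":
            parts.append(r"\footnote{" + inline(child) + "}")
        else:
            parts.append(inline(child))
        parts.append(escape(child.tail))
    return "".join(parts)


def xml_to_latex(xml_source):
    # Assumed schema: <article> holding <title> and <p> children.
    root = ET.fromstring(xml_source)
    lines = [r"\documentclass{article}", r"\begin{document}"]
    for elem in root:
        if elem.tag == "title":
            lines.append(r"\section*{" + inline(elem) + "}")
        elif elem.tag == "p":
            lines.append(inline(elem))
            lines.append("")  # blank line ends the paragraph in TeX
    lines.append(r"\end{document}")
    return "\n".join(lines)


source = "<article><title>A &amp; B</title><p>Text<fn>Note 50%</fn> more.</p></article>"
print(xml_to_latex(source))
```

The resulting string would then be handed to a TeX engine (for example pdflatex) to produce the PDF; that step is left out here because it depends on the local TeX installation.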

Here are my points, distilled:

  *   I like the idea of HTML/CSS/JavaScript creating fixed pages to be read on screen with all kinds of interactivity.
  *   I still question whether footnotes, floating figures and tables, and typographic niceties that evolved primarily for print on paper should be done in the browser. To me, floating items apply only to print, so interactivity is not needed. Why not pass the information to an engine that knows how to do it well?
  *   The problems of floating items, complex math, large footnotes that need to break across pages, and many other complex pagination issues have already been solved in TeX. These are not trivial problems, and I worry about this working group reinventing the wheel by starting to specify the basics of pagination from scratch. In my opinion, in the end the only way to solve the problem is to rewrite TeX in JavaScript!
  *   Another problem I have is holding all our information in HTML as opposed to XML. I worry about how clean and semantic the content will be. After all, HTML was designed to be forgiving, so even bad content will look good. We are all excited about the amazing gizmos in HTML and how the browser is the new publishing model, but what about in 10, 50 or 100 years' time? Will these HTML files still make sense? What happens when the browser is superseded? I am all for HTML tools and interactivity, but I suggest the definitive content should be XML, not HTML.

On 5 August 2015 at 23:34, Johannes Wilm <johanneswilm@vivliostyle.com> wrote:
Kaveh's email just reached me now, so I have only seen other parts of the discussion so far.

On Tue, Aug 4, 2015 at 5:55 PM, Kaveh Bazargan <kaveh@rivervalleytechnologies.com> wrote:
Forgive me for a very basic question, but it is a devil's advocate type of question. And if this is not the place to ask this perhaps you can direct me to any relevant discussions.

My very basic question is, why do we need to "paginate" in the browser in the first place? Why not keep the browser for reflowing and interactive text, which is what it is good at, and use a standard mark-up pagination system (TeX/LaTeX would be my choice) to do what it is good at? If another system has already solved problems like footnotes and floating figures, what exactly is the drive to reinvent that in the browser?

I am myself a LaTeX person and for a lot of things I would agree with you.

However, there are some good reasons to do everything in browsers:

A) You can have one source file for everything and don't need to do conversion

B) EPUB is already tied to HTML, so using LaTeX as the universal format will likely not work in the long run

C) Most people have a browser installed already, so you don't need to have them install anything else on their machine

D) Browsers running extra layout JavaScript can be made to render more or less complex layouts of the same sources. So for example you may say that you just want to show the text and put the footnotes at the bottom in a single pass. The layout will not be perfect, but on a mobile device that will give you a quick result. But on a server that is to produce a PDF out of the same source document, you can have it use a 7-pass process and add kerning, microtyping, etc.

E) LaTeX document editing is not exactly easy. Many of the LaTeX documents I wrote 10-15 years ago I cannot simply compile on my current laptop with the latest TeXLive installed. And most of those are just 5-10 page long midterm papers for History, Literature or English language (so no advanced formulas, just citations and plain text). For my books I tried to add a few minor extras (such as a small flag icon that would be added before and after the chapter titles), and when I need to rerender them after not having rendered them for a year or two, I generally have to spend about a day on various online discussion forums to try to figure out what has changed in the latest versions of the renderers and how I can get around those issues. I am not entirely sure, but I imagine that this would have been easier had the sources been in HTML, as the renderer would at least render everything that it did understand instead of the all-or-nothing approach of LaTeX.

Actually TeX is the fastest page renderer. Standard TeX files create pages at over 100 pages a second on a normal laptop, including complex math and footnotes. And I am surprised you had problems running old files. You must have been using style files which had not been maintained. The TeX engine has been frozen for 30 years!

But for this discussion most of that is irrelevant I think.


I wonder if point D is entirely clear to everyone. When CSS features are discussed, one of the most important points is of course whether browsers will implement them. Features that are so complex that the rendering of the contents of a page will take as long as it takes for a LaTeX renderer to create a PDF will likely not make it, because speed is more important than a high feature level for browsers, for which page-based features are just a side project. But some will need such complexity for rendering really great looking output (for example for print output).

From browsers probably the best one can ever expect is that they will provide fast and simple page layout. But if one has the needed primitives to allow for more complex solutions in browsers using JavaScript, then one can still create those sites that spend 5 minutes on rendering the final output.



On Tue, Aug 4, 2015 at 8:03 PM, Kaveh Bazargan <kaveh@rivervalleytechnologies.com> wrote:


On 4 August 2015 at 18:50, Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:
A quick clarification. I am quite sure that in her e-mail Deborah is using the term "pagination" to mean "maintaining a record in the digital file of where the page breaks occur in the paginated version of record." That's essential to accessibility and other useful things as well (citations, cross references, indexes, etc. in a world in which print is still considered the version of record and references to its page breaks are common.) That's not the same as making the _rendered pages_ in the digital file replicate those in the print.—Bill K

[...]


But Bill, how do we make the page breaks in the electronic version the same as those of the print pages unless we have the same elements and layout? For instance if a floating figure is missing from an electronic page, do we just make a short page and break where the paper copy breaks? That would lead to very ugly results.


The end device should be able to figure out both what the page numbers would be in the normal sized output AND what they are on the actual device. All without having to add extra metadata about where non-explicit page breaks occur.

So basically it renders the pages twice:

A) Once in the original size. This can be done in a way so the end user doesn't actually have to see it. The page numbers are retrieved from this version. A could be made to be exactly equal to the print version (or the other way round: in order to create the print version, one simply prints out A).

B) A second time for the user to see it in the size appropriate for the zoom level and screen size.

There are various ways this could be presented to the user in the User Interface. For example the "Jump to page number" function could be using the page numbers retrieved from A but then jump to the correct location in B. And the page numbers shown in the corner of the pages could also be the ones retrieved from A (that would mean several pages in a row could be displayed with the same page number, and one B page could have two page numbers if it happens to span the break between two A pages).
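
The A/B mapping described above could be sketched roughly as follows. This assumes each rendering pass reports, for every page, the content offset (say, a character index into the source) at which that page starts; that representation and all function names are assumptions made for illustration, not part of any existing browser API.

```python
from bisect import bisect_right


def page_containing(starts, offset):
    # 0-based index of the page whose content range holds `offset`.
    # `starts` is the sorted list of offsets at which each page begins.
    return bisect_right(starts, offset) - 1


def reference_pages_shown(a_starts, b_starts, b_index, doc_length):
    # 1-based reference (A) page numbers covered by device (B) page b_index.
    first_offset = b_starts[b_index]
    # The last offset on this device page is one before the next page starts.
    if b_index + 1 < len(b_starts):
        last_offset = b_starts[b_index + 1] - 1
    else:
        last_offset = doc_length - 1
    first = page_containing(a_starts, first_offset)
    last = page_containing(a_starts, last_offset)
    return list(range(first + 1, last + 2))


def jump_to_reference_page(a_starts, b_starts, page_number):
    # 0-based device (B) page on which reference page `page_number` begins.
    return page_containing(b_starts, a_starts[page_number - 1])


# Pass A: three reference pages over a 3000-unit document.
a_starts = [0, 1000, 2000]
# Pass B: a small screen needs eight pages for the same content.
b_starts = [0, 400, 800, 1200, 1600, 2000, 2400, 2800]

print(reference_pages_shown(a_starts, b_starts, 0, 3000))
print(reference_pages_shown(a_starts, b_starts, 2, 3000))
print(jump_to_reference_page(a_starts, b_starts, 2))
```

In this toy data the first two B pages both carry reference page number 1, the third B page spans the break and so carries both 1 and 2, and "Jump to page 2" lands on that third B page.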




--
Kaveh Bazargan
Director
River Valley Technologies
@kaveh1000
+44 7771 824 111
www.rivervalleytechnologies.com<http://www.rivervalleytechnologies.com/>
www.bazargan.org<http://www.bazargan.org/>
Received on Thursday, 6 August 2015 14:53:41 UTC