Re: Prioritisation from Bill McCoy on 2015-08-09 (public-digipub@w3.org from August 2015)

From: Bill McCoy <whmccoy@gmail.com>
Date: Sun, 9 Aug 2015 09:40:12 -0700
To: Leonard Rosenthol <lrosenth@adobe.com>
Cc: Johannes Wilm <johanneswilm@vivliostyle.com>, Kaveh Bazargan <kaveh@rivervalleytechnologies.com>, Dave Cramer <dauwhe@gmail.com>, Richard Ishida <ishida@w3.org>, W3C Digital Publishing Discussion list <public-digipub@w3.org>
Message-ID: <CAJ0DDbBA7x3qYb=NTGhyufyBVAFOE4YO8cdQqd3u1ALTKUV-rA@mail.gmail.com>
Hi Leonard,

You're right, I had not considered the a11y aspects of PDF, which I agree
have more staying power than legacy "interactive PDF" features, and it is
certainly a valid requirement that export from HTML/OWP to PDF should
preserve this information.

As far as the priority of this group working on supporting that requirement
though... the a11y community is deeply unhappy with PDF: while in principle
many a11y requirements can be supported in PDF, in practice the pre-typeset
nature of PDF and inconsistent support of these features by PDF generators
(including even Adobe products!) results in poor experiences. The a11y
community has largely moved on from pushing for accessible PDF to promoting
EPUB 3 since it is a much superior format to PDF for meeting a11y
requirements, and commercial publishers are moving away from PDF. So I
see effort on supporting HTML->PDF a11y as noble and useful, but I think
this group would be better off focusing its limited cycles on helping
to improve OWP for publishing and helping to realize the EPUB-WEB vision
sooner rather than later so that the packaged form of OWP content is what
delivers on a11y mandates (and is what is required/mandated by governments
and enterprises).

In any case, I'm not sure improving a11y of PDF generated from website
content requires any W3C-level work that would be in the purview of this
group to help advance; it seems to me more of an implementation
project than a standards project. It could be, for example, part of a Readium
project on EPUB POD (since we would see Braille and aural output as logical
POD outputs from EPUB), and/or work done in Mozilla/WebKit/Blink
codebases.

--Bill


On Thu, Aug 6, 2015 at 11:01 AM, Leonard Rosenthol <lrosenth@adobe.com>
wrote:

> Always nice when we agree, Bill.   So I’ll start with more places that I
> agree with you…
>
> I do agree that many of the things that have been referred to as
> “interactive PDF” have found other outlets – such as multimedia and
> scripting.  And I agree that having such things be converted to PDF isn’t
> worth the effort at this time.
>
> However, there are many other aspects concerning PDF production for
> non-printing contexts that you are overlooking and that are necessary (and
> even mandated by laws).  For example, standards such as PDF/A (‘a’
> conformance level) and PDF/UA require that a PDF be tagged/structured – a
> feature that (by design) maps well to HTML’s semantic elements.  But only
> a few of the tools out there will produce this structure in the output PDF
> – and even those that do don’t always do it in the same way.  And as long
> as governments and enterprises require/mandate PDF, and in one of those
> standards, it remains something that this group (and the industry as a
> whole) will need to support…(even if you only think it’s good for
> “replicating paper” :).
>
> Leonard
>
> From: Bill McCoy
> Date: Thursday, August 6, 2015 at 12:35 PM
> To: Leonard Rosenthol
> Cc: Johannes Wilm, Kaveh Bazargan, Dave Cramer, Richard Ishida, W3C
> Digital Publishing Discussion list
> Subject: Re: Prioritisation
>
> I agree with Leonard about the deficiencies in today's browser printing
> pipelines but wanted to add a couple things:
>
> - EPUB print-on-demand solutions are starting to appear, I saw one being
> offered for the Japanese market last month at Tokyo International Book
> Fair. There has been interest expressed in an initiative on this by folks
> from major print/POD players (HP, Toppan, Dai Nippon Printing, Ingram) as
> well as the accessibility community. No active project is yet under way in
> IDPF/Readium but I anticipate there may be something soon and would love to
> have this coordinated with this group's activities. Today's prevalent
> solutions in the CSS formatting space (AntennaHouse and Prince) are
> proprietary so having something open source, built on a browser engine, and
> helping to advance the relevant open standards (esp. CSS) would seem
> helpful. I don't think this necessarily needs to wait for other parts of
> the EPUB-WEB vision to be realized so it could be a good candidate for
> near-term efforts that would yield rapid useful results.
>
> - I think it would make most sense for this group to focus on OWP->PDF
> that is good for high-quality printing... I believe the "interactive PDF"
> has passed its sell-by date and fundamentally trying to map HTML forms
> into PDF forms, HTML JS APIs into Acrobat scripting APIs etc. seems both
> unlikely to be fruitful and unlikely to be very useful since we already
> have EPUB and are moving in the direction of EPUB-WEB.  There are some
> areas of functionality missing from EPUB that are present in PDF that could
> be helpful to OWP in general and thus in scope for EPUB-WEB and thus this
> group. For example, digital signatures (in the legal sense). Working on
> filling these gaps seems better to me than worrying about PDF as anything
> else than its design center and primary use case as a replica of paper.
>
> --Bill
>
>
>
> On Thu, Aug 6, 2015 at 8:52 AM, Leonard Rosenthol <lrosenth@adobe.com>
> wrote:
>
>> >Yes, except that the HTML-to-PDF renderer present in browsers is used by
>> more than just us book people,
>> >which gives it slightly higher chances of technical survival over the
>> next few years.
>> >
>> Unfortunately, those converters produce output that is useless for
>> anything other than printing (and in some cases, not even that).   All
>> sense of semantics and non-static content have been lost and pagination is
>> arbitrary and uncontrollable (which is what started this thread, IIRC).
>> It’s also quite unclear when the process should proceed – when are scripts
>> “done” and the content is “ready”.   We recently undertook a detailed
>> product/technology comparison in this area, so this is not speculation but
>> fact.
>>
>> This (OWP->PDF) is one of the areas that has brought us back into active
>> participation as our customers and the industry as a whole are being damaged
>> by the lack of standardization (or even implementation!) in this area.
>> There are a lot of great (draft!) specs out there that could potentially
>> resolve some of these issues, but it will require a group (such as this
>> one) to pick the one(s) that we feel are solving the correct problems,
>> validate that the concerns of all constituents (not just book and magazine
>> publishers) are met, and work to see them implemented in the key UA
>> technologies.  Not a “quick fix” problem – but the sooner we start, the
>> sooner it will be resolved.
>>
>> Leonard
>>
>> From: Johannes Wilm
>> Date: Thursday, August 6, 2015 at 11:40 AM
>> To: Kaveh Bazargan
>> Cc: Dave Cramer, Richard Ishida, W3C Digital Publishing Discussion list
>> Subject: Re: Prioritisation
>> Resent-From: <public-digipub@w3.org>
>> Resent-Date: Thursday, August 6, 2015 at 11:41 AM
>>
>>
>>
>> On Thu, Aug 6, 2015 at 4:49 PM, Kaveh Bazargan <
>> kaveh@rivervalleytechnologies.com> wrote:
>>
>>>
>>>
>>> On 6 August 2015 at 14:58, Johannes Wilm <johanneswilm@vivliostyle.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Aug 6, 2015 at 1:05 PM, Kaveh Bazargan <
>>>> kaveh@rivervalleytechnologies.com> wrote:
>>>>
>>>>> Hi Johannes
>>>>>
>>>>> I am flattered by your comprehensive reply. My comments regarding TeX
>>>>> are below, but I might not have explained myself well...
>>>>>
>>>>> I am not suggesting anyone should use TeX code, or even be aware that
>>>>> TeX/LaTeX is involved. The point is that it is a back end automated page
>>>>> make up engine. So XML/HTML can be converted to PDF very fast and at very
>>>>> high quality with the TeX engine invisibly doing the work.
>>>>>
>>>>
>>>> Ok, so you are proposing converting XML to LaTeX and Epub/HTML on a
>>>> backend system? The main problem with that is the conversion mechanisms
>>>> just about always need human intervention and that it's hard to impossible
>>>> to get XML input files from authors.
>>>>
>>>
>>> But conversion of XML to a fixed layout view is the same as HTML to fixed
>>> layout, is it not, which is the aim of this group? Would that not need the
>>> same human intervention? The human intervention is needed because
>>> publishers want the same look as a journal they have had for decades. With
>>> a little modification the process can be entirely automated.
>>>
>>
>> Yes, except that the HTML-to-PDF renderer present in browsers is used by
>> more than just us book people, which gives it slightly higher chances of
>> technical survival over the next few years.
>>
>>
>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Here are my points, distilled:
>>>>>
>>>>>    - I like the idea of HTML/CSS/Javascript creating fixed pages to
>>>>>    be read on screen with all kinds of interactivity
>>>>>    - I still question trying to create footnotes, floating figures
>>>>>    and tables, and typographic niceties which have primarily evolved for print
>>>>>    on paper, being done in the browser. To me, floating items only apply to
>>>>>    print, so interactivity is not needed. Why not pass the info to an
>>>>>    engine that knows how to do it well?
>>>>>
>>>> Is there not also a point in having footnotes and floating figures in
>>>> ebooks (and have those still work when the user changes the font size
>>>> level)?
>>>>
>>>
>>> Floats are a matter of opinion. I would say no, I don't want to flick to
>>> the next page and back again. I want to hover or click and fig pops up.
>>> Floats have been needed because of the obvious limitations of print. My
>>> preference for footnotes is similar, i.e. click or touch screen to get more
>>> info. But that can be a user's decision. We should have renderers that
>>> produce whatever a user prefers.
>>>
>>
>> Agreed, this should be user preference. I think on scientific ebooks, for
>> example, I would still like real footnotes at the bottom and I think the
>> lack of good footnote support is why many still use PDFs instead of epubs
>> for certain types of texts.
>>
>>
>>>
>>>>>    - The problem of floating items, complex math, large footnotes
>>>>>    that need to break across pages, and many other complex pagination problems
>>>>>    have already been solved in TeX. These are not trivial problems and I worry
>>>>>    about this working group reinventing the wheel, by starting to specify the
>>>>>    basics of pagination from scratch. In my opinion, in the end the only way
>>>>>    to solve the problem is to rewrite TeX in JavaScript!
>>>>>
>>>>>
>>>> I have also been thinking of LaTeX in Javascript. But as far as I can
>>>> tell, in 2015 that would still be too slow. TeXLive is a few GB large,
>>>> and if the user should wait for a few GB to download before the page is
>>>> rendered, that likely wouldn't work. In a few years, when a few GB is
>>>> nothing and processors are faster, this may be a viable alternative.
>>>>
>>>
>>> You got it wrong here, Johannes. ;-) TeXLive contains every possible
>>> style file you might want – 10,000s. The basic TeX compiler is only 500K!
>>> Remember it is 35 years old, so had to run on mainframes, which is why it
>>> is fast. Even with a few basic style files I don't think it would exceed
>>> 2 MB for an automated pagination system.
>>>
>>
>> Right. But you will likely need some of those packages, if you will want
>> to let people render their documents.
>>
>> The use case I had was that I had people who were to write scientific
>> articles and I was thinking of how I could get them to render the documents
>> themselves, without having to go through the process of installing LaTeX
>> and doing weird things on the command line. So in that case I couldn't know
>> exactly what packages my users would need.
>>
>> So maybe this is indeed possible if one locks everyone down to a small
>> subset of everything. I just haven't seen that in action ever. Linux always
>> seems to ask me to just install another 600 MB of files whenever I try to
>> compile the file of someone else.
>>
>>
>>
>>> Forgive me, but storing in several formats is absolutely the worst thing
>>> you can do. What if there is a difference between the files? Which one is
>>> right? Who knows? No one! Already there is a problem brewing. Go to any
>>> open access journal (PeerJ, Plos, Frontiers etc) and pick a paper. You will
>>> find the paper has a DOI – the definitive version of record. But which
>>> *format* is the version of record: the XML, the HTML or the PDF? None of
>>> the publishers have the courage to nominate one!! Of course they all know
>>> it should be the XML but only the PDF has been proofread. No one looks at
>>> XML – except me!!
>>>
>>
>> I know I know. Among software developers we have similar sayings when it
>> comes to data.
>>
>> And still: If you have the same program saved in 1996 both on a CD-ROM
>> and on a 3.5" disc, and you can read both and they differ, you may have a
>> problem of figuring out which one is "better". But if you only save to one
>> of them, you may end up not being able to access it at all. So it ends up
>> being wiser to make multiple copies in different formats anyway, even if it
>> may put you in that dilemma.
>>
>> I assume that is the same reason people not only maintain digital copies,
>> but also paper copies (on special paper) that are then stored on different
>> continents.
>>
>>
>>>
>>> I think that the excitement we are all experiencing with HTML (including
>>> myself) is going to have bad consequences in future unless we set some
>>> really firm rules. It is now the 350th anniversary of the first scholarly
>>> journal. We can still read it with no ambiguity. So it has been amazingly
>>> future-proof. 350 years from now, will scholars be able to read our
>>> scientific literature without ambiguity? I doubt it!
>>>
>>
>> It's not impossible. But it's not just HTML. It's everything about our
>> times that requires fast changes which means that none of us can be sure we
>> will even be able to read this conversation in 5 years' time.
>>
>>
>>
>>> --
>>> Kaveh Bazargan
>>> Director
>>> River Valley Technologies
>>> @kaveh1000
>>> +44 7771 824 111
>>> www.rivervalleytechnologies.com
>>> www.bazargan.org
>>>
>>
>>
>
Received on Sunday, 9 August 2015 16:40:42 UTC