Re: Converting html to pdf with java

From: David Woolley <forums@david-woolley.me.uk>
Date: Sat, 11 Feb 2012 10:49:32 +0000
Message-ID: <4F3647BC.20103@david-woolley.me.uk>
To: w3c-wai-ig@w3.org
Adam Cooper wrote:

> The question is not whether it is possible, but why would you bother. 
>  The majority of PDFs on the web are derived from electronic source 
> documents, do not utilise the content security features offered by the 
> platform, and creating accessible PDFs is time-consuming, incurs time 

In principle, mechanical construction of tagged PDF from HTML should be 
very easy, once you have a PostScript renderer for HTML, as tagged PDF 
is essentially a standard PDF overlaid with a description of the 
corresponding HTML4 structure.  It is marginally more likely that an 
HTML original will have proper semantic markup than that a Word document 
has been styled properly (i.e. both are rather unlikely in the real world).
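
As a rough illustration of that overlay, and not a description of any 
existing tool, the sketch below walks an HTML fragment and prints the 
tagged PDF structure types that would be overlaid on the rendered page. 
The mapping table is my own assumption, a small subset based on the 
standard structure types in the PDF specification, and the names 
HTML_TO_PDF_TAG, StructureOutline and outline are hypothetical.

```python
from html.parser import HTMLParser

# Assumed, illustrative mapping from HTML4 element names to PDF standard
# structure types (ISO 32000-1). A real converter would need many more
# entries, plus intermediate types such as LBody inside list items.
HTML_TO_PDF_TAG = {
    "h1": "H1", "h2": "H2", "h3": "H3", "h4": "H4", "h5": "H5", "h6": "H6",
    "p": "P", "blockquote": "BlockQuote",
    "ul": "L", "ol": "L", "li": "LI",
    "table": "Table", "tr": "TR", "th": "TH", "td": "TD",
    "a": "Link", "img": "Figure",
}

# Elements that never have an end tag, so they must not open a nesting level.
VOID_ELEMENTS = {"img"}


class StructureOutline(HTMLParser):
    """Collects an indented outline of PDF structure types for an HTML input."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        pdf_tag = HTML_TO_PDF_TAG.get(tag)
        if pdf_tag is None:
            return  # purely presentational or unmapped element: no structure node
        self.lines.append("  " * self.depth + pdf_tag)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Assumes explicit end tags; HTML4's optional </p> and </li> would
        # need extra handling that this sketch leaves out.
        if tag in HTML_TO_PDF_TAG and tag not in VOID_ELEMENTS:
            self.depth = max(0, self.depth - 1)


def outline(html):
    """Return the structure outline for an HTML string as a list of lines."""
    parser = StructureOutline()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    sample = "<h1>Title</h1><ul><li>one</li><li>two</li></ul><p>See <a href='#'>this</a></p>"
    print("\n".join(outline(sample)))
```

The point of the sketch is that only semantically marked up elements 
yield structure nodes; a document built from presentational markup 
alone would produce an empty outline, however good it looks on the page.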

> and monetary costs, and is beyond the skill level of most casual content 
> creators, so I struggle to find compelling reasons why there is a need 

As is creating accessible HTML.  It doesn't take a lot of skill, but 
authors make the task difficult by concentrating on presentation.  The 
skill is in creating the accessible document whilst maintaining the 
intended visual appearance.

> to use PDF at all, especially when there are tools and methods in 
> existing non-proprietary technologies such as (X)HTML, CSS, and JS etc. 
> which offer comparative content securing and (print) formatting 
> functionality.   

Defeating the security features of PDF takes a certain amount of 
technical skill and/or specialist tools.  Defeating the equivalent in 
HTML is simply a case of turning off scripting.

As to formatting, my view is that most of the accessibility problems 
with real world HTML come from trying to treat it as a page description 
language.  Using tagged PDF would at least be honest, and, because one 
doesn't have to worry about the constraints imposed by a structural 
semantics language when creating the presentation, ought to produce much 
more resilient documents.  Of course, this is an argument for not using 
HTML as an intermediate format.

> 
> *From:* Tanguy.Loheac@sanofi.com [mailto:Tanguy.Loheac@sanofi.com]

> anyone had the chance to expirement a java library that would convert 
> accessible (x)html page to an accessible pdf document?

The Java constraint is too restrictive.  I think all pure-Java HTML 
renderers have reached a dead end.  The only dedicated HTML to PostScript 
tool (as used to create the HTML4 specifications in PDF) was in Perl, 
but that was written before tagged PDF.

A tool to create tagged PDF from HTML really needs to be based on one of 
the major HTML rendering engines, which means it should be in C or C++.

-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
Received on Saturday, 11 February 2012 10:50:09 GMT
