Re: HTML -> PS converter?

Torgeir Veimo (Torgeir.Veimo@nr.no)
Mon, 11 Jul 94 14:26:31 +0200


To: Multiple recipients of list <www-html@www0.cern.ch>
Subject: Re: HTML -> PS converter?
Date: Mon, 11 Jul 94 14:26:31 +0200
From: Torgeir Veimo (Sommer stud) <Torgeir.Veimo@nr.no>
Message-Id: <"boeygen.nr.682:11.06.94.12.26.37"@nr.no>


> Yes, HTML to PostScript.  

Finally, someone other than me sees the need for such a utility!

Maybe this list isn't the most appropriate forum for discussing such a utility.
Drop me a note if you think so.

> In a couple of situations I've been involved in, in the production 
> of large HTML documents, the need has arisen to non-interactively produce 
> printed output from existing HTML source.  This way, the documents 
> can be easily edited, revised, and discussed, even with people who 
> don't have access to any networks.

Not to mention that one usually doesn't want to read a long text on-line.

> I realize that Mosaic has a "save as postscript" option, but it's slow
> to use if you're not browsing, and just need output quickly.

Besides, its output is extremely primitive, in that the text is formatted
against the screen metrics, not the printer metrics (font type, page width,
font metrics etc.). It doesn't give much choice when it comes to formatting
the text in different ways either (for example, how do I print this document
using my printer's Palatino fonts, or whatever your favourite font is?).

> Ideally, I'd like a utility that will take a file, produce PostScript
> output based on the file's HTML markup, and, ideally, follow links in
> that file to other HTML files, recursively produce the PostScript output,
> etc., without direct human interaction except to start it running.

Well, you are touching on the question of how to structure a document for
both online and offline reading. How are different pages combined into a
single document? Would printing a table of contents really include printing
the entire text that is pointed to by this ToC?

This could possibly be solved by having a modified anchor tag, something like
<a incl="">, which would imply that the target is a part of this page. I
don't want to discuss this right now though.
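
To make the idea concrete, here is a minimal Python sketch. It assumes the
"incl" attribute suggested above, which is not part of any HTML specification,
and the names AnchorCollector and split_anchors are made up for illustration.
It only separates the targets a converter would pull into the printed document
from ordinary links; it does not fetch anything.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Sort anchors into 'include when printing' and 'ordinary link'."""

    def __init__(self):
        super().__init__()
        self.included = []   # targets of the hypothetical <a incl="...">
        self.linked = []     # targets of ordinary <a href="...">

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("incl"):
            self.included.append(attrs["incl"])
        elif attrs.get("href"):
            self.linked.append(attrs["href"])


def split_anchors(html_text):
    """Return (included_targets, linked_targets) for one HTML page."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.included, collector.linked
```

A recursive converter would then descend only into the "included" targets, so
that printing a table of contents does not drag in every page it merely
mentions.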

> Does such a utility exist?

I have never heard of anything like this, and I can tell you I have really
been looking (/ listening) hard. :)

> If not, I'll probably try to write one.  Would anyone else out there
> be interested in such a utility?  Does there exist public-domain
> HTML-parsing code, perhaps in Perl, that would speed up creating this?

This is something I have been considering writing for a long time, since the
output from Mosaic is really _BAD_. (I print _A_LOT_ of html pages.) Too bad I
have never had the time to do it, and that I'm a bit green when it comes to
programming in postscript.

At first I was thinking of using existing tools for this, such as psroff or
the like, since that would make it easy to use the strengths of tools like
eqn and tbl to format html+ specialities (tables and equations).

However, I assume that parsers that deal with tables and equations will
eventually pop up, and these could easily be merged into a standalone utility
that converts directly to postscript. Also, I don't much like relying on other
tools, unless they are distributed with the tool that relies on them.

I will outline what I think would be necessary to build an html2ps
converter.

Just a sidestep first....

One problem would be the inclusion of images, since they are only referred to,
not included in the html file. It would be nice to have a command line utility
for fetching things referred to by an <img src=""> or <a href="">, which one
could invoke by e.g.

wwwget "http://www.ii.uib.no/~torgeir/image.gif" /tmp/raaas123.gif

which would put the file referred to by the URL into the specified file, or
optionally write it to stdout.
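
As a rough illustration, such a fetcher could be sketched in Python as below.
This is only a sketch: the function name wwwget comes from the example above,
but its signature is an assumption, and it does no error handling, no retries
and no content-type checks.

```python
import shutil
import sys
import urllib.request


def wwwget(url, dest=None):
    """Copy the resource at `url` into the file `dest`.

    If `dest` is None, the bytes are written to stdout instead.
    """
    with urllib.request.urlopen(url) as response:
        if dest is None:
            shutil.copyfileobj(response, sys.stdout.buffer)
        else:
            with open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
```

Calling wwwget(url, "/tmp/raaas123.gif") corresponds to the two-argument form
above, and wwwget(url) to the form that writes to stdout.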

One could use such a utility in a converter when necessary to include other
documents or to include images. It would even be possible to do something
like: 

wwwget "http://www.ii.uib.no/~torgeir/" | html2ps | lpr -Php4

Such a utility would not be strictly necessary for a converter, but it would
be the easiest way to include images and such.

--
Now for the converter itself. I was thinking that using an existing html
parser library would be useful, or one could simply build a new one.


Stage 1: Parse html input into independent paragraphs

A very simple internal data structure could be used. The html input would be
parsed out to separate paragraphs, each including information about:

- type, 
- indentation,
- whether to put a bullet or number in front, 
- the space to insert before / after the paragraph. 

This would make each paragraph completely independent of the others. 
Additional information could be included, such as whether to glue a paragraph
to the next one (to avoid headers at the bottom of a page).

The hierarchy would be resolved when parsing, e.g. nested lists would only
result in different indentation levels and bullet / list number styles.
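
Stage 1 could be sketched roughly as follows in Python. All names here
(Paragraph, ParagraphParser, parse_paragraphs) and the spacing values are
assumptions chosen for illustration. Only headers, <p>, <ul>, <ol> and <li>
are handled, and inline markup is simply dropped.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Paragraph:
    kind: str             # "h1", "p", "li", ...
    indent: int           # number of enclosing lists
    marker: str           # "" for none, "*" for a bullet, or "1.", "2.", ...
    space_before: int     # vertical space in points (illustrative values)
    space_after: int
    keep_with_next: bool  # glue to the next paragraph, e.g. for headers
    text: str = ""


HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.lists = []   # stack of [list tag, item counter]
        self.current = None

    def _start(self, kind, marker="", before=6, after=6, keep=False):
        self._finish()
        self.current = Paragraph(kind, len(self.lists), marker,
                                 before, after, keep)

    def _finish(self):
        if self.current is not None:
            # Collapse runs of whitespace; drop paragraphs with no text.
            self.current.text = " ".join(self.current.text.split())
            if self.current.text:
                self.paragraphs.append(self.current)
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._finish()
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self.lists[-1][1] += 1
            list_tag, count = self.lists[-1]
            marker = "*" if list_tag == "ul" else f"{count}."
            self._start("li", marker, before=2, after=2)
        elif tag in HEADERS:
            self._start(tag, before=12, after=6, keep=True)
        elif tag == "p":
            self._start("p")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self._finish()
            self.lists.pop()
        elif tag in HEADERS or tag in ("p", "li"):
            self._finish()

    def handle_data(self, data):
        if self.current is None and data.strip():
            self._start("p")  # text outside any block element
        if self.current is not None:
            self.current.text += data


def parse_paragraphs(html_text):
    """Flatten HTML into a list of independent Paragraph records."""
    parser = ParagraphParser()
    parser.feed(html_text)
    parser.close()
    parser._finish()
    return parser.paragraphs
```

Because the list nesting is folded into each record's indent and marker while
parsing, the output stage never needs to look at more than one paragraph at a
time (apart from honouring keep_with_next when breaking pages).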


Stage 2: Output each paragraph to postscript

Each paragraph needs to be broken into lines. This could possibly be done in
the postscript code itself, or one could rely on font metric files.  Whether
to justify the text could be optional, if it isn't specified in the input.
Also, images could be included at this stage.
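
A minimal sketch of the second approach (breaking lines in the converter
rather than in postscript) might look like this in Python. To avoid needing
real font metric files it assumes Courier, where every glyph is 600/1000 of
the font size wide; a real converter would read per-character widths from the
metric file of the chosen font. The function names, the 468 point line width
and the 12 point leading are assumptions for illustration, and justification
and images are left out.

```python
COURIER_WIDTH_UNITS = 600   # glyph width in 1/1000 of the font size


def break_lines(text, font_size, line_width):
    """Greedy line breaking: put as many words on each line as will fit."""
    max_chars = (line_width * 1000) // (COURIER_WIDTH_UNITS * font_size)
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate   # an over-long single word gets its own line
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def ps_escape(s):
    """Escape backslashes and parentheses for use inside a PostScript string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def paragraph_to_ps(text, x, y, font_size=10, line_width=468, leading=12):
    """Return (postscript_code, y_below) for one paragraph starting at (x, y)."""
    out = [f"/Courier findfont {font_size} scalefont setfont"]
    for line in break_lines(text, font_size, line_width):
        out.append(f"{x} {y} moveto ({ps_escape(line)}) show")
        y -= leading
    return "\n".join(out), y
```

The returned y position is where the next paragraph would start, before its
own space-before is subtracted; page breaks would be inserted when it drops
below the bottom margin.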


I would love to discuss this topic further, so feel free to give your most
honest opinion.

(Sorry for any errors that might have found their way into my document...)

--
Torgeir @ http://www.ii.uib.no/~torgeir/

- this summer @ http://www.nr.no/home/veimo/