- From: Sommer stud <Torgeir.Veimo@nr.no>
- Date: Mon, 11 Jul 94 14:26:31 +0200
- To: Multiple recipients of list <www-html@www0.cern.ch>
> Yes, HTML to PostScript.

Finally someone other than me sees the need for such a utility! Maybe
this list isn't the most appropriate forum for discussing it. Drop me a
note if you think so.

> In a couple of situations I've been involved in, in the production
> of large HTML documents, the need has arisen to non-interactively produce
> printed output from existing HTML source. This way, the documents
> can be easily edited, revised, and discussed, even with people who
> don't have access to any networks.

Not to mention that one usually doesn't want to read a long text on-line.

> I realize that Mosaic has a "save as postscript" option, but it's slow
> to use if you're not browsing, and just need output quickly.

Besides, its output is extremely primitive, in that the text is formatted
against the screen metrics, not the printer metrics (font type, page
width, font metrics etc.). It doesn't give much choice when it comes to
formatting the text in different ways either (for instance, how do I
print this document using my printer's Palatino fonts? Insert your
favourite font here.)

> Ideally, I'd like a utility that will take a file, produce PostScript
> output based on the file's HTML markup, and, ideally, follow links in
> that file to other HTML files, recursively produce the PostScript output,
> etc., without direct human interaction except to start it running.

Well, you are touching on a topic which deals with how to structure a
document for both online and offline reading. How are different pages
combined into a single document? Would printing a table of contents
really mean printing the entire text that the ToC points to? This could
possibly be solved by having a modified anchor tag, something like
<a incl="">, which would imply a part-of page. I don't want to discuss
this right now though.

> Does such a utility exist?

I have never heard of anything like this, and I can tell you I have
really been looking (/ listening) hard.
:)

> If not, I'll probably try to write one. Would anyone else out there
> be interested in such a utility? Does there exist public-domain
> HTML-parsing code, perhaps in Perl, that would speed up creating this?

This is something I have been considering writing for a long time, since
the output from Mosaic is really _BAD_. (I print _A_LOT_ of HTML pages.)
Too bad I have never had the time to do it, and that I'm a bit green when
it comes to programming in PostScript.

At first I was thinking of using existing tools for this, such as psroff
or the like, since a converter could then use the strengths of tools like
eqn and tbl to format the HTML+ document specialities (tables and
equations). However, I assume that parsers that deal with tables and
equations will eventually pop up, and these could easily be merged into a
standalone utility that converts directly to PostScript. Also, I don't
much like relying on other tools, unless they are distributed with the
tool that relies on them.

I will outline what I think would be necessary for building an html2ps
converter.

Just a side step first....

One problem would be the inclusion of images, since they are only
referred to, not included in the HTML file. A command line utility for
fetching things referred to by an <img src=""> or <a href=""> would be
nice to have, which one could invoke by e.g.

    wwwget "http://www.ii.uib.no/~torgeir/image.gif" /tmp/raaas123.gif

which would put the file referred to by the URL into the specified file,
or optionally write it to stdout. A converter could use such a utility
whenever it needs to include other documents or images. It would even be
possible to do something like:

    wwwget "http://www.ii.uib.no/~torgeir/" | html2ps | lpr -Php4

A converter would not strictly need such a utility, but it would be the
easiest way to include images and such.

--

Now for the converter itself.
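(One note on the side step before going on: the wwwget idea above could be sketched roughly as follows. This is only an illustration, assuming Python and its standard urllib.request module; the function name and its two arguments simply mirror the hypothetical command line above, and no such tool is known to exist.)

```python
import shutil
import sys
import urllib.request


def wwwget(url, dest=None):
    """Copy the resource behind `url` into the file `dest`.

    When `dest` is None the bytes go to stdout instead, so the
    output can be piped into another program.
    """
    with urllib.request.urlopen(url) as response:
        if dest is None:
            # Write raw bytes so images survive unchanged.
            shutil.copyfileobj(response, sys.stdout.buffer)
        else:
            with open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
```

Calling wwwget(url, "/tmp/raaas123.gif") would correspond to the two-argument form, and wwwget(url) to the form that writes to stdout for use in a pipeline.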
I was thinking that using an existing HTML parser library would be
useful, or one could simply build a new one.

Stage 1: Parse HTML input into independent paragraphs

A very simple internal data structure could be used. The HTML input
would be parsed into separate paragraphs, each including information
about:

- type,
- indentation,
- whether to put a bullet or number in front,
- the space to insert before / after the paragraph.

This would make each paragraph completely independent of the others.
Additional information could be included, such as whether to glue a
paragraph to the next one (to avoid headers at the bottom of a page).
The hierarchy would be resolved when parsing, e.g. nested lists would
only cause different indentation levels and bullet / list number styles.

Stage 2: Output each paragraph to PostScript

Each paragraph needs to be broken into lines. This could possibly be done
in the PostScript code itself, or one could rely on font metric files.
Whether to justify the text could be optional, if it isn't specified in
the input. Also, images could be included at this stage.

I would love to discuss this topic further, so feel free to give your
most honest opinion.

(Sorry for any errors that might have found their way into my document...)

--
Torgeir @ http://www.ii.uib.no/~torgeir/
  - this summer @ http://www.nr.no/home/veimo/
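
P.S. A minimal sketch of the two stages outlined above, assuming Python and its standard html.parser module. The names Paragraph, ParagraphParser, parse_paragraphs and break_lines are hypothetical, as are the spacing values. Stage 2 here only wraps plain text at a fixed character width, as a stand-in for real font metrics; it does not emit PostScript.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
import textwrap


@dataclass
class Paragraph:
    kind: str                     # "h1".."h6", "p" or "li"
    indent: int = 0               # list nesting depth
    marker: str = ""              # "*" for bullets, "1." etc. for numbers
    space_before: int = 0         # blank lines before the paragraph
    space_after: int = 1          # blank lines after the paragraph
    keep_with_next: bool = False  # glue a header to what follows
    text: str = ""


class ParagraphParser(HTMLParser):
    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.lists = []      # stack of [tag, item counter] for open lists
        self.current = None  # paragraph being collected, if any

    def _close(self):
        # Finish the current paragraph, normalising its whitespace.
        if self.current is not None:
            self.current.text = " ".join(self.current.text.split())
            if self.current.text:
                self.paragraphs.append(self.current)
        self.current = None

    def handle_starttag(self, tag, attrs):
        depth = len(self.lists)
        if tag in ("ul", "ol"):
            self._close()
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self._close()
            self.lists[-1][1] += 1
            kind, count = self.lists[-1]
            marker = "*" if kind == "ul" else f"{count}."
            self.current = Paragraph("li", depth, marker, space_after=0)
        elif tag in self.HEADERS:
            self._close()
            self.current = Paragraph(tag, depth, space_before=1,
                                     keep_with_next=True)
        elif tag == "p":
            self._close()
            self.current = Paragraph("p", depth)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self._close()
            self.lists.pop()
        elif tag in self.HEADERS or tag in ("p", "li"):
            self._close()

    def handle_data(self, data):
        if self.current is None:
            if not data.strip():
                return
            # Text outside any known block becomes a plain paragraph.
            self.current = Paragraph("p", len(self.lists))
        self.current.text += data


def parse_paragraphs(html):
    """Stage 1: turn HTML into a flat list of independent paragraphs."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    parser._close()
    return parser.paragraphs


def break_lines(par, width=60):
    """Stage 2 stand-in: break one paragraph into lines of `width` chars."""
    prefix = "  " * par.indent + (par.marker + " " if par.marker else "")
    return textwrap.wrap(par.text, width=width, initial_indent=prefix,
                         subsequent_indent=" " * len(prefix))
```

Because the nesting is resolved while parsing, each record carries everything the output stage needs, so break_lines can handle one paragraph at a time without looking at its neighbours. A real converter would replace the character count with widths taken from font metric files.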
Received on Monday, 11 July 1994 14:26:41 UTC