Re: HTML -> XML -> WML using jtidy from Martin Wickman on 2000-10-29 (html-tidy@w3.org from October to December 2000)

From: Martin Wickman <martin.wickman@infohwy.se>
Date: Sun, 29 Oct 2000 14:14:15 +0100
To: html-tidy@w3.org
Message-ID: <92e28b52.8b5292e2@infohwy.se>
From: Frank Steuer <steuer@ece.orst.edu>

Thanks for your reply!

(I have been working on and debugging this all night, so excuse any 
sleepiness :-)

> I try to do the same job - but to achieve a general solution.
> I also used jtidy and then xalan and xerces to transform the XML 
> documents to wml (or cHTML, XHTML subsets or HTML subsets) via XSLT.

That's my idea as well. Feels good to know there are others out there 
with the same problems.

Btw, do you use the DOM representation or do you just "pipe" the 
tidied, prettyprinted output from jtidy to xalan etc? 

I have some reservations about the status of jtidy's DOM support. jtidy 
happily parses my html into a DOM document. But when I try to traverse 
the DOM tree to produce textual output, I don't get all the elements, 
and some other stuff is missing as well. This makes me a bit suspicious.

I have tried using my own prettyprinter, jdom's DOM output and a few 
others, but none of them produces the correct output.

Maybe I am doing something wrong. Here is a snippet:

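	// Assumes: import org.w3c.tidy.Tidy; import org.w3c.dom.Document;
	// "in" is an InputStream over the HTML source.
	Tidy tidy = new Tidy();
	tidy.setXmlOut(true);   // ask for XML output instead of HTML
	tidy.setXmlPi(true);    // add the <?xml ...?> declaration
	// null: no output stream here, only build the DOM
	Document doc = tidy.parseDOM(in, null);
	tidy.pprint(doc, System.out);   // print the DOM with jtidy's own printer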

pprint() produces what looks like a correct XML representation. But if 
I use the Document doc object with jDOM or send it to the print() 
method in the example class (TestDOM from sourceforge) it prints 
nothing. The html test document is well-formed and as simple as possible.

My original idea was to use the DOM representation and then call xalan 
with an XSL stylesheet. If that won't work I guess I have to parse the 
tidied XML string again using another XML parser. 
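
To make that fallback concrete, here is a minimal sketch of the re-parse 
step. It assumes the pprint() output has already been captured as a String 
(for example by printing into a ByteArrayOutputStream); the class and 
method names are made up for illustration, and the parser is whatever the 
standard javax.xml.parsers factory returns, not jtidy itself. It re-parses 
the string and walks the new DOM, which is the traversal that came out 
empty with the jtidy Document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jtidy: re-parse tidied XML and walk it.
class ReparseSketch {

    // Parse an XML string (e.g. captured pprint() output) into a fresh DOM.
    static Document reparse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("tidied output was not parseable XML", e);
        }
    }

    // Recursively record element names and non-blank text, indented by depth.
    static void dump(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            indent(depth, out);
            out.append('<').append(node.getNodeName()).append(">\n");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(depth, out);
                out.append(text).append('\n');
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, depth + 1, out);
        }
    }

    static void indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    // Convenience wrapper: re-parse, then dump from the root element down.
    static String describe(String xml) {
        StringBuilder out = new StringBuilder();
        dump(reparse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for real tidied output; the markup is an invented sample.
        String tidied = "<?xml version=\"1.0\"?>"
            + "<html><body><p>Hello</p></body></html>";
        System.out.print(describe(tidied));
    }
}
```

If every element shows up in this dump but not when walking the Document 
returned by parseDOM(), that would point at the DOM wrapper rather than at 
the tidying itself.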

> It works - more or less. The problem I have is that I try to 
> transcode documents I do not have any control over. (lots of 
> errors, headings used as a layout tool and not to define the 
> structure of a document etc....)

I know the feeling. Unfortunately, I cannot give you any helpful hints. 
But if you manage to get it to work and GPL it, it would be a huge 
donation to the open source community and http://www.kannel.org in 
particular :-) 

> One of the problems I still have to solve is the splitting of big xml
> documents into several decks and cards. Here you should not have 
> that big a problem, because you said that you have a kind of control
> over how the html documents are written. 

Sure enough, I will face that problem as well. But I don't think 
splitting documents into several cards will solve the low-memory issues 
(afaik, a deck is sent with all its cards at the same time?). I guess 
the files will have to be split up somehow anyway, inserting 'Next 
section...' and 'Previous section...' links.
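
A rough sketch of that kind of splitting, under several assumptions: the 
content is already a list of paragraph strings that are safe to place 
inside <p>, the size limit is counted over the paragraph text only (not 
the surrounding markup), and the deckN.wml file names are invented. The 
WML DOCTYPE is left out for brevity. Each chunk becomes its own one-card 
deck, so a phone only ever fetches one chunk at a time.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split content into separate decks with nav links.
class DeckSplitSketch {

    // Group paragraphs into chunks whose UTF-8 text size stays within
    // maxBytes. A single oversized paragraph still gets a chunk of its own.
    static List<List<String>> chunk(List<String> paragraphs, int maxBytes) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int size = 0;
        for (String p : paragraphs) {
            int len = p.getBytes(StandardCharsets.UTF_8).length;
            if (!current.isEmpty() && size + len > maxBytes) {
                chunks.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(p);
            size += len;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    // Render chunk i as a one-card deck with previous/next links.
    static String renderDeck(List<List<String>> chunks, int i) {
        StringBuilder wml = new StringBuilder();
        wml.append("<?xml version=\"1.0\"?>\n");
        wml.append("<wml>\n<card id=\"s").append(i).append("\">\n");
        for (String p : chunks.get(i)) {
            wml.append("<p>").append(p).append("</p>\n");
        }
        if (i > 0) {
            wml.append("<p><a href=\"deck").append(i - 1)
               .append(".wml\">Previous section...</a></p>\n");
        }
        if (i < chunks.size() - 1) {
            wml.append("<p><a href=\"deck").append(i + 1)
               .append(".wml\">Next section...</a></p>\n");
        }
        wml.append("</card>\n</wml>\n");
        return wml.toString();
    }

    public static void main(String[] args) {
        List<String> paragraphs = new ArrayList<>();
        paragraphs.add("First paragraph.");
        paragraphs.add("Second paragraph.");
        paragraphs.add("Third paragraph.");
        List<List<String>> chunks = chunk(paragraphs, 40);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println(renderDeck(chunks, i));
        }
    }
}
```

The byte limit would have to be tuned per device, and a real version 
would also need to count the markup overhead.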

> I would try XSL(T). It is pretty easy and by changing the XSL 
> stylesheets you can try to get the wanted output. You don't have to 
> change the application, recompile it to java bytecode etc.

I have started writing some XSLT files for the XML/HTML to WML 
conversion.
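
For illustration only (this is not one of the stylesheets mentioned 
above), here is a minimal sketch of what such a conversion might look 
like when driven through the standard javax.xml.transform API. The 
stylesheet, the class name and the sample input are all invented, and it 
assumes the tidied elements are in no namespace. Headings become bold 
paragraphs, paragraphs are kept, and anything else falls through the 
XSLT built-in rules and is flattened to its text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: apply a tiny HTML-to-WML stylesheet to tidied XML.
class HtmlToWmlSketch {

    // Invented stylesheet; in practice it would live in its own .xsl file
    // so it can be changed without recompiling anything.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/html'>"
      + "<wml><card id='main' title='{head/title}'>"
      + "<xsl:apply-templates select='body'/>"
      + "</card></wml>"
      + "</xsl:template>"
      + "<xsl:template match='h1|h2|h3'>"
      + "<p><b><xsl:apply-templates/></b></p>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<p><xsl:apply-templates/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Run the stylesheet over an XML string and return the result as text.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for tidied output; the markup is an invented sample.
        String tidied = "<html><head><title>Demo</title></head>"
            + "<body><h1>News</h1><p>Hello</p></body></html>";
        System.out.println(transform(tidied));
    }
}
```

A real deck would also need the WML DOCTYPE and some handling for links, 
images and tables, which this sketch simply flattens.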
 
> I will publish the results of my work pretty soon as GPLed source. 
> Right now it does not make sense because it is too much under 
> construction and not documented yet.

Great. I'll be watching this space.
Received on Sunday, 29 October 2000 08:12:10 UTC