
Re: Problem with Microsoft Word and Tidy

From: Dave Raggett <dsr@w3.org>
Date: Wed, 10 Aug 2005 11:06:40 +0100 (BST)
To: David Wilczynski <dwilczyn@usc.edu>
Cc: html-tidy@w3.org, Tom Lipkis <tal@pss.com>
Message-ID: <Pine.LNX.4.61.0508101039540.10139@holly>

Tidy's support for cleaning up HTML exported by Microsoft Office is
pretty dated. Office 97 proved much easier to clean up than
Office 2000, and I don't know whether anyone has looked into
improving the current support or studied what is required for
Office 2003 and beyond.

This is a splendid opportunity for volunteers to study what is
needed and to identify techniques for addressing the resulting
requirements. It may prove easier to work from the .doc format than
from the mess exported as HTML/XML. OpenOffice includes a pretty
good import mechanism for Office documents, which could be
leveraged for HTML Tidy.

One of the problems is how to identify what the author intended,
as this is well hidden within the document model used by Word.
In essence, we need an expert system that can construct a plausible
reconstruction without a mess of styles on each paragraph or inline
text. The current code in Tidy strips a lot of this out, and a much
better job could be done at inferring the stylesheet rules.
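
As a rough illustration of what "inferring the stylesheet rules" could
mean in its simplest form, the sketch below lifts repeated inline
styles into shared class rules. It is not Tidy's code and makes no
attempt at guessing the author's intent: the names normalize,
StyleLifter and lift_styles and the generated class names (c1, c2, ...)
are hypothetical, and it assumes well-formed input without comments,
scripts or self-closing tags.

```python
from html import escape
from html.parser import HTMLParser


def normalize(style):
    """Canonical form of an inline style: lower-cased properties, sorted."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            decls[prop.strip().lower()] = " ".join(value.split())
    return "; ".join(f"{p}: {v}" for p, v in sorted(decls.items()))


class StyleLifter(HTMLParser):
    """Re-emit the markup, replacing inline styles with generated classes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rules = {}  # (tag, canonical style) -> generated class name
        self.out = []    # pieces of the rewritten markup

    def handle_starttag(self, tag, attrs):
        kept, classes = [], []
        for name, value in attrs:
            if name == "class" and value:
                classes.extend(value.split())
            elif name == "style" and value and normalize(value):
                key = (tag, normalize(value))
                # Reuse the class if this tag/style pair was seen before.
                classes.append(
                    self.rules.setdefault(key, f"c{len(self.rules) + 1}"))
            elif name != "style":  # empty style attributes are dropped
                kept.append((name, value))
        if classes:
            kept.append(("class", " ".join(classes)))
        text = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in kept)
        self.out.append(f"<{tag}{text}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser hands over unescaped text, so escape it again.
        self.out.append(escape(data, quote=False))


def lift_styles(html_text):
    """Return (stylesheet, rewritten markup) for a fragment of HTML."""
    parser = StyleLifter()
    parser.feed(html_text)
    parser.close()
    rules = [f"{tag}.{cls} {{ {style} }}"
             for (tag, style), cls in parser.rules.items()]
    return "\n".join(rules), "".join(parser.out)
```

Two paragraphs whose style attributes differ only in spacing, case or
declaration order end up sharing one rule. A real solution would also
need to decide which styles reflect what the author meant and which
are Word noise, which is the hard part described above.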

I no longer have the time to work on this, so this is a call for
volunteers to assist the current developers working on the
SourceForge site for Tidy; see http://tidy.sourceforge.net/

P.S. It may be possible to gradually wean people off Word if there
were effective and free alternatives that run in the web browser
without the need to install any software. I am looking into how to
achieve this using the design mode feature in IE 5.5+ and in
browsers based on Mozilla 1.3 or later (including Firefox, Galeon,
Epiphany, etc.). The promise has been demonstrated by HTMLArea,
FCK text editor and widgEditor (Google will find the links).

-- 
  Dave Raggett <dsr@w3.org>  W3C lead for multimodal interaction
  http://www.w3.org/People/Raggett +44 1225 866240 (or 867351)

Received on Wednesday, 10 August 2005 10:06:36 GMT
