W3C home > Mailing lists > Public > html-tidy@w3.org > January to March 2003

Re: New Site Cleaning Project

From: Fred Christian <fred@gloryofgod.com>
Date: Tue, 18 Mar 2003 09:20:02 -0800
Message-ID: <003401c2ed72$9e530210$6601a8c0@SLIM>
To: <html-tidy@w3.org>, <asp@tvw.net>

Hi Julian,

My experience has been with a company that wanted their writers to use MS
Word to author their documents. I wrote a tool for them to export the Word
.doc files to HTML.

I wrote an application that did these basic steps.

1. Allows the user to select a folder.
2. App identifies all the appropriate .doc files in that folder.
Loop through each document doing the following.
3. Open the document and make some adjustments so it will export corectly,
then export it as HTML.  Then close the document.
4. Run some custom cleanup on the document to prepare it to go through HTML
Tidy.
5. Run Microsofts "Filter" application on the file with some specific
switches active to further prepare the document for HTML Tidy.
6. Run the html through HTML TIDY with the configuration file set up to
output XHTML.
7. Do some more custom cleanup on the XHTML output.
8. Transform the XHTML using XSLT.  I used the Saxon parser because it is
more powerful the Microsofts built in MSXML.dll

Things I learned along the way are...
- MS Word is evil (actually I allready knew this, now it is firmly seated in
my brain).
- HTML Tidy and Saxon are wonderful tools.
- There are tools out there that attempt to export Word documents to HTML
but I couldn't use them because I needed more customization than they
provided and I suspect that many people will have the same problbems I did
- Even though I created a tool to export the documents, it was still
imperative for the writers to strictly use the correct styles and formatting
in their Word Doc's in order to get the desired results.
- C# and .net are cool.  But for my app, I didn't find a need for it.  So
much of it was just calling other proccess's.  I decided to keep it simple
and able to run on any Windows computer instead of requiring that the
computer had the .net framework installed.

Hope that helps. If you have specific questions feel free to ask.

Fred
----- Original Message -----
From: "Julian Voelcker" <asp@tvw.net>
To: <html-tidy@w3.org>
Sent: Tuesday, March 18, 2003 8:32 AM
Subject: New Site Cleaning Project


>
> I quite often have to tidy up websites or convert documents to be added
> to a website or a content management system.
>
> To date I have done a lot of it by hand, but am now contemplating
> developing it into a small application that I may consider
> distributing.
>
> I want the app to be able to work on a set of directories and do the
> following:-
>
> - If the docs are in non html format, convert them to html.
> - Use Tidy to do an initial pre-processing tidy up of the documents so
> that the html is in a clean, non tabbed, formatted structure.
> - Use a regular expression find and replace facility to strip out
> unwanted code - quite a bit of hand work will be required here, but the
> idea would be to store each search so that the user can build up a
> library of searches during testing before applying them to a complete
> site.
> - Use Tidy to do a final post processing re-format of the finished
> output.
> - If the documents are to entered into a database, extract the relevant
> data and output in a format suitable for importing into a database.
>
> I'm going to be writing the app in C#.
>
>
> I would appreciate any comments/feedback from the users here as to
> whether this is a good idea (has it been done before) and whether you
> have any advice for me.
>
> It's been a while since I have written a Windows app so I may be a
> little slow in getting started with it.
>
> Cheers,
>
> Julian Voelcker
> The Virtual World (UK) Limited
> Cirencester, United Kingdom
>
>
Received on Tuesday, 18 March 2003 12:17:37 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:53 GMT