
Cleaning HTML Sources With Tidy

From: BARRY MEEHAN <bmeehan@us.ibm.com>
Date: Tue, 2 Oct 2001 11:03:13 -0400 (EDT)
To: html-tidy@w3.org
Message-ID: <OF20E8AC90.399AA7E9-ON85256AD8.0077B0C1@pok.ibm.com>
IBM Product Lifecycle Management has a business partner that develops a
software product that IBM markets and supports.  As part of the development
effort, the business partner creates the end-user product information in
HTML.  When we send this source to our IBM translation centers, the tool
that checks the HTML files for compliance with our HTML guidelines (which
are based on the W3C's) always finds errors.  The business partner creates
the files with MS FrontPage.  We have a deviation that allows them to use
the FONT element for "human factors" reasons, even though it's a problem
for Japanese translation.

So, I asked the business partner to give Tidy a spin because it looks like
it would catch and correct the majority of the problems our checking tool
finds.  Attached are their findings.  It's possible they missed something in
the instructions or release notes, because Tidy does seem to have missed
things it should have fixed.  I am interested in your reaction and, in
particular, in which of the errors it cannot fix.

Thanks,

Barry Meehan
Internet:    bmeehan@us.ibm.com
_______________________

Barry,

I downloaded the Tidy tool and tested it on five HTML files. The results
are not impressive: very few errors were fixed, and one file was even
completely empty after cleaning.

The test consisted of opening each file and running the HTML Tidy "Clean,
correct, convert and format" function. I did not try to customize the
cleaning. I have not yet found out how to remove the comments.
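
For reference, here is a minimal sketch of how comment removal might be
scripted from the command line. It assumes a present-day HTML Tidy build
whose `tidy` executable is on the PATH; the Tidy front end tested above may
not expose the same options. The `hide-comments` and `force-output` settings
are options of current Tidy, and `force-output` is included only because a
missing output file can be one possible result of Tidy reporting errors. The
helper names `build_tidy_command` and `run_tidy` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_tidy_command(html_path: str, output_path: str) -> List[str]:
    """Build a command line for the `tidy` executable (assumed to be on PATH)."""
    return [
        "tidy",
        "-quiet",                  # suppress non-essential messages
        "--hide-comments", "yes",  # leave HTML comments out of the output
        "--force-output", "yes",   # write output even if Tidy reports errors
        "-output", output_path,    # write cleaned markup to a separate file
        html_path,
    ]


def run_tidy(html_path: str, output_path: str) -> int:
    """Run Tidy and return its exit status (0 = ok, 1 = warnings, 2 = errors)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    return subprocess.run(build_tidy_command(html_path, output_path)).returncode
```

Writing to a separate output file, rather than modifying the source in
place, keeps the original available for a before-and-after comparison with
the checking tool.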

I have attached the data (checking tool output) from before and after
cleaning.



BEFORE CLEANING:
Checker Output:
(See attached file: BEFORE.SUM)


AFTER CLEANING:
Checker Output:
(See attached file: AFTER.TXT)


(See attached file: Before.txt)
(See attached file: After.txt)

Received on Friday, 5 October 2001 15:05:50 GMT
