
Re: validator Tidy HTML adds DTD without systemId to quirks documents.

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 29 Jul 2012 10:37:16 +0300
Message-ID: <5014E82C.2030801@cs.tut.fi>
To: Rob^_^ <iecustomizer@hotmail.com>
CC: "w3.org Validator List" <www-validator@w3.org>
2012-07-29 5:40, Rob^_^ wrote:

> consider this simple html document.
[...]
> which the w3c validator ‘Tidy html’ corrects to
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
[...]

You are referring to the "Clean up Markup with HTML-Tidy" option in the 
extended user interface ("More Options") at <http://validator.w3.org>. 
As the output of that option itself states, "HTML-Tidy is a third-party 
software not developed at W3C, and its output is /provided without any 
guarantee/".

The option is more or less bogus. Just don't use it. In addition to the 
feature you have observed, the option causes the incomplete HTML 3.2 
doctype to be emitted even if you used a different doctype, or else 
changes the doctype to an HTML 4.01 one. Moreover, when getting rid of 
presentational markup, HTML-Tidy uses automatically generated class 
names, so the result is really less readable than the original. And it 
can go very wrong. Consider this:

<!doctype html>
<title>Hello world</title>
<p class=c1>Hello
<p align=center>Hi!

This results in the following "tidied" document:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<title>Hello world</title>

<style type="text/css">
  p.c1 {text-align: center}
</style>
</head>
<body>
<p class="c1">Hello</p>
<p class="c1">Hi!</p>
</body>
</html>

Not only has it changed the HTML5 doctype (ensuring Standards Mode as 
far as possible) to an HTML 4.01 Transitional doctype without a system 
identifier, with the risk of Quirks Mode; it has also "cleaned up" 
align=center by introducing the class name c1 and associated CSS, 
without checking whether the name is already in use. So both paragraphs 
end up centered (and the second one also picks up whatever external CSS 
might apply to class c1).
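
To make the missing check concrete, here is a minimal Python sketch of 
how a cleanup tool could pick a generated class name that is not already 
in use. This is my own illustration, not HTML-Tidy's code; the names 
ClassCollector and unused_class_name are made up for the example, and it 
only looks at class attributes in the document itself, not at external 
style sheets.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name used in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.names.update(value.split())


def unused_class_name(html, prefix="c"):
    """Return the first of prefix1, prefix2, ... not already used in html."""
    collector = ClassCollector()
    collector.feed(html)
    n = 1
    while f"{prefix}{n}" in collector.names:
        n += 1
    return f"{prefix}{n}"


source = """<!doctype html>
<title>Hello world</title>
<p class=c1>Hello
<p align=center>Hi!
"""
print(unused_class_name(source))  # prints "c2", since c1 is taken
```

With a check like this, the generated rule would be attached to c2, and 
the first paragraph would keep whatever styling c1 already had.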

In addition to this, people trying to use HTML5 have been misled into 
thinking that HTML-Tidy generally fixes presentational markup, 
converting it to use CSS. For example, if you submit the following 
document, you will get an error message, saying "The width attribute on 
the td element is obsolete. Use CSS instead.":

<!doctype html>
<title>Hello world</title>
<table><tr><td width=100>foo</table>

Now, as HTML-Tidy has been advertised to fix problems of this type, and 
since it is available as an option in the validator's user interface, 
people take this option and get a "tidied" version - which has the same 
<table> markup, just with different formatting.

It gets worse. Suppose you are validating an HTML5 document, with 
<!doctype html>, containing some element introduced in HTML5, say 
<aside>What is going on?</aside>. You will get no error message about 
it of course, since you are validating as HTML5, but if you use the 
HTML-Tidy option, the "tidied" document has been silently stripped of 
the <aside> and </aside> tags. (This happens when there is _some_ error 
message.)
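
Since the removal is silent, the only way to notice it is to compare the 
two documents. A minimal Python sketch of such a comparison follows; it 
counts start tags before and after and reports those that went missing. 
The function names are made up for the example, and the "after" string 
is a hand-written stand-in for tidied output, not actual HTML-Tidy 
output.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by element name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    counter = TagCounter()
    counter.feed(html)
    return counter.counts


def dropped_elements(original, tidied):
    """Return start tags that occur more often in original than in tidied."""
    # Counter subtraction keeps only the positive differences.
    return dict(count_tags(original) - count_tags(tidied))


before = "<!doctype html><title>t</title><aside>What is going on?</aside>"
# Stand-in for a tidied result in which the aside tags were dropped.
after = "<html><head><title>t</title></head><body>What is going on?</body></html>"
print(dropped_elements(before, after))  # prints {'aside': 1}
```

Elements that the cleanup adds (html, head, body) are ignored here; only 
what disappeared is reported.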

The HTML-Tidy option should simply be removed. The HTML-Tidy software 
should be used, at most, by people who know well what it really does. 
And such people can surely run it separately on their documents.

Yucca
Received on Sunday, 29 July 2012 07:37:48 GMT
