Re: Repairing incorrect tag minimisation was Re: Tags lacking a terminating '>' are spotted

At 11:27 am +1300 11/2/02, Richard A. O'Keefe wrote:
>I proposed a simple rule that could recover from missing '>'.
>bfowler@ewitness.co.uk (ewitness - Ben Fowler) wrote:
>
>	(Of course Tidy should close <td> and <tr> tags).
>
>I don't understand whether this refers to adding '>' or to adding </td>.
></tr> and </td> end-tags are not required for legality in HTML 3.2 or
>HTML 4.01.

I meant the latter. I thought that it was tidy's job to normalise
SGML like this. If it is not, then I am simply wrong.

The documentation for tidy includes this ambiguous text:

	Perfecting lists by putting in tags missed out:
	   <body>
	   <li>1st list item
	   <li>2nd list item

	is mapped to

	   <body>
	   <ul>
	   <li>1st list item</li>
	   <li>2nd list item</li>
	   </ul>

1) Tidy is not 'perfecting' it is 'correcting', 'amending' or my
favoured term 'repairing'.

2) It has done two distinct things:
	a) Added a missing <ul> start tag
	b) Normalised omitted </li> end tags

It is this 'passing over' or paraleipsis which is the rhetorical
term <URL: http://www.chem.leeds.ac.uk/roget/entries/460.html >
of normalising the <li> element there that would suggest that
tidy normalises tags as a matter of course.

Independent of the argument in favour of normalising a tidied
document to a canonical form; the benefits of including the
end tags I mentioned include coddling a bug in Netscape which
sometimes fails to render table content when presented with
legal markup omitting these end tags (some 'wysiwyg' editors
omit these tags as a matter of routine),
<URL: http://www.htmlhelp.com/faq/html/all.html#tables >
<URL: 
http://www.evolt.org/article/Don_t_let_Netscape_cramp_your_styles/17/3268/ >
<URL: http://www.hwg.org/resources/faqs/1cssFAQ.html#optiontags >
<URL: http://www.cyber.com.au/users/clancy/clancys_tables.html >

and improving the performance of layout engines when CSS
is in use.
<URL: http://www.htmlhelp.com/faq/html/all.html#page-inconsistent >
<URL: http://www.leeds.ac.uk/acom/html/lesson10text.html >
<URL: http://www.blooberry.com/indexdot/css/syntax/generalbugs.htm >
<URL: http://www.webreference.com/html/tutorial11/1.html >

Logically, to make a page 'tidy' we should either omit all
omissible tags, or include all includable ones, I prefer the
latter, and believe that I can support that preference, perhaps
you differ.

>	I have an objection to the process that generates <td compact>
>	from the input you gave.
>
>	IMHO, Tidy should adopt a "first do no harm" rule.
>
>That would be a nice rule, *IF* one could make it mean something definite.

Surely 'first do no harm' is one of the most definite rules that
there is; which is why I used that form of words.

>	In other words, it should not attempt a partial repair of
>	elements that it does not fully understand.
>
>That rule being adopted, Tidy could never repair anything at all.

The docs give ten examples, including the <ul> container element
that I mentioned earlier

>	In my view, because it is impossible to determine whether the
>	author intended
>
>		<td nowrap="nowrap">
>
>	or
>
>		nowrap is a deprecated attribute for table data cell elements
>
>	Tidy should do nothing.
>
>I fail to see how logging the problem is an advance on logging the
>problem and attempting the repair.  The reason I gave the entire list
>of words that could incorrectly be swallowed was to point out that it is
>a very short list, and most of the "words" are not actual English words
>at all (as this one is not).  For realistic examples, the odds against a
>real word being swallowed in this way are surely extremely low.

All right. You have convinced me. I had thought that there was
a general issue here, but I now see that it is fairly specific,
and you are right.


>	I appreciate that there is a case for changing to <td>, and
>	expecting the author to re-check the page in an HTML editor
>	or browser, but I personally would not do it.
>
>	The best that Tidy could do is log the problem. (Or perhaps
>	add a comment).
>
>That is rather far from the best.  _Whichever_ is the intended correction,
>"<td nowrap> is a deprecated ...." is an excellent starting point for a
>manual correction.  No matter WHAT changes Tidy makes, it is always
>necessary for someone to check the result.

I was taking human nature into account. I suspect that many people
would run the document through tidy and call it a day. Many people
are minded to take the output of computer as 'vox dei'. I would
doubt that in the real world the results would be rigorously checked,
and I suspect that often would not be checked at all.

In the case in point, the check might have to be done in a text editor
rather than an HTML editor.

Ben.

Received on Tuesday, 12 February 2002 03:00:54 UTC