W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2003

JTidy vs TagSoup

From: Reza Ferrydiansyah <ref11+@pitt.edu>
Date: Thu, 20 Nov 2003 19:07:57 -0500
To: html-tidy@w3.org
Message-id: <15748525.1069355277@shrsft6025d.shrs.nt.pitt.edu>

I have had some problems with JTidy's (and I believe Tidy's in general) 
handling of forms inside of table:

For example the html snippet of:
Form testing
<table>
 <form action='pageA'>
  <tr><td><form>Input 1 Form 1: <input name='v'/></td><td>Input 2 Form 1: 
<input name='w'/></form><form action='pageB'>Input 1 Form 2: <input 
name='x'></td></tr>
  <tr><td>Input 2 Form 2: <input name='y'></td></tr></table>Input 3 Form 2: 
<input name='z'></form>

will be turned into

Form testing

<form action='pageA'></form>

<table>
<tr>
<td>
<form>Input 1 Form 1: <input name='v' /></form>
</td>
<td>Input 2 Form 1: <input name='w' />
<form action='pageB'>Input 1 Form 2: <input name='x' /></form>
</td>
</tr>

<tr>
<td>Input 2 Form 2: <input name='y' /></td>
</tr>
</table>

Input 3 Form 2: <input name='z' /> Hello World

Notice that input w, y, z are now input without a form.
I have tried using another HTML cleaner, TagSoup (unfortunately, still a 
lot of errors) which returns:

<form enctype = "application/x-www-form-urlencoded" method = "get" action = 
"pageA">
  <table><tbody><tr><td colspan = "1" rowspan = "1">
   <form enctype = "application/x-www-form-urlencoded" method = "get">Input 
1 Form 1: <input type = "text" name = "v"></input></form></td>
   <td colspan = "1" rowspan = "1">Input 2 Form 1: <input type = "text" 
name = "w"></input></td></tr></tbody>
  </table>
</form>
  <form enctype = "application/x-www-form-urlencoded" method = "get" action 
= "pageB">
    Input 1 Form 2: <input type = "text" name = "x"></input>
    <table><tbody><tr><td colspan = "1" rowspan = "1">Input 2 Form 2: 
<input type = "text" name = "y"></input></td></tr></tbody></table>
    Input 3 Form 2: <input type = "text" name = "z"></input>
  </form>

In here, all inputs are inside a form. although the input v is inside a 
form which is inside a form.

The way I see it, tidy usually deletes erroneous tags.
Therefore
<table>
 <form><tr><td>...</form> ...
will be changed to
 <form>
 <table><tr><td>...</form>...
because a form cannot be inside a table. It is then corrected to
 <form>
</form>
 <table><tr><td>...

While in tagsoup
<table>
 <form><tr><td>...</form> ...
is changed into
<table>
</table>
 <form><tr><td>...</form> ...
and then
<table>
</table>
 <form><table><tr><td>...</form> ...

in most cases I think the tag soup version is much more preferable than the 
tidy version. Anybody know how to do this with jtidy?

Relevantly, one of the task I am doing currently is cleaning up HTML code 
from popular websites. Yahoo, Lycos all have trouble with their form in the 
form of <form> tags located under <table> tags (<table><form><tr><td>...) 
So I am currently having a lot of trouble with this...

--
Reza Ferrydiansyah
Received on Thursday, 20 November 2003 19:12:46 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:38:54 UTC