REQ: suppressing unknown attributes

I miss the following features in tidy30apr00 and would very much like to see
them added in a future version:

- Suppression of unknown attributes (the next item is easy to accomplish
with Perl, but I don't know which attributes are unknown, so I can't
delete them)

- Suppression of empty font tags (<FONT SIZE=5></FONT> does nothing in
HTML, so why not delete it?)

- A smarter guessing routine for unquoted attributes. A common mistake
that I often see looks like this:
	<FONT FACE=Arial, Helvetica COLOR=navy blue>
Tidy "cleans" it to <FONT FACE="Arial," HELVETICA="" COLOR="navy" BLUE="">
while I would rather correct it to <FONT FACE="Arial, Helvetica" COLOR="navy
blue"> (even though this would then look brown, I don't care; that's the
writer's fault, and at least I am creating valid HTML 4.0 Transitional this
way)

- Maybe a list to correct common misspellings. Brits and Aussies, for
example, like to write <FONT COLOUR="blue">, which produces a validation
error. Changing this to <FONT COLOR="blue"> is a safe guess. I am going to
compile such a list over the next few weeks and would happily pass it on if
requested.
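
A rough sketch of the two simpler items above (empty FONT tags and the
misspelling list), in Perl. The sub names and the one-entry table are only
placeholders, not anything Tidy provides, and the regexes are naive: they
would also touch text that merely looks like a tag, or an attribute name
inside a quoted value.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny table of misspelled attribute names.
my %attr_fix = (colour => 'color');

# Remove FONT elements that contain nothing but whitespace.
sub drop_empty_font {
    my ($html) = @_;
    # Repeat so that nested empty FONT elements collapse as well.
    1 while $html =~ s{<font\b[^>]*>\s*</font\s*>}{}gi;
    return $html;
}

# Rename misspelled attribute names inside tags. An all-uppercase name
# stays uppercase; anything else gets the lowercase replacement.
sub fix_attr_names {
    my ($html) = @_;
    for my $wrong (keys %attr_fix) {
        my $right = $attr_fix{$wrong};
        1 while $html =~ s{(<[^>]*\s)(\Q$wrong\E)(?=\s*=)}
                          { $1 . ($2 eq uc($2) ? uc($right) : $right) }gie;
    }
    return $html;
}

print drop_empty_font('a<FONT SIZE=5></FONT>b'), "\n";   # prints: ab
print fix_attr_names('<FONT COLOUR="blue">'), "\n";      # prints: <FONT COLOR="blue">
```

Both subs loop until the substitution stops matching, so several hits in
one tag are all handled.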



My Tidy wrapper in Perl should be finished by the end of this week, and I
will publish it then. Many thanks to Pete Gelbman, Mikael Hultgren and Mike
Depot for their helpful support.


For those who are interested, here is my loop to quote unquoted attributes
(color and face only for now, as I haven't seen a need for others yet).
Any kind of feedback, suggestions, requests or teaching about how to
do this in a better way is very welcome! (I had to add the comments,
otherwise I wouldn't understand what I hacked up myself.)

# Add quotation marks to unquoted COLOR & FACE attributes. One pass quotes
# at most one attribute per tag, so repeat until the substitution no longer
# matches (this also avoids looping forever on input it cannot fix).
1 while $FormData{$key} =~ s/
   (<[^>]*?)             # start of tag, e.g. '<FONT '              $1
   (color|face)          # the attribute name that we want quoted   $2
   \s*=\s*               # the equal sign, maybe with spaces around
   ([^>"=]+?)            # the unquoted attribute value             $3
   (?=                   # look ahead
     \s+[a-z]+\s*=       # next attribute (one word before a '=' sign)
     |
     >                   # end of tag
   )                     # finish looking
/$1$2="$3"/gix;
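
As a quick check, here is a self-contained sketch that wraps essentially the
same substitution in a sub and repeats it until it no longer matches. The
name quote_attrs is made up for this example. Run on the FONT tag from the
wish list above, it gives the correction I was hoping Tidy would make.

```perl
use strict;
use warnings;

# Hypothetical wrapper; the name quote_attrs is only for illustration.
sub quote_attrs {
    my ($html) = @_;
    # Each pass quotes at most one attribute per tag, so loop until done.
    1 while $html =~ s/
        (<[^>]*?)          # start of tag, e.g. '<FONT '
        (color|face)       # attribute name to quote
        \s*=\s*            # equal sign, optional spaces around it
        ([^>"=]+?)         # the unquoted value
        (?= \s+[a-z]+\s*= | > )   # next attribute or end of tag
    /$1$2="$3"/gix;
    return $html;
}

print quote_attrs('<FONT FACE=Arial, Helvetica COLOR=navy blue>'), "\n";
# prints: <FONT FACE="Arial, Helvetica" COLOR="navy blue">
```

Attributes that are already quoted are left alone, because the value part
may not start with a quotation mark.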
--
Sebastian Lange
http://www.sl-chat.de/
Maybe the first chat site that validates as HTML
4.0 even though user input may contain HTML codes.

Courtesy to Dave Raggett's HTML Tidy:
http://www.w3.org/People/Raggett/tidy/

Received on Wednesday, 10 May 2000 04:54:22 UTC