Illegal cp1252 numeric entities not converted to the right Unicode numeric entities

I have some legacy HTML that uses illegal numeric entities within the  
range 128-159.  The Doctype on these documents is HTML 4 Transitional.

I am running a current HTMLTidy build from CVS. I use the following
encoding parameters:

	input-encoding: win1252
	output-encoding: utf8

The illegal numeric entities are flagged as errors and replaced with  
Unicode numeric entities. Unfortunately, the new entities are the  
wrong characters.
	
For example, I have many occurrences of & #134;, which corresponds to
a "dagger" character in the Windows cp1252 encoding. HTMLTidy converts
this dagger to a Uuml. Depending on the config parameters, this
results in either a UTF-8 Uuml character or the equivalent numeric
entity & #220;.

What I would like is for HTMLTidy to recognize these bogus numeric
entities and map them to the equivalent Unicode numeric entities. In
the case of the dagger, this would be & #8224;

Can HTMLTidy do this? If so, any suggestions for parameter  
combinations that would cause this to happen?
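
To make the mapping I am after concrete, here is a minimal sketch of it
as a pre-processing step outside HTMLTidy, written in Python. It is not
something HTMLTidy provides; the function name fix_cp1252_refs is made
up for illustration, and it assumes the entities appear in the file
without the space I have inserted in this message.

```python
import re

# Decimal numeric character references such as &#134;
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def fix_cp1252_refs(html: str) -> str:
    """Rewrite references in the range 128-159 to the Unicode code point
    that Windows cp1252 assigns to that byte value (hypothetical helper)."""

    def replace(match):
        value = int(match.group(1))
        if not 128 <= value <= 159:
            # Outside the problem range: leave the reference as it is.
            return match.group(0)
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values in this range are unassigned in Python's
            # cp1252 codec (129, for example); leave those untouched.
            return match.group(0)
        return "&#%d;" % ord(char)

    return _NUMERIC_REF.sub(replace, html)


print(fix_cp1252_refs("footnote &#134; here"))
```

Run on a file before HTMLTidy sees it, this would turn the dagger
reference into the 8224 form and leave legal references such as 220
alone.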

The parameters I've been playing with are:

	output-xhtml
	numeric-entities
	doctype
	clean
	bare

	
Thanks for any help,
Willis Morse

Received on Sunday, 4 May 2008 15:25:32 UTC