Illegal cp1252 numeric entities not converted to the right unicode numeric entities from Willis Morse on 2008-05-02 (html-tidy@w3.org from April to June 2008)

From: Willis Morse <willismorse@mac.com>
Date: Fri, 2 May 2008 13:45:22 -0400
To: html-tidy@w3.org
Message-Id: <9B903147-67AB-42AB-928D-021E67B356B8@mac.com>

I have some legacy HTML that uses illegal numeric entities within the  
range 128-159.  The Doctype on these documents is HTML 4 Transitional.

I am running a current HTMLTidy build from CVS. I use the following  
encoding parameters:

	input-encoding: win1252
	output-encoding: utf8

The illegal numeric entities are flagged as errors and replaced with  
Unicode numeric entities. Unfortunately, the new entities are the  
wrong characters.
	
For example, I have many occurrences of &#134;, which corresponds to  
a "dagger" character in Windows cp1252 encoding. HTMLTidy converts  
this dagger to a Uuml. Depending upon config parameters, this may  
result in a UTF-8 Uuml character or in the equivalent numeric entity  
&#220;.
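
The two numbers can be checked against Python's standard codecs. The value 220 happens to be what byte 134 maps to in the Mac Roman code page, so one possible explanation (an unconfirmed guess, not something verified against the HTMLTidy source) is that the reference is being run through a different legacy table than cp1252. A minimal sketch:

```python
# Byte value taken from the numeric entity in the example above.
value = 134

# Decode the same single byte under two legacy code pages.
as_cp1252 = bytes([value]).decode("cp1252")        # dagger
as_mac_roman = bytes([value]).decode("mac_roman")  # U with diaeresis

# Print the Unicode code points that each table assigns to the byte.
print(ord(as_cp1252), ord(as_mac_roman))
```

This only demonstrates the coincidence between the observed output and the Mac Roman table; it says nothing about which table HTMLTidy actually consults.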

What I would like is for HTMLTidy to recognize these bogus numeric  
entities and map them to the equivalent Unicode numeric entities. In  
the case of the dagger, this would be: &#8224;
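
The desired mapping can be sketched as a pre-processing pass run before HTMLTidy sees the file. This is only an illustration of the behavior being asked for, not something HTMLTidy provides: the function name `fix_cp1252_refs` is made up for this example, and it handles decimal references only.

```python
import re

# Matches decimal numeric character references such as &#134;
_NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(html: str) -> str:
    """Rewrite references in the range 128-159 to the Unicode code point
    that cp1252 assigns to that byte value (hypothetical helper)."""

    def repl(match: "re.Match[str]") -> str:
        value = int(match.group(1))
        if not 128 <= value <= 159:
            # Outside the problem range: keep the reference as written.
            return match.group(0)
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # 129, 141, 143, 144 and 157 are unassigned in Python's
            # cp1252 codec; leave those references untouched.
            return match.group(0)
        return "&#%d;" % ord(char)

    return _NUMERIC_REF.sub(repl, html)


print(fix_cp1252_refs("a &#134; b"))
```

With the dagger example, the reference for 134 comes out as the reference for 8224, while references outside 128-159 pass through unchanged.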

Can HTMLTidy do this? If so, any suggestions for parameter  
combinations that would cause this to happen?

The parameters I've been playing with are:

	output-xhtml
	numeric-entities
	doctype
	clean
	bare

	
Thanks for any help,
Willis Morse

Received on Sunday, 4 May 2008 15:25:32 UTC