- From: Willis Morse <willismorse@mac.com>
- Date: Fri, 2 May 2008 13:45:22 -0400
- To: html-tidy@w3.org
- Message-Id: <9B903147-67AB-42AB-928D-021E67B356B8@mac.com>
I have some legacy HTML that uses illegal numeric entities in the range 128-159. The doctype on these documents is HTML 4 Transitional. I am running a current HTMLTidy build from CVS, with the following encoding parameters:

    input-encoding: win1252
    output-encoding: utf8

The illegal numeric entities are flagged as errors and replaced with Unicode numeric entities. Unfortunately, the new entities are the wrong characters.

For example, I have many occurrences of &#134;, which corresponds to a "dagger" character in the Windows cp1252 encoding. HTMLTidy converts this dagger to a Uuml. Depending on the config parameters, this results in either a UTF-8 Uuml character or the equivalent numeric entity &#220;.

What I would like HTMLTidy to do is recognize these bogus numeric entities and map them to the equivalent Unicode numeric entity. In the case of the dagger, this would be &#8224;.

Can HTMLTidy do this? If so, are there any suggestions for parameter combinations that would cause this to happen? The parameters I've been playing with are:

    output-xhtml
    numeric-entities
    doctype
    clean
    bare

Thanks for any help,
Willis Morse
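To make the requested mapping concrete, here is a minimal Python sketch of a pre-processing pass that could run before the documents are handed to HTMLTidy. It is not an HTMLTidy feature or option; the function name `remap_c1_entities` is made up for illustration, and the sketch assumes the entities were meant as cp1252 byte values. It only handles decimal references (not hexadecimal ones such as `&#x86;`) and leaves the five codes that cp1252 does not define untouched.

```python
import re

# Matches decimal numeric character references such as &#134;
_NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def remap_c1_entities(html: str) -> str:
    """Rewrite &#128;..&#159; as the Unicode code point that cp1252 assigns
    to that byte value. Hypothetical helper, not part of HTMLTidy."""

    def fix(match: re.Match) -> str:
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                # Interpret the number as a single cp1252 byte.
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # 129, 141, 143, 144 and 157 are undefined in cp1252.
                return match.group(0)
            return f"&#{ord(char)};"
        # Everything outside 128-159 is left as it was.
        return match.group(0)

    return _NUMERIC_ENTITY.sub(fix, html)


if __name__ == "__main__":
    print(remap_c1_entities("note&#134; and caf&#233;"))
```

With this in place, the dagger example from the message above becomes `&#8224;` before HTMLTidy ever sees it, while legitimate references such as `&#220;` pass through unchanged.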
Received on Sunday, 4 May 2008 15:25:32 UTC