- From: Ilya Basin <basinilya@gmail.com>
- Date: Sun, 25 Jul 2010 15:03:40 +0400
- To: html-tidy@w3.org
Some background: I want to convert any HTML document to valid XML (small losses during conversion are acceptable). Then I plan to use XPath to extract portions of the document and finally convert them back to HTML. I don't want to deal with encodings (in the worst case I plan to extract the charset from the HTTP response or from the 'meta' tag once I already have the XML, and put it into the resulting HTML).

I use these tidy options:

    --output-xml yes             # to obtain xml
    --input-encoding raw
    --output-encoding raw        # so tidy won't try to
                                 # convert national chars
                                 # to '&***;' sequences

I prepend the following string to the produced XML document:

    <?xml version="1.0" encoding="ascii"?>

in case the XML parser (for now it's just Firefox) defaults to utf-8.

*********************************************************************

Now the problem arises:
The first complex HTML page I downloaded contained '&#151;'. Even though I use the 'raw' encoding, tidy converts &#151; to the char '0x14' and prints:

    Warning: replacing invalid numeric character reference 151

As a result, Firefox prints the following:

    XML Parsing Error: not well-formed
    Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
    Line Number 8, Column 22:
    <body>bad character: </body>
    ---------------------^

script:

    (
    echo '<?xml version="1.0" encoding="ascii"?>'
    cat test.html | tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw
    ) >tes3.xml

test.html:

    <html>
    <body>
    bad character: &#151;
    </body>
    </html>
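
One possible workaround, sketched here under assumptions rather than as a confirmed fix: XML 1.0 forbids every character below 0x20 except tab, LF and CR, so a stray 0x14 byte alone is enough to make the document not well-formed. (The 0x14 is presumably the low byte of U+2014, the Windows-1252 meaning of code 151.) If losing such bytes counts as an acceptable small loss, they can be filtered out of tidy's output before the file is written. The function name strip_xml_ctrl is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: delete the control bytes that XML 1.0 does not allow.
# Removed:  0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F (octal ranges below).
# Kept:     tab (0x09), LF (0x0A), CR (0x0D) and every byte >= 0x20.
strip_xml_ctrl() {
    tr -d '\000-\010\013\014\016-\037'
}

# Demo: the 0x14 byte (octal 024) is dropped, the trailing newline is kept.
printf 'bad character: \024\n' | strip_xml_ctrl

# Assumed placement in the script from this message (not run here):
#   ( echo '<?xml version="1.0" encoding="ascii"?>'
#     tidy --force-output yes --output-xml yes \
#          --input-encoding raw --output-encoding raw <test.html \
#       | strip_xml_ctrl
#   ) >tes3.xml
```

This only removes the offending byte; it does not recover the dash itself, and it does nothing about bytes above 0x7F that the "ascii" declaration may also reject.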
Received on Monday, 26 July 2010 03:10:30 UTC