- From: Ilya Basin <basinilya@gmail.com>
- Date: Sun, 25 Jul 2010 17:17:28 +0400
- To: Ilya Basin <basinilya@gmail.com>
- CC: html-tidy@w3.org
IB> some background. I want to convert any html doc to a valid xml (small
IB> losses during conversion are acceptable)
IB> Then I plan to use XPath to extract portions of the document and
IB> finally convert them back to html.
IB> I don't want to care about encoding (in worst case I plan to extract
IB> the charset from http response or from the 'meta' tag when I already
IB> have the xml and put it to the resulting html)

IB> I use tidy options:
IB>     --output-xml yes                            # to obtain xml
IB>     --input-encoding raw --output-encoding raw  # so tidy won't try to
IB>                                                 # convert national chars
IB>                                                 # to '&***;' sequences
IB> I cap the produced xml document with the following string:
IB>     <?xml version="1.0" encoding="ascii"?>
IB> in case that the XML parser (now it's just Firefox) has utf-8 by
IB> default.

IB> *********************************************************************
IB> Now the problem rises: The first complex html I downloaded contained
IB> '&#151;'. And even though I use 'raw' encoding, tidy converts &#151;
IB> to char '0x14' and prints:
IB>     Warning: replacing invalid numeric character reference 151
IB> As the result, firefox prints the following:
IB>     XML Parsing Error: not well-formed
IB>     Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
IB>     Line Number 8, Column 22:
IB>     <body>bad character: </body>
IB>     ---------------------^

IB> script :
IB> (
IB> echo '<?xml version="1.0" encoding="ascii"?>'
IB> cat test.html | tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw
IB> ) >tes3.xml

IB> test.html :
IB> <html>
IB> <body>
IB> bad character: &#151;
IB> </body>
IB> </html>

proposed patch (preserve-all-numeric-entities.patch). waiting for comments.
--- tidy.old/src/lexer.c 2008-03-23 00:06:55.000000000 +0300
+++ tidy.new/src/lexer.c 2010-07-25 17:08:28.000000000 +0400
@@ -900,7 +900,7 @@
         /* deal with unrecognized or invalid entities */
         /* #433012 - fix by Randy Waki 17 Feb 01 */
         /* report invalid NCR's - Terry Teague 01 Sep 01 */
-        if ( !found || (ch >= 128 && ch <= 159) || (ch >= 256 && c != ';') )
+        if ( !found || (!preserveEntities && ((ch >= 128 && ch <= 159) || (ch >= 256 && c != ';'))) )
         {
             /* set error position just before offending character */
             SetLexerLocus( doc, lexer );
--
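
For review purposes, the changed condition can be read in isolation. What follows is only a standalone sketch, not code from tidy: the helper name ncr_is_reported and its parameters are made up for illustration, and I am assuming that `found`, `ch`, `c` and `preserveEntities` mean the same as in the hunk above (entity recognized, its code point, the character that ended the reference, and the flag that switches the range checks off).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of tidy: it mirrors only the condition the
 * patch touches. `found` says whether the entity was recognized, `ch` is its
 * code point, `c` is the character that terminated the reference, and
 * `preserve_entities` stands in for the preserveEntities flag in the patch.
 */
static bool ncr_is_reported(bool found, unsigned ch, int c, bool preserve_entities)
{
    bool suspicious = (ch >= 128 && ch <= 159)   /* 128..159, e.g. &#151; */
                   || (ch >= 256 && c != ';');   /* large value, no ';'   */

    /* Unrecognized entities are always reported; the range checks only
       apply while the flag is off. */
    return !found || (!preserve_entities && suspicious);
}

/* Spot checks for the cases discussed in this thread. */
static void check_examples(void)
{
    assert(ncr_is_reported(true, 151, ';', false));   /* old condition: reported  */
    assert(!ncr_is_reported(true, 151, ';', true));   /* flag on: left alone      */
    assert(ncr_is_reported(false, 151, ';', true));   /* not found: still reported */
    assert(!ncr_is_reported(true, 233, ';', false));  /* &#233; never affected    */
}
```

With the flag off the helper behaves like the old condition, so the patch should only change anything for people who explicitly ask for entities to be preserved.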
Received on Monday, 26 July 2010 03:10:30 UTC