- From: Ilya Basin <basinilya@gmail.com>
- Date: Sun, 25 Jul 2010 17:17:28 +0400
- To: Ilya Basin <basinilya@gmail.com>
- CC: html-tidy@w3.org
IB> Some background: I want to convert any HTML document to valid XML (small
IB> losses during conversion are acceptable).
IB> Then I plan to use XPath to extract portions of the document and
IB> finally convert them back to HTML.
IB> I don't want to deal with encodings (in the worst case I plan to extract
IB> the charset from the HTTP response or from the 'meta' tag once I
IB> have the XML, and put it into the resulting HTML).
IB> I use tidy options:
IB> --output-xml yes # to obtain xml
IB> --input-encoding raw --output-encoding raw # so tidy won't try to
IB> # convert national chars
IB> # to '&***;' sequences
IB> I prepend the following declaration to the produced XML document:
IB> <?xml version="1.0" encoding="ascii"?>
IB> in case the XML parser (for now just Firefox) defaults to
IB> UTF-8.
IB> *********************************************************************
IB> Now the problem arises: the first complex HTML page I downloaded contained
IB> '&#151;'. And even though I use the 'raw' encoding, tidy converts &#151;
IB> to char '0x14' and prints:
IB> Warning: replacing invalid numeric character reference 151
IB> As a result, Firefox prints the following:
IB> XML Parsing Error: not well-formed
IB> Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
IB> Line Number 8, Column 22:
IB> <body>bad character: </body>
IB> ---------------------^
IB> script :
IB> (
IB> echo '<?xml version="1.0" encoding="ascii"?>'
IB> cat test.html | tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw
IB> ) >tes3.xml
IB> test.html :
IB> <html>
IB> <body>
IB> bad character: &#151;
IB> </body>
IB> </html>
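A possible explanation for the 0x14 byte, offered as an assumption rather than something verified against tidy's source: code 151 (0x97) is the em dash in Windows-1252, which is U+2014 in Unicode. If tidy remaps the reference to U+2014 and the 'raw' output encoding then writes only the low byte, the result is 0x14, a control character that the XML 1.0 Char production does not allow. The helper names below are hypothetical and exist only to illustrate the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* XML 1.0 "Char" production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
bool xml10_char_allowed(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Hypothetical model of a "raw" writer that keeps only the low byte of a
 * code point. This is an assumption about tidy's behaviour, not its code. */
uint8_t raw_low_byte(uint32_t cp)
{
    return (uint8_t)(cp & 0xFFu);
}
```

Under that assumption, `raw_low_byte(0x2014)` is 0x14 and `xml10_char_allowed(0x14)` is false, which would match the "not well-formed" error reported by Firefox.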
Proposed patch (preserve-all-numeric-entities.patch) below. Waiting for
comments.
--- tidy.old/src/lexer.c 2008-03-23 00:06:55.000000000 +0300
+++ tidy.new/src/lexer.c 2010-07-25 17:08:28.000000000 +0400
@@ -900,7 +900,7 @@
/* deal with unrecognized or invalid entities */
/* #433012 - fix by Randy Waki 17 Feb 01 */
/* report invalid NCR's - Terry Teague 01 Sep 01 */
- if ( !found || (ch >= 128 && ch <= 159) || (ch >= 256 && c != ';') )
+ if ( !found || (!preserveEntities && ((ch >= 128 && ch <= 159) || (ch >= 256 && c != ';'))) )
{
/* set error position just before offending character */
SetLexerLocus( doc, lexer );
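
To make the effect of the changed condition easier to discuss, here is a minimal standalone sketch of it. The function name `entity_is_rejected` is hypothetical. The parameters reuse the names from the diff, and their meanings are assumed from context: `found` says whether the entity was recognized, `ch` is its numeric value, `c` is the character that terminated it, and `preserveEntities` is presumably tied to tidy's preserve-entities option.

```c
#include <stdbool.h>

/* Standalone copy of the patched test (a sketch, not tidy's actual function).
 * Returns true when the entity would take the "unrecognized or invalid" path. */
bool entity_is_rejected(bool found, unsigned ch, int c, bool preserveEntities)
{
    /* The checks that the old code applied unconditionally. */
    bool invalid_ncr = (ch >= 128 && ch <= 159) || (ch >= 256 && c != ';');

    /* With the patch, those checks are skipped when entities are preserved;
     * an entity that was not found is still rejected either way. */
    return !found || (!preserveEntities && invalid_ncr);
}
```

With these assumptions, `entity_is_rejected(true, 151, ';', false)` is true (the old behaviour that produces the warning), while `entity_is_rejected(true, 151, ';', true)` is false, so &#151; would pass through untouched.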
--
Received on Monday, 26 July 2010 03:10:30 UTC