Re: option to disable this check: Warning: replacing invalid numeric character reference ?

From: Ilya Basin <basinilya@gmail.com>
Date: Sun, 25 Jul 2010 17:17:28 +0400
Message-ID: <69761532.20100725171728@gmail.com>
To: Ilya Basin <basinilya@gmail.com>
CC: html-tidy@w3.org
IB> Some background: I want to convert any HTML document to valid XML
IB> (small losses during conversion are acceptable).
IB> Then I plan to use XPath to extract portions of the document and
IB> finally convert them back to HTML.

IB> I don't want to deal with encodings (in the worst case I plan to take
IB> the charset from the HTTP response or from the 'meta' tag once I
IB> already have the XML, and put it into the resulting HTML).

IB> I use tidy options:
IB>   --output-xml yes                           # to obtain xml
IB>   --input-encoding raw --output-encoding raw # so tidy won't try to
IB>                                              # convert national chars
IB>                                              # to '&***;' sequences

IB> I prepend the following declaration to the produced XML document:
IB>   <?xml version="1.0" encoding="ascii"?>
IB> in case the XML parser (for now it's just Firefox) defaults to
IB> UTF-8.
IB> *********************************************************************

IB> Now the problem arises: the first complex HTML page I downloaded
IB> contained '&#151;'. Even though I use the 'raw' encoding, tidy
IB> converts &#151; to the character 0x14 and prints:
IB>     Warning: replacing invalid numeric character reference 151
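
A rough sketch of where the 0x14 probably comes from. It assumes that tidy maps references in the range 128..159 through the Windows-1252 table and that 'raw' output keeps only the low byte of the resulting code point; neither assumption is stated above. The function names are hypothetical and are not tidy's API.

```c
/* Windows-1252 code points for bytes 0x80..0x9F; 0 marks an unassigned slot. */
static const unsigned win1252_c1[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* Hypothetical helper: treat a reference in 128..159 as a Windows-1252 byte. */
unsigned win1252_to_unicode(unsigned ch)
{
    if (ch >= 128 && ch <= 159)
        return win1252_c1[ch - 128];
    return ch;
}

/* Assumption: 'raw' output writes only the low 8 bits of the code point. */
unsigned raw_low_byte(unsigned cp)
{
    return cp & 0xFFu;
}

/* XML 1.0 "Char" production: which code points may appear in a document. */
int xml10_char_ok(unsigned cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Under these assumptions, win1252_to_unicode(151) gives 0x2014 (the em dash), raw_low_byte(0x2014) gives 0x14, and xml10_char_ok(0x14) is 0, which would explain the "not well-formed" error.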

IB> As a result, Firefox prints the following:

IB> XML Parsing Error: not well-formed
IB> Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
IB> Line Number 8, Column 22:
IB> <body>bad character: </body>
IB> ---------------------^

IB> script :
IB>   (
IB>   echo '<?xml version="1.0" encoding="ascii"?>'
IB>   cat test.html | tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw
IB>   ) >tes3.xml

IB> test.html :
IB>   <html>
IB>     <body>
IB>   bad character: &#151;
IB>     </body>
IB>   </html>

Proposed patch (preserve-all-numeric-entities.patch) follows. Waiting
for comments.

--- tidy.old/src/lexer.c        2008-03-23 00:06:55.000000000 +0300
+++ tidy.new/src/lexer.c        2010-07-25 17:08:28.000000000 +0400
@@ -900,7 +900,7 @@
     /* deal with unrecognized or invalid entities */
     /* #433012 - fix by Randy Waki 17 Feb 01 */
     /* report invalid NCR's - Terry Teague 01 Sep 01 */
-    if ( !found || (ch >= 128 && ch <= 159) || (ch >= 256 && c != ';') )
+    if ( !found || (!preserveEntities && ((ch >= 128 && ch <= 159) || (ch >= 256 && c != ';'))) )
     {
         /* set error position just before offending character */
         SetLexerLocus( doc, lexer );
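
To make the effect of the one-line change easier to review, here is a standalone sketch of the condition before and after the patch. It assumes preserveEntities is a boolean taken from tidy's configuration (the hunk does not show where it is set); the function names are hypothetical, and found, ch and c stand for the variables used in lexer.c.

```c
/* Condition before the patch: reject unknown entities, references in the
 * 128..159 range, and references >= 256 that lack a closing ';'. */
int ncr_rejected_old(int found, unsigned ch, unsigned c)
{
    return !found || (ch >= 128 && ch <= 159) || (ch >= 256 && c != ';');
}

/* Condition after the patch: the two range checks apply only when
 * preserveEntities is off; an entity that was not found is still rejected. */
int ncr_rejected_new(int found, int preserveEntities, unsigned ch, unsigned c)
{
    return !found ||
           (!preserveEntities &&
            ((ch >= 128 && ch <= 159) || (ch >= 256 && c != ';')));
}
```

With preserveEntities on, &#151; is no longer treated as invalid (ncr_rejected_new(1, 1, 151, ';') is 0), while with it off the behaviour is the same as before the patch.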



-- 
Received on Monday, 26 July 2010 03:10:30 GMT