
option to disable this check: Warning: replacing invalid numeric character reference ?

From: Ilya Basin <basinilya@gmail.com>
Date: Sun, 25 Jul 2010 15:03:40 +0400
Message-ID: <1971211381.20100725150340@gmail.com>
To: html-tidy@w3.org
Some background: I want to convert any HTML document to valid XML
(small losses during conversion are acceptable).
Then I plan to use XPath to extract portions of the document and
finally convert them back to HTML.

I don't want to deal with encodings (in the worst case I plan to
extract the charset from the HTTP response or from the 'meta' tag once
I already have the XML, and put it into the resulting HTML).

I use tidy options:
  --output-xml yes                           # to obtain xml
  --input-encoding raw --output-encoding raw # so tidy won't try to
                                             # convert national chars
                                             # to '&***;' sequences

I prepend the following declaration to the produced XML document:
  <?xml version="1.0" encoding="ascii"?>
in case the XML parser (for now it's just Firefox) defaults to
UTF-8.
*********************************************************************

Now the problem arises: the first complex HTML page I downloaded
contained '&#151;'. Even though I use the 'raw' encoding, tidy converts
&#151; to the character 0x14 and prints:
    Warning: replacing invalid numeric character reference 151
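
One possible explanation for the 0x14, offered as a guess rather than as confirmed tidy behaviour: 151 is 0x97, which is the em dash in Windows-1252, i.e. U+2014. If tidy maps the reference to U+2014 and the 'raw' output encoding then writes only the low byte, the result would be 0x14. The arithmetic can be checked in the shell (low_byte_hex is a hypothetical helper name, not part of tidy):

```shell
# Hypothetical helper: print the low byte of a code point as two hex digits.
low_byte_hex() {
  printf '%02x\n' $(( $1 & 0xFF ))
}

low_byte_hex 0x2014   # low byte of U+2014 (em dash); prints 14
```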

As a result, Firefox reports the following:

XML Parsing Error: not well-formed
Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
Line Number 8, Column 22:
<body>bad character: </body>
---------------------^

script :
  (
  echo '<?xml version="1.0" encoding="ascii"?>'
  cat test.html | tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw
  ) >tes3.xml

test.html :
  <html>
    <body>
  bad character: &#151;
    </body>
  </html>
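
The Firefox error above is consistent with XML 1.0 itself: control characters below 0x20, other than tab, LF and CR, are not allowed in a document, and 0x14 is one of them. Since small losses are acceptable here, one lossy workaround could be to delete such bytes from tidy's output before prepending the declaration. This is only a minimal sketch, assuming the output is an ASCII-compatible byte stream; strip_xml_ctrl is a hypothetical name, not a tidy option:

```shell
# Hypothetical filter: drop bytes that XML 1.0 forbids
# (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F); tab, LF and CR are kept.
# LC_ALL=C makes tr treat the input as plain bytes.
strip_xml_ctrl() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Usage with the script above (same tidy options as in the original):
#   tidy --force-output yes --output-xml yes \
#        --input-encoding raw --output-encoding raw < test.html | strip_xml_ctrl
```

This drops the offending character rather than preserving the dash, so it only keeps the XML parser happy; it does not answer whether tidy has an option to turn the replacement off.
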
Received on Monday, 26 July 2010 03:10:30 GMT
