- From: Ilya Basin <basinilya@gmail.com>
- Date: Sun, 25 Jul 2010 15:03:40 +0400
- To: html-tidy@w3.org
Some background: I want to convert any HTML document to valid XML (small
losses during conversion are acceptable).
Then I plan to use XPath to extract portions of the document and
finally convert them back to HTML.
I don't want to deal with encodings (in the worst case, I plan to extract
the charset from the HTTP response or from the 'meta' tag once I already
have the XML, and put it into the resulting HTML).
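For the worst-case charset lookup mentioned above, a rough sketch in shell could look like this. The helper name meta_charset is hypothetical, and the pattern only covers simple charset=... attributes on a single line; real-world pages may need a proper parser.

```shell
# Hypothetical helper: print the first charset value found on stdin.
# Assumes the charset appears as charset=VALUE, optionally quoted,
# e.g. inside <meta http-equiv="Content-Type" ...> or <meta charset="...">.
meta_charset() {
    # LC_ALL=C makes sed treat the input as plain bytes, so an unknown
    # encoding cannot break the match.
    LC_ALL=C sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=["'"'"']\{0,1\}\([A-Za-z0-9._:-]\{1,\}\).*/\1/p' |
        head -n 1
}
```

Usage would be something like: meta_charset <test.html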
I use these tidy options:
--output-xml yes  # to obtain XML
--input-encoding raw --output-encoding raw  # so tidy won't try to convert
                                            # national chars to '&***;' sequences
I prepend the following declaration to the produced XML document:
<?xml version="1.0" encoding="ascii"?>
in case the XML parser (for now it's just Firefox) defaults to UTF-8.
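The step of adding the declaration could be wrapped in a small helper so that the declared encoding is easy to change later. This is only a sketch: cap_xml is a hypothetical name, not part of tidy, and it defaults to the 'ascii' value used here.

```shell
# Hypothetical helper: write an XML declaration, then copy stdin unchanged.
# $1 is the encoding name to declare; it defaults to "ascii".
cap_xml() {
    printf '<?xml version="1.0" encoding="%s"?>\n' "${1:-ascii}"
    cat
}
```

For example, tidy's output could be piped through cap_xml, or through cap_xml with a charset name as its argument.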
*********************************************************************
Now the problem arises: the first complex HTML page I downloaded contained
'&#151;'. Even though I use the 'raw' encoding, tidy converts &#151;
to the character 0x14 and prints:
Warning: replacing invalid numeric character reference 151
As a result, Firefox reports the following:
XML Parsing Error: not well-formed
Location: file:///.snapshots/persist/builds/sgml/vs/tes3.xml
Line Number 8, Column 22:
<body>bad character: </body>
---------------------^
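The error is consistent with XML 1.0's character rules: among the control characters below 0x20, only tab (0x09), LF (0x0A) and CR (0x0D) are allowed, so a literal 0x14 byte makes the document not well-formed. As a lossy workaround sketch (the helper names are hypothetical, and this drops the character entirely instead of preserving the dash), such bytes could be counted or removed before the file reaches the parser:

```shell
# Octal ranges below cover 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F, i.e. every
# C0 control byte except tab, LF and CR.

# Hypothetical helper: delete control bytes that XML 1.0 forbids.
strip_xml_ctrl() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Hypothetical helper: print how many forbidden control bytes stdin contains.
count_xml_ctrl() {
    # -c complements the set, so -cd deletes everything EXCEPT those bytes.
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037' | wc -c | tr -d ' '
}
```

Either helper reads stdin, so it can be placed after tidy in a pipeline. Since the input is treated as raw bytes, this is only safe for encodings where these byte values never occur inside multi-byte characters.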
Script:
(
# Write the XML declaration first, then tidy's XML output.
echo '<?xml version="1.0" encoding="ascii"?>'
tidy --force-output yes --output-xml yes --input-encoding raw --output-encoding raw <test.html
) >tes3.xml
test.html:
<html>
<body>
bad character: &#151;
</body>
</html>
Received on Monday, 26 July 2010 03:10:30 UTC