- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 17 May 2006 00:53:10 +0000
- To: www-validator-cvs@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3289
Summary: utf8 web site causes tool to break
Product: LinkChecker
Version: 4.2.1
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: checklink
AssignedTo: ville.skytta@iki.fi
ReportedBy: bruce@altmann.com
QAContact: www-validator-cvs@w3.org
When checking a UTF-8 web site, the tool cannot handle the encoding and breaks.
(Even worse, on some sites it appears to keep working, but it cannot read the
content and so completely misses things.)
First the GET complains:
"Parsing undecoded UTF-8 will create garbage in ...5.8.8/Protocols.pm line 114"
(then it tries to read what it can).
Then, after it prints "Parsing..." (so this is clearly in &parse_document), it
complains again:
"Parsing undecoded UTF-8 will create garbage in checklink line #."
(This can even be the first warning, if the header was plain enough for the GET
stage to get by without complaining.)
search.cpan.org mentions this error under HTML::Parser and says to
Encode::encode_utf8 the data before calling parse (but the example is a little
sparse).
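
For what it's worth, here is my rough reading of that advice as a standalone
sketch. The URL is just a placeholder, and I am assuming the point is to turn
the raw bytes into decoded Perl characters (or to set utf8_mode) before handing
them to parse, since that is what the warning seems to be about:

#!/usr/bin/perl
# Standalone sketch only -- not checklink code.  The URL is a placeholder.
use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(decode);
use HTML::Parser;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.org/utf8-page.html');  # placeholder
die $res->status_line unless $res->is_success;

# $res->content is raw bytes; feeding those straight to parse() is what
# triggers the "Parsing undecoded UTF-8 will create garbage" warning.
my $bytes = $res->content;

# Option 1: decode the bytes into Perl characters first, then parse.
my $chars = decode('UTF-8', $bytes);

my $p = HTML::Parser->new(
    start_h => [ sub { print "start tag: $_[0]\n" }, 'tagname' ],
);
$p->parse($chars);
$p->eof;

# Option 2: leave the bytes alone and tell the parser they are UTF-8
# (utf8_mode exists in newer HTML::Parser versions, if I read the docs right):
#   $p->utf8_mode(1);
#   $p->parse($bytes);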
Request:
Can you explain two things to me, for a possible code tweak on my part?

1. The code seems to know the encoding already (the web version reports utf8).
   Is this correct?  What part of the code identifies it, or does it just read
   it from the header?

2. What two points in the code need to be told "hey, this is utf8" (either from
   the code already knowing, or by passing it in as a --encoding XXX command
   line arg)?  I assume something before the GET and something before the parse
   in &parse_document (see the sketch below).
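
To make question 2 concrete, here is roughly the shape I have in mind. Every
name below (fetch_and_decode, $opt_encoding, the parse_document call) is an
invented placeholder -- I don't know how checklink is actually structured -- so
this is just a sketch of "decode once, using the header charset or a --encoding
override, before anything parses":

# Sketch only -- all names here are invented placeholders, not checklink code.
use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(decode);

my $opt_encoding;   # would come from a hypothetical --encoding XXX option

# Point 1: right after the GET, figure out the charset and decode the body.
sub fetch_and_decode {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url);
    die $res->status_line unless $res->is_success;

    # Prefer a forced --encoding; otherwise take the charset from the
    # Content-Type header (e.g. "text/html; charset=utf-8").
    my $charset = $opt_encoding;
    my $ct = $res->header('Content-Type') || '';
    if (!$charset && $ct =~ /charset=["']?([\w.-]+)/i) {
        $charset = $1;
    }
    $charset ||= 'UTF-8';   # last-resort guess

    return decode($charset, $res->content);
}

# Point 2: whatever sits in front of HTML::Parser (&parse_document, I assume)
# then receives decoded characters, so parse() has nothing to warn about:
#   my $doc = fetch_and_decode($uri);
#   parse_document($doc);   # hypothetical call

If the code already knows the encoding somewhere (question 1), then presumably
only the decode step is new and the charset detection is already there.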
-Bruce
Received on Wednesday, 17 May 2006 00:53:13 UTC