[Bug 3289] utf8 web site causes tool to break from bugzilla@wiggum.w3.org on 2006-05-17 (www-validator-cvs@w3.org from May 2006)

From: <bugzilla@wiggum.w3.org>
Date: Wed, 17 May 2006 00:53:10 +0000
To: www-validator-cvs@w3.org
CC:
Message-Id: <E1FgAHm-0007Tb-KF@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3289

           Summary: utf8 web site causes tool to break
           Product: LinkChecker
           Version: 4.2.1
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: checklink
        AssignedTo: ville.skytta@iki.fi
        ReportedBy: bruce@altmann.com
         QAContact: www-validator-cvs@w3.org


When checking a UTF-8 web site, the tool cannot handle the encoding and breaks.
(Even worse, it works on some sites but cannot read them properly, and thus
completely misses things.)


First the GET complains:

Parsing undecoded UTF-8 will create garbage in ...5.8.8/Protocols.pm line 114
(then it tries to read what it can)


Then (and it can be the first warning if the header was raw enough to get by),
at
"Parsing..."
(so clearly in &parse_document)
it complains again:
"Parsing undecoded UTF-8 will create garbage in checklink line #."



search.cpan.org mentions this error in the HTML::Parser documentation,
which says to call Encode::encode_utf8 before calling parse
(but the example there is a little sparse).
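
The general idea behind that advice is to settle the encoding of the data
before the parser sees it. The sketch below illustrates this in Python, not
in checklink's Perl, so it is an analogy rather than a patch: Python's
html.parser stands in for HTML::Parser, the names LinkCollector and
extract_links are hypothetical, and the charset is assumed to be known
already.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(raw_bytes, charset="utf-8"):
    # Decode the raw octets into text first, so the parser never has to
    # mix undecoded bytes with decoded character entities.
    text = raw_bytes.decode(charset, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links
```

With the decoding done up front, non-ASCII characters and entities in
attribute values come out as ordinary text in the collected links.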


Request:

Can you explain two things to me for a possible code tweak on my part?

1. The code seems to know the encoding (the web version reports utf8).
Is this correct? What part of the code identifies this? (Or does it just read
it from the header?)

2. What two points in the code need to be told "hey, this is utf8"
(either from the code already knowing, or by passing this in as a
--encoding XXX command line arg)?
I assume something before the GET
and something before the parse in &parse_document.
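
As a rough illustration of how the two sources of encoding information could
be combined, here is a minimal Python sketch (again not checklink's Perl).
Everything in it is an assumption for the sake of the example: the
--encoding option is the hypothetical flag proposed above, not an existing
checklink option, and the helper names and the iso-8859-1 fallback are made
up.

```python
import argparse
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def pick_encoding(cli_encoding, content_type, default="iso-8859-1"):
    # Precedence: explicit command-line override, then the HTTP header,
    # then a fallback default (an assumption, not checklink's behavior).
    return cli_encoding or charset_from_header(content_type) or default


def build_arg_parser():
    # Hypothetical --encoding option, as proposed in the request above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--encoding", default=None)
    return parser
```

The chosen encoding would then be applied once after the GET, when the
response body is decoded, so that the parse step receives text in a known
encoding.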

-Bruce

Received on Wednesday, 17 May 2006 00:53:13 UTC