Re: link checker and IRIs

From: Martin Duerst <duerst@w3.org>
Date: Fri, 03 Sep 2004 10:41:20 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-qa-dev@w3.org

At 11:04 04/09/02 +0900, Martin Duerst wrote:
>> >>It would also be good if you could implement any transcoding stuff, etc.
>> >>in a way compatible with perlunicode, setting the UTF-8 flag etc.
>I have started working on that. I had to test what Perl actually meant
>with 'illegal UTF-8'. My findings were that it didn't complain about
>3-byte surrogates, and allowed characters >U+10FFFF. But otherwise,
>I have found that using Encode, I can reduce the code quite a bit.
>Unfortunately, I got stuck with some very weird phenomenon:
>Many Japanese pages (shift_jis and other) work very well with my
>new code, but the Google JP page just won't transcode, see
>a rset=%28detect+automatically%29&doctype=%28detect+automatically%29&ss=1.
>I have verified that the transcoding code is actually used, that
>the resulting lines have the UTF8 flag set, and also that the
>pattern of readable characters and garbage that you can see
>is still shift_jis. Any advice on what to test next is highly

That problem is now mostly solved. The cause was that my code didn't
convert the last line of a file, and google.co.jp had everything
interesting in a single long last line.
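
The last-line failure described above can be sketched as follows. This is a minimal illustration in Python, not the Perl/Encode code under discussion: the function names are hypothetical, and passing the undecoded tail through as Latin-1 is an assumption chosen to mimic output where ASCII stays readable and the rest is garbage.

```python
def transcode_buggy(data: bytes, encoding: str) -> str:
    """Decode newline-terminated lines only (hypothetical buggy variant)."""
    out = []
    start = 0
    while True:
        end = data.find(b"\n", start)
        if end == -1:
            # Bug: the unterminated last line is never transcoded.
            # Its raw bytes are passed through one-to-one (Latin-1),
            # so ASCII stays readable and everything else is garbage.
            out.append(data[start:].decode("latin-1"))
            break
        out.append(data[start:end + 1].decode(encoding))
        start = end + 1
    return "".join(out)


def transcode_fixed(data: bytes, encoding: str) -> str:
    """Decode every line, including a final line without a newline."""
    # splitlines(keepends=True) also yields the trailing fragment.
    # Splitting on bytes is safe for shift_jis because 0x0A and 0x0D
    # never occur as the second byte of a two-byte character.
    return "".join(
        line.decode(encoding) for line in data.splitlines(keepends=True)
    )
```

With input such as `"日本語\nGoogle 検索".encode("shift_jis")`, the fixed variant returns the original text, while the buggy one decodes the first line correctly and leaves the kanji on the last line as garbage. A page whose content sits in a single long last line would therefore look entirely untranscoded.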

Regards,     Martin.
Received on Friday, 3 September 2004 01:41:36 UTC