Re: link checker and IRIs

From: Martin Duerst <duerst@w3.org>
Date: Fri, 03 Sep 2004 10:41:20 +0900
Message-Id: <4.2.0.58.J.20040903104005.05da25e0@localhost>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-qa-dev@w3.org

At 11:04 04/09/02 +0900, Martin Duerst wrote:
>> >>It would also be good if you could implement any transcoding stuff, etc.
>> >>in a way compatible with perlunicode, setting the UTF-8 flag etc.
>
>I have started working on that. I had to test what Perl actually meant
>with 'illegal UTF-8'. My findings were that it didn't complain about
>3-byte surrogates, and allowed characters >U+10FFFF. But otherwise,
>I have found that using Encode, I can reduce the code quite a bit.
>
>Unfortunately, I got stuck with some very weird phenomenon:
>Many Japanese pages (shift_jis and other) work very well with my
>new code, but the Google JP page just won't transcode, see
>http://qa-dev.w3.org/wmvs/duerst/check?uri=http%3A%2F%2Fwww.google.co.jp&charset=%28detect+automatically%29&doctype=%28detect+automatically%29&ss=1.
>
>I have verified that the transcoding code is actually used, that
>the resulting lines have the UTF8 flag set, and also that the
>pattern of readable characters and garbage that you can see
>is still shift_jis. Any advice on what to test next is highly
>appreciated!
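
For comparison with the 'illegal UTF-8' findings quoted above, here is a
minimal sketch of a strict check, written in Python rather than Perl and
not taken from the validator code. The function name is hypothetical.
It relies only on Python's built-in strict UTF-8 decoder, which rejects
both 3-byte encoded surrogates and code points above U+10FFFF.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # The default error handler is 'strict': encoded surrogates
        # (ED A0 80 .. ED BF BF) and anything above U+10FFFF raise.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


surrogate = b"\xed\xa0\x80"        # U+D800 encoded as three bytes
too_large = b"\xf4\x90\x80\x80"    # would be U+110000
japanese = "\u65e5\u672c\u8a9e".encode("utf-8")
```

A validator that wants to flag these two cases would need a check of
this kind in addition to whatever the transcoding layer accepts.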

That problem is now mostly solved. The cause was that my code didn't
convert the last line of a file, and google.co.jp had everything
interesting in a single long last line.
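
The last-line problem described above can be illustrated with a small
sketch, in Python rather than the Perl actually used, and under the
assumption that the input is split on newline bytes and decoded line by
line. The function name is hypothetical. The point is the final step: a
loop that only handles newline-terminated lines silently skips whatever
follows the last newline.

```python
def transcode_lines(data: bytes, charset: str) -> list[str]:
    """Decode data line by line, including an unterminated last line."""
    lines = []
    start = 0
    while True:
        end = data.find(b"\n", start)
        if end == -1:
            break
        # Decode one newline-terminated line.
        lines.append(data[start:end + 1].decode(charset))
        start = end + 1
    # Easy to forget: the remainder after the last newline. A page that
    # is one long unterminated line lives entirely in this remainder.
    if start < len(data):
        lines.append(data[start:].decode(charset))
    return lines


page = "\u65e5\u672c\u8a9e\n\u691c\u7d22".encode("shift_jis")
result = transcode_lines(page, "shift_jis")
```

Here `page` ends without a newline, so the second element of `result`
comes only from the remainder step.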


Regards,     Martin.
Received on Friday, 3 September 2004 01:41:36 GMT
