- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 23 Apr 2004 08:55:10 +0900
- To: olivier Thereaux <ot@w3.org>, validators community <www-validator@w3.org>
Hello Olivier,
I'm off today, so I can only give a preliminary answer.
By the time the source code gets printed, it should all
have been converted to UTF-8. But the output looks as if
it contains some bytes that aren't UTF-8. I'll try to look
into this next week. What you may want to do is:
- download the page in question
- run iconv on it to convert it to UTF-8
- check for full UTF-8 compliance with a regexp such as
the one at http://www.w3.org/International/questions/qa-forms-utf-8.html.
If something gets caught, then that would be an error in
iconv.
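
The convert-then-check steps above could be sketched roughly as follows.
This is only an illustration in Python, not the validator's code: the
function names are made up for the example, Python's built-in codec
stands in for iconv, shift_jis is assumed as the source encoding, and
the byte pattern is transcribed from the page cited above.

```python
import re

# Byte-level pattern for well-formed UTF-8, transcribed from the W3C
# page cited above. Like the original, it accepts only tab, LF and CR
# among the ASCII control characters.
UTF8_WELL_FORMED = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*\Z""",
    re.VERBOSE,
)


def to_utf8(raw: bytes, source_encoding: str = "shift_jis") -> bytes:
    """Transcode raw bytes to UTF-8 (hypothetical stand-in for the iconv step).

    Raises UnicodeDecodeError if the input is not valid in source_encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches the pattern above."""
    return UTF8_WELL_FORMED.match(data) is not None
```

Running is_well_formed_utf8(to_utf8(page_bytes)) on the downloaded page
should return True if the conversion is clean. A buffer that has been cut
in the middle of a multibyte character fails the check, which is the kind
of damage the quoted output appears to show.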
Regards, Martin.
At 08:01 04/04/23 +0900, olivier Thereaux wrote:
>While fixing minor validity bugs in the validator's own output, I
>noticed this interesting one.
>
>Typical test case: validating the validation output for a shift_jis
>encoded page (in my case, the google.co.jp homepage)
>
>Symptom: in its error output, the validator quotes part of the source
>for the validated page.
>
>relevant check code:
>
>[[
>...
> print qq{<span class="msg">$msg</span></p>};
> print qq(<p><code class="input">$line</code></p>);
>...
>]]
>
>$line appears to be a truncated part of the validated markup source,
>which is fine unless the truncation botches up the first character, as
>shown here:
>[[
><p><code
>class="input">...$BIq(B</font></b>&nbsp;&nbsp;&
>nbsp;&nbsp;<strong title="Position where error was
>detected."><</strong>a id=1a class=q
>href="/imghp?hl=ja&tab=</code></p>
>]]
>or the last one, as shown here:
>[[
><p><code class="input">...;&nbsp;<a id=1a class=q
>href="/imghp?<strong title="Position where error was
>detected.">h</strong>l=ja&tab=wi&ie=UTF-8&oe=Shift_JIS"
>>$B(I%!'î(B
>]]
>
>I am far from being an expert on that part of the code, but it seems
>like a typical i18n problem. I am copying Martin, who helped a lot in
>the past in charset detection and transcoding.
>
>Martin, any idea what's going on here and how to fix this?
>--
>olivier
>
Received on Thursday, 22 April 2004 19:55:30 UTC