Re: [markup validator] source quoting i18n bug?

Hello Olivier,

I'm off today, so I can only give a preliminary answer.
By the time the source code gets printed, it should all
have been converted to UTF-8. But the output looks as if it
contains some bytes that aren't UTF-8. I'll try to look into this
next week. What you may want to do is:
- download the page in question
- run iconv on it to convert it to UTF-8
- check for full UTF-8 compliance with a regexp such as
   the one at http://www.w3.org/International/questions/qa-forms-utf-8.html.

If that check finds anything that is not valid UTF-8, then
that would be an error in iconv.
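
The last step above could be sketched roughly as follows. This is
only an illustration in Python, not validator code. It assumes the
page was already saved and converted with something like
"iconv -f SHIFT_JIS -t UTF-8 page.html > page-utf8.html" (both file
names are made up), and the byte pattern is written in the style of
the one on the page referenced above rather than copied from it. The
function name first_invalid_offset is likewise hypothetical.

```python
import re

# One well-formed UTF-8 character, matched at the byte level.
# Assumption: modelled on the style of the W3C pattern; like it,
# this rejects control characters other than tab, LF and CR.
UTF8_CHAR = re.compile(rb"""
    (?: [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )
""", re.VERBOSE)


def first_invalid_offset(data: bytes):
    """Return the byte offset of the first sequence that is not
    well-formed UTF-8, or None if the whole input passes."""
    pos = 0
    while pos < len(data):
        m = UTF8_CHAR.match(data, pos)
        if m is None:
            return pos      # nothing valid starts at this byte
        pos = m.end()       # step over one complete character
    return None
```

To try it, read the converted file in binary mode, e.g.
first_invalid_offset(open("page-utf8.html", "rb").read()).
Anything other than None gives the byte offset to look at. A
multi-byte character cut off in the middle is reported the same way,
which may be useful for the truncation case as well.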

Regards,    Martin.

At 08:01 04/04/23 +0900, olivier Thereaux wrote:
>While fixing minor validity bugs in the validator, I noticed this
>interesting one.
>
>Typical test case: validating the validation output for a shift_jis
>encoded page (in my case, the google.co.jp homepage)
>
>Symptom: in its error output, the validator quotes part of the source
>for the validated page.
>
>relevant check code:
>
>[[
>...
>    print qq{<span class="msg">$msg</span></p>};
>    print qq(<p><code class="input">$line</code></p>);
>...
>]]
>
>$line appears to be a truncated part of the validated markup source,
>which is fine unless the truncating botches up the first character, as
>shown here:
>[[
><p><code
>class="input">...$BIq(B&#60;/font&#62;&#60;/b&#62;&#38;nbsp;&#38;nbsp;&#38; 
>nbsp;&#38;nbsp;<strong title="Position where error was
>detected.">&#60;</strong>a id=1a class=q
>href=&#34;/imghp?hl=ja&#38;tab=</code></p>
>]]
>or the last one, as shown here:
>[[
><p><code class="input">...;&#38;nbsp;&#60;a id=1a class=q
>href=&#34;/imghp?<strong title="Position where error was
>detected.">h</strong>l=ja&#38;tab=wi&#38;ie=UTF-8&#38;oe=Shift_JIS&#34; 
>&#62;$B(I%!'î(B
>]]
>
>I am far from being an expert on that part of the code, but it seems
>like a typical i18n problem. I am copying Martin, who helped a lot in
>the past in charset detection and transcoding.
>
>Martin, any idea what's going on here and how to fix this?
>--
>olivier
>

Received on Thursday, 22 April 2004 19:55:30 UTC