[markup validator] source quoting i18n bug? from olivier Thereaux on 2004-04-22 (www-validator@w3.org from April 2004)

From: olivier Thereaux <ot@w3.org>
Date: Fri, 23 Apr 2004 08:01:22 +0900
To: validators community <www-validator@w3.org>
Cc: Martin Duerst <duerst@w3.org>
Message-Id: <F985656A-94B0-11D8-9612-000393A63FC8@w3.org>

While fixing minor validity bugs in the validator's own output, I
noticed this interesting one.

Typical test case: validating the validation output for a
Shift_JIS-encoded page (in my case, the google.co.jp homepage).

Symptom: in its error output, the validator quotes part of the source  
for the validated page.

Relevant code in check:

[[
...
    print qq{<span class="msg">$msg</span></p>};
    print qq(<p><code class="input">$line</code></p>);
...
]]

$line appears to be a truncated part of the validated markup source,
which is fine unless the truncation mangles the first character, as
shown here:
[[
<p><code  
class="input">...Éñ&#60;/font&#62;&#60;/b&#62;&#38;nbsp;&#38;nbsp;&#38; 
nbsp;&#38;nbsp;<strong title="Position where error was  
detected.">&#60;</strong>a id=1a class=q  
href=&#34;/imghp?hl=ja&#38;tab=</code></p>
]]
or the last one, as shown here:
[[
<p><code class="input">...;&#38;nbsp;&#60;a id=1a class=q  
href=&#34;/imghp?<strong title="Position where error was  
detected.">h</strong>l=ja&#38;tab=wi&#38;ie=UTF-8&#38;oe=Shift_JIS&#34; 
&#62;„Ç§„</code></p>
]]
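
To illustrate the suspected failure mode, here is a minimal Perl sketch. It is not the validator's actual code: it assumes the excerpt is cut at byte offsets while the source is still in its original encoding, and it assumes the error position is available as a character offset. The helper name excerpt_chars, its parameters, and the sample string are all made up for the example. The idea is simply to decode to characters before truncating.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of the validator): return up to $width
# characters of $bytes around character offset $col. Decoding first means
# the cut can never land inside a multi-byte character.
sub excerpt_chars {
    my ($bytes, $charset, $col, $width) = @_;
    my $text  = decode($charset, $bytes);    # octets -> character string
    my $start = $col - int($width / 2);
    $start = 0 if $start < 0;
    return substr($text, $start, $width);    # substr now counts characters
}

# Sample input (an assumption for this sketch): three kanji followed by
# ASCII markup, encoded as Shift_JIS. Each kanji takes two bytes.
my $text  = "\x{65E5}\x{672C}\x{8A9E}<a id=1a>";
my $bytes = encode('shiftjis', $text);

# Byte-level cut: offset 1 is the second byte of the first kanji, so this
# excerpt begins with half a character.
my $byte_cut = substr($bytes, 1, 8);

# Character-level cut on the decoded text.
my $char_cut = excerpt_chars($bytes, 'shiftjis', 4, 8);

binmode(STDOUT, ':encoding(UTF-8)');
print "byte cut (hex): ",
    join(' ', map { sprintf '%02X', ord } split //, $byte_cut), "\n";
print "character cut:  $char_cut\n";
```

If the excerpt is then written out in a different encoding than the source, it would also need to be re-encoded with encode() after the cut; that step is left out here.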

I am far from being an expert on that part of the code, but it seems
like a typical i18n problem. I am copying Martin, who has helped a lot
in the past with charset detection and transcoding.

Martin, any idea what's going on here and how to fix this?
-- 
olivier

Received on Thursday, 22 April 2004 19:02:00 UTC