- From: olivier Thereaux <ot@w3.org>
- Date: Fri, 23 Apr 2004 08:01:22 +0900
- To: validators community <www-validator@w3.org>
- Cc: Martin Duerst <duerst@w3.org>
- Message-Id: <F985656A-94B0-11D8-9612-000393A63FC8@w3.org>
While fixing some minor validity bugs in the validator's own output, I
noticed this interesting one.
Typical test case: validating the validation output for a shift_jis
encoded page (in my case, the google.co.jp homepage)
Symptom: in its error output, the validator quotes part of the source
for the validated page.
Relevant check code:
[[
...
print qq{<span class="msg">$msg</span></p>};
print qq(<p><code class="input">$line</code></p>);
...
]]
$line appears to be a truncated part of the validated markup source,
which is fine unless the truncation botches the first character, as
shown here:
[[
<p><code
class="input">...Éñ</font></b>&nbsp;&nbsp;&
nbsp;&nbsp;<strong title="Position where error was
detected."><</strong>a id=1a class=q
href="/imghp?hl=ja&tab=</code></p>
]]
or the last one, as shown here:
[[
<p><code class="input">...;&nbsp;<a id=1a class=q
href="/imghp?<strong title="Position where error was
detected.">h</strong>l=ja&tab=wi&ie=UTF-8&oe=Shift_JIS"
>„ǧ„</code></p>
]]
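One possible explanation, which I have not confirmed against the check code, is that the excerpt is cut at a byte offset rather than at a character boundary, so a two-byte shift_jis character gets split in half. Here is a minimal sketch of that effect, written in Python rather than the validator's Perl. The sample text and the helper names (truncate_bytes, truncate_chars, is_valid) are made up for the illustration and do not come from the validator.

```python
# Hypothetical illustration only; this is not the validator's code.
# Sample Japanese text (an assumption, not taken from the validated page).
text = "検索オプション"
raw = text.encode("shift_jis")  # every character here takes two bytes


def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Naive truncation: cut at a byte offset, ignoring character boundaries."""
    return data[:limit]


def truncate_chars(data: bytes, limit: int, encoding: str = "shift_jis") -> bytes:
    """Decode first, cut on a character boundary, then encode again."""
    return data.decode(encoding)[:limit].encode(encoding)


def is_valid(data: bytes, encoding: str = "shift_jis") -> bool:
    """Report whether the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


broken = truncate_bytes(raw, 5)  # stops in the middle of the third character
safe = truncate_chars(raw, 2)    # keeps two whole characters (four bytes)

print(is_valid(broken), is_valid(safe))
```

If this is what happens, decoding the source to characters before truncating (or backing the cut off to the nearest character boundary) would avoid the stray bytes at either end of the excerpt.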
I am far from being an expert on that part of the code, but it seems
like a typical i18n problem. I am copying Martin, who has helped a lot
in the past with charset detection and transcoding.
Martin, any idea what's going on here and how to fix this?
--
olivier
Received on Thursday, 22 April 2004 19:02:00 UTC