Re: Fallback to UTF-8 from olivier Thereaux on 2008-05-05 (www-validator@w3.org from May 2008)

From: olivier Thereaux <ot@w3.org>
Date: Mon, 5 May 2008 10:21:12 +0900
To: Andreas Prilop <prilop2008@trashmail.net>
Cc: www-validator@w3.org
Message-Id: <0B923BB8-F23E-4BE6-AC6B-7EAA9100D7D5@w3.org>

On 2-May-08, at 11:09 PM, Andreas Prilop wrote:
> With UTF-8 or Windows-1252 assumed, the W3C validator simply gives up
> and does nothing
>
>   "Sorry! This document can not be checked."
>
> when it finds some byte (or byte sequence) that it cannot
> interpret as Windows-1252 or UTF-8.

Which is why the validator was patched to try Latin-1 after UTF-8 and
Windows-1252 have both failed. Could you give it a look?

http://qa-dev.w3.org/wmvs/HEAD/
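
For illustration only, here is a minimal sketch of that fallback order in Python. It is not the validator's actual code; the names decode_with_fallback and FALLBACK_ENCODINGS are hypothetical, and the codec names are the ones Python's standard library uses.

```python
# Hypothetical sketch of an encoding fallback chain (not the validator's code).
# Order follows the message above: UTF-8, then Windows-1252, then Latin-1.
FALLBACK_ENCODINGS = ("utf-8", "windows-1252", "iso-8859-1")


def decode_with_fallback(data: bytes, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding_name) for the first encoding that decodes cleanly."""
    for name in encodings:
        try:
            # Strict decoding: any byte sequence invalid in this encoding raises.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Only reachable if every candidate rejected the input.
    raise ValueError("could not determine the encoding of the document")


if __name__ == "__main__":
    for sample in (b"caf\xc3\xa9", b"caf\xe9", b"caf\x81"):
        text, used = decode_with_fallback(sample)
        print(repr(sample), "decoded as", used)
```

In Python, Latin-1 maps every byte to a character, so with it last in the list the final error is never reached. The cost is that a document in some other encoding will decode "successfully" into the wrong characters, which is the trade-off being discussed in this thread.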

> The W3C validator just reports "non SGML character number ...",
> which is still better than to sit there and to do nothing.

Arguably. For experts in SGML and markup languages, yes, "non SGML  
character" is an obvious sign of an encoding issue. For most people,  
however, "non SGML character number" is gibberish, whereas "sorry,  
there is a problem because I could not determine the encoding of your  
document" is somewhat understandable.

Received on Monday, 5 May 2008 01:21:45 UTC