Re: W3C "cannot interpret as utf-8" from Martin Duerst on 2002-09-02 (www-validator@w3.org from September 2002)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 02 Sep 2002 11:26:53 +0900
To: Tim Bagot <tsb-w3-validator-0006@earth.li>, <www-validator@w3.org>
Message-Id: <4.2.0.58.J.20020902112129.064fb768@localhost>

I don't see a bug in the original document
(http://www.comiclist.com/lists/search.html) anymore;
I assume it has been fixed.

Tim is of course correct: a lone \xa0 *byte* is not valid UTF-8 at all.
The *character* U+00A0 is legal (X)HTML (it's &nbsp;), but that
character is represented as the two bytes \xC2\xA0 in UTF-8.
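
As a quick illustration of the byte/character distinction above, here is a
minimal Python sketch. The helper name is_valid_utf8 is made up for this
example and is not part of any validator; it only relies on Python's
built-in UTF-8 codec, which rejects a stray continuation byte.

```python
# U+00A0 (NO-BREAK SPACE) as a character, and its UTF-8 encoding.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")
print(encoded)  # b'\xc2\xa0'


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\xc2\xa0"))  # True: the two-byte sequence is fine
print(is_valid_utf8(b"\xa0"))      # False: a lone continuation byte

# The same single byte is a no-break space when read as ISO-8859-1,
# which is a likely source of such bytes in a page labelled utf-8.
print(b"\xa0".decode("iso-8859-1") == nbsp)  # True
```

So a document that contains a bare \xa0 byte has to be either re-encoded
or labelled with the encoding it actually uses.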

If Page Valet or the WDG validator were happy with a single \xa0
*byte* in UTF-8, then they have a bug and should be fixed.

Regards,    Martin.

At 12:10 02/09/01 +0000, Tim Bagot wrote:

>At 2002-09-01T11:37-0000, Nick Kew wrote:-
>
> > In article <n6l3nuso0b71kdg7mr6lemgls8bi20bh5i@4ax.com>, one of infinite
> > monkeys at the keyboard of Charles LePage <chuck@comiclist.com> wrote:
>
> > >             "Sorry, I am unable to validate this document because on
> > > lines 58, 59 it contained some byte(s) that I cannot interpret as utf-8.
> > > Please check both the content of the file and the character encoding
> > > indication."
>
> > For want of a better explanation, I'd guess it's your \xa0 characters.
> > That is certainly a bug in the validator,
>
>I don't see why it would be a bug in the validator. A0 by itself is not a
>valid UTF-8 octet sequence. It is valid only as a second or subsequent
>octet of a multi-octet sequence.
>
>
>Tim Bagot

Received on Sunday, 1 September 2002 23:09:01 UTC