Re: Proposed changes to UTF-8 draft from Keld Jørn Simonsen on 2003-01-13 (ietf-charsets@w3.org from January to March 2003)

From: Keld Jørn Simonsen <keld@dkuug.dk>
Date: Mon, 13 Jan 2003 18:54:55 +0100
To: Francois Yergeau <FYergeau@alis.com>
Cc: ietf-charsets@iana.org
Message-id: <20030113175455.GA2869@rap.rap.dk>

On Mon, Jan 13, 2003 at 10:46:04AM -0500, Francois Yergeau wrote:
> Keld Jørn Simonsen wrote:
> > It is because UTF-8 in the ISO 10646 definition only encodes characters
> > defined in 10646. And "surrogates" are not characters. So they "do not
> > occur" in UTF-8. 
> 
> Yes, you're just repeating what the Note in Annex D says.  It's not wrong.
> It's just insufficient: it's a Note (non-normative) and it does not forbid
> (or even warn against) interpreting encoded surrogates.  Or overlong
> sequences.  There is a section that describes certain error cases, but it
> misses those two, thereby implying that they might not be errors.  The
> Unicode 3.2 text is just much tighter (at long last!) and therefore should
> be chosen.

That is not how I read it. The note explains to the reader what is already
obvious from the architecture: that you cannot encode surrogates in UTF-8.
It is true, however, that it does not warn against overlong sequences.
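
To make the two disputed cases concrete, here is a minimal sketch in Python.
It is not taken from either standard; the helper names naive_decode and
strict_check are hypothetical, and the checks assume the tighter rules that
the Unicode 3.2 text is said to impose. A decoder that only extracts payload
bits will happily "interpret" both an encoded surrogate and an overlong
sequence, which is why an explicit prohibition matters.

```python
# Illustrative sketch only; naive_decode and strict_check are made-up names.

def naive_decode(seq: bytes) -> int:
    """Extract the payload bits of one UTF-8 sequence with no validity checks."""
    first = seq[0]
    if first < 0x80:
        return first                  # single-byte (ASCII) form
    if first >> 5 == 0b110:
        cp = first & 0x1F             # lead byte of a 2-byte form
    elif first >> 4 == 0b1110:
        cp = first & 0x0F             # lead byte of a 3-byte form
    elif first >> 3 == 0b11110:
        cp = first & 0x07             # lead byte of a 4-byte form
    else:
        raise ValueError("not a valid lead byte")
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)   # append 6 payload bits per trail byte
    return cp

# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}

def strict_check(seq: bytes) -> str:
    """Classify one sequence under the assumed stricter rules."""
    cp = naive_decode(seq)
    if cp < MIN_FOR_LENGTH[len(seq)]:
        return "overlong"             # a shorter form exists for this value
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"            # surrogate code points are not characters
    return "ok"

ENCODED_SURROGATE = b"\xed\xa0\x80"   # U+D800 forced through the 3-byte pattern
OVERLONG_SLASH = b"\xc0\xaf"          # 2-byte overlong form of U+002F '/'

print(hex(naive_decode(ENCODED_SURROGATE)), strict_check(ENCODED_SURROGATE))
print(hex(naive_decode(OVERLONG_SLASH)), strict_check(OVERLONG_SLASH))
```

For comparison, Python 3's built-in strict codec rejects both inputs:
bytes.decode("utf-8") raises UnicodeDecodeError for each of them.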

Kind regards
keld

Received on Monday, 13 January 2003 12:56:04 UTC