Re: UTC Agenda Item: Recommendations for handling ill-formed sequences from Mark Davis on 2008-04-11 (public-html@w3.org from April 2008)

From: Mark Davis <mark.davis@icu-project.org>
Date: Fri, 11 Apr 2008 16:55:19 -0700
To: "John Cowan" <cowan@ccil.org>
Cc: UTC <unicore@unicode.org>, public-html@w3.org
Message-ID: <30b660a20804111655y1a455873n3c41c9f9e1116bbe@mail.gmail.com>

The approach for UTF-8 is really just a specialization of a general approach
for arbitrary character encodings:

   - try to take as many valid bytes as you can (according to the
   validity rules for the encoding)
   - stop just before any byte that you can't (validly) take
   - but take at least one byte
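
For UTF-8, the three rules above can be sketched roughly as follows. This is only an illustration, not code from any existing library: the names decode_utf8_lenient and _trail_ranges are made up for this example, and the "validity rules" are assumed to be the well-formed byte ranges from the Unicode Standard's UTF-8 table. A different reading of "valid" would change how many replacement characters some inputs produce.

```python
REPLACEMENT = "\ufffd"
CONT = (0x80, 0xBF)  # ordinary continuation-byte range


def _trail_ranges(lead):
    """Allowed range for each trailing byte after `lead`.

    Returns None if `lead` cannot start a multi-byte sequence.
    Ranges are assumed from the Unicode well-formed UTF-8 table.
    """
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]        # excludes overlong forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]        # excludes surrogates
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]  # excludes overlong forms
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # excludes > U+10FFFF
    return None


def decode_utf8_lenient(data: bytes) -> str:
    """Decode bytes, emitting one U+FFFD per ill-formed chunk."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                    # ASCII: always valid
            out.append(chr(lead))
            i += 1
            continue
        ranges = _trail_ranges(lead)
        if ranges is None:                 # cannot start a sequence:
            out.append(REPLACEMENT)        # take at least one byte
            i += 1
            continue
        j = i + 1
        for lo, hi in ranges:              # take as many valid bytes as possible
            if j < n and lo <= data[j] <= hi:
                j += 1
            else:
                break                      # stop just before the bad byte
        if j - i == len(ranges) + 1:       # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                              # truncated or ill-formed prefix
            out.append(REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch, decode_utf8_lenient(b"a\xe2\x82b") consumes the two valid bytes E2 82 as one ill-formed chunk, stops before "b", and emits a single U+FFFD between "a" and "b".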

I don't think it is worth any extra code or processing in UTF-8, for
example, to determine that you have a pair of valid surrogates so that you
can emit a single U+FFFD instead of two. What does it buy us?

Mark

On Fri, Apr 11, 2008 at 4:12 PM, John Cowan <cowan@ccil.org> wrote:

> Øistein E. Andersen (quoted by Mark Davis) scripsit:
>
> > One notable difference is that overlong sequences as well as UTF-8
> > sequences representing surrogates and characters outside Unicode
> > (>10FFFF) will typically map to several replacement characters according
> > to your proposal, but to only one in Markus Kuhn's system
>
> I agree that overlong sequences, surrogates, and old-10646 sequences
> should become a single FFFD.
>
> --
> The first thing you learn in a lawin' family    John Cowan
> is that there ain't no definite answers         cowan@ccil.org
> to anything.  --Calpurnia in To Kill A Mockingbird
>
>

Received on Friday, 11 April 2008 23:55:52 UTC