Re: UTC Agenda Item: Recommendations for handling ill-formed sequences

The approach for UTF-8 is really just a specialization of a general approach
for arbitrary character encodings:

   - try to take as many valid bytes as you can (according to the
   validity rules for the encoding)
   - stop just before any byte that you can't (validly) take
   - but take at least one byte
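
For concreteness, here is a minimal Python sketch of those three rules
specialized to UTF-8. The function and helper names are hypothetical, and the
byte ranges are my reading of the well-formed UTF-8 table in the Unicode
Standard; treat it as an illustration under those assumptions rather than a
reference implementation.

```python
REPLACEMENT = "\ufffd"

def _sequence_length(lead: int) -> int:
    """Expected length for a multi-byte lead byte, or 0 if it can never lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # C0, C1, F5..FF, or a stray continuation byte

def _second_byte_range(lead: int) -> tuple:
    """Valid range for the second byte; it is narrower after some leads."""
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong 4-byte forms
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes values above 10FFFF
    return (0x80, 0xBF)

def decode_with_replacement(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        need = _sequence_length(lead)
        if need == 0:
            # Cannot validly take this byte, but take at least one byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Take as many valid bytes as possible, stopping just before
        # the first byte that cannot validly follow.
        j = i + 1
        lo, hi = _second_byte_range(lead)
        while j < i + need and j < n and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF  # later bytes use the plain range
        if j == i + need:
            out.append(data[i:j].decode("utf-8"))
        else:
            out.append(REPLACEMENT)  # one U+FFFD per maximal valid prefix
        i = j
    return "".join(out)
```

With this sketch, a truncated sequence such as E2 82 yields one U+FFFD, while
the overlong form C0 AF yields two, because neither byte can be validly taken
as the start of a longer sequence.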

I don't think it is worth any extra code or processing in UTF-8, for
example, to determine that you have a pair of valid surrogates so that you
can emit a single U+FFFD instead of two. What does it buy us?

Mark

On Fri, Apr 11, 2008 at 4:12 PM, John Cowan <cowan@ccil.org> wrote:

> Øistein E. Andersen (quoted by Mark Davis) scripsit:
>
> > One notable difference is that overlong sequences as well as UTF-8
> > sequences representing surrogates and characters outside Unicode
> > (>10FFFF) will typically map to several replacement characters according
> > to your proposal, but to only one in Markus Kuhn's system
>
> I agree that overlong sequences, surrogates, and old-10646 sequences
> should become a single FFFD.
>
> --
> The first thing you learn in a lawin' family    John Cowan
> is that there ain't no definite answers         cowan@ccil.org
> to anything.  --Calpurnia in To Kill A Mockingbird
>
>



Received on Friday, 11 April 2008 23:55:52 UTC