
Re: UTC Agenda Item: Recommendations for handling ill-formed sequences

From: Mark Davis <mark.davis@icu-project.org>
Date: Fri, 11 Apr 2008 16:55:19 -0700
Message-ID: <30b660a20804111655y1a455873n3c41c9f9e1116bbe@mail.gmail.com>
To: "John Cowan" <cowan@ccil.org>
Cc: UTC <unicore@unicode.org>, public-html@w3.org
The approach for UTF-8 is really just a specialization of a general approach
for arbitrary character encodings:

   - try to take as many valid bytes as you can (according to the
   validity rules for the encoding)
   - stop just before any byte that you can't (validly) take
   - but take at least one byte
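
For UTF-8, the three rules above can be sketched roughly as follows. This is only an illustration, not code from any particular library: the names `_tail_ranges` and `decode_utf8_lenient` are made up for the example, and the byte ranges are assumed to follow the Unicode Standard's table of well-formed UTF-8 byte sequences. Each run of bytes that cannot be completed into a valid sequence becomes one U+FFFD.

```python
REPLACEMENT = "\ufffd"
CONT = (0x80, 0xBF)  # ordinary continuation-byte range


def _tail_ranges(lead):
    """Allowed range for each trailing byte after `lead`, or None if
    `lead` can never start a well-formed multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]          # excludes overlong forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]          # excludes surrogates
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]    # excludes overlong forms
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]    # excludes > U+10FFFF
    return None


def decode_utf8_lenient(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII: always valid
            out.append(chr(lead))
            i += 1
            continue
        ranges = _tail_ranges(lead)
        j = i + 1                            # take at least one byte
        if ranges is not None:
            # Take as many valid bytes as possible, stopping just
            # before the first byte that cannot validly follow.
            for lo, hi in ranges:
                if j < n and lo <= data[j] <= hi:
                    j += 1
                else:
                    break
            if j - i == 1 + len(ranges):     # complete, well-formed
                out.append(data[i:j].decode("utf-8"))
                i = j
                continue
        out.append(REPLACEMENT)              # one U+FFFD per bad chunk
        i = j
    return "".join(out)
```

With this sketch a truncated sequence such as E2 82 followed by "A" yields one U+FFFD and then "A", while the three bytes ED A0 80 (a UTF-8-encoded surrogate) yield three U+FFFD, because ED cannot validly be followed by A0 and each remaining byte is then taken on its own.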

I don't think it is worth any extra code or processing in UTF-8, for
example, to determine that you have a pair of valid surrogates so that you
can emit a single U+FFFD instead of two. What does it buy us?

Mark

On Fri, Apr 11, 2008 at 4:12 PM, John Cowan <cowan@ccil.org> wrote:

> Øistein E. Andersen (quoted by Mark Davis) scripsit:
>
> > One notable difference is that overlong sequences as well as UTF-8
> > sequences representing surrogates and characters outside Unicode
> > (>10FFFF) will typically map to several replacement characters according
> > to your proposal, but to only one in Markus Kuhn's system
>
> I agree that overlong sequences, surrogates, and old-10646 sequences
> should become a single FFFD.
>
> --
> The first thing you learn in a lawin' family    John Cowan
> is that there ain't no definite answers         cowan@ccil.org
> to anything.  --Calpurnia in To Kill A Mockingbird
>
>


Received on Friday, 11 April 2008 23:55:52 GMT
