- From: Mark Davis <mark.davis@icu-project.org>
- Date: Fri, 11 Apr 2008 16:55:19 -0700
- To: "John Cowan" <cowan@ccil.org>
- Cc: UTC <unicore@unicode.org>, public-html@w3.org
Received on Friday, 11 April 2008 23:55:52 UTC
The approach for UTF-8 is really just a specialization of a general approach for arbitrary character encodings:

- try to take as many valid bytes as you can (according to the validity rules for the encoding)
- stop just before any byte that you can't (validly) take
- but take at least one byte

I don't think it is worth any extra code or processing in UTF-8, for example, to determine that you have a pair of valid surrogates so that you can emit a single U+FFFD instead of two. What does it buy us?

Mark

On Fri, Apr 11, 2008 at 4:12 PM, John Cowan <cowan@ccil.org> wrote:

> Øistein E. Andersen (quoted by Mark Davis) scripsit:
>
> > One notable difference is that overlong sequences as well as UTF-8
> > sequences representing surrogates and characters outside Unicode
> > (>10FFFF) will typically map to several replacement characters according
> > to your proposal, but to only one in Markus Kuhn's system
>
> I agree that overlong sequences, surrogates, and old-10646 sequences
> should become a single FFFD.
>
> --
> The first thing you learn in a lawin' family      John Cowan
> is that there ain't no definite answers           cowan@ccil.org
> to anything.  --Calpurnia in To Kill A Mockingbird
>
>

--
Mark
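
Editorial illustration (not part of the original message): the three rules listed in the message can be sketched in Python roughly as follows. The function names `decode_utf8_lenient` and `_second_byte_range` are hypothetical, and the sketch assumes that "valid" means the well-formed byte ranges of UTF-8 as defined by the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF). A different set of validity rules would change how many U+FFFD are emitted for a given input.

```python
REPLACEMENT = "\ufffd"


def _second_byte_range(lead):
    """Allowed range for the byte right after the lead byte (assumed rules)."""
    if lead == 0xE0:
        return 0xA0, 0xBF  # excludes overlong three-byte forms
    if lead == 0xED:
        return 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
    if lead == 0xF0:
        return 0x90, 0xBF  # excludes overlong four-byte forms
    if lead == 0xF4:
        return 0x80, 0x8F  # excludes code points above U+10FFFF
    return 0x80, 0xBF


def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per rejected run of bytes."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            # Not a valid lead byte: "take at least one byte".
            out.append(REPLACEMENT)
            i += 1
            continue
        lo, hi = _second_byte_range(lead)
        code_point = lead & (0x7F >> (need + 1))
        j, taken = i + 1, 0
        # Take as many valid bytes as possible; stop just before
        # the first byte that cannot validly continue the sequence.
        while taken < need and j < n and lo <= data[j] <= hi:
            code_point = (code_point << 6) | (data[j] & 0x3F)
            j += 1
            taken += 1
            lo, hi = 0x80, 0xBF  # later trail bytes use the generic range
        out.append(chr(code_point) if taken == need else REPLACEMENT)
        i = j  # j >= i + 1, so the loop always makes progress
    return "".join(out)
```

Under these assumed rules, a truncated sequence such as `b"\xf0\x90\x80"` becomes a single U+FFFD, while the overlong form `b"\xc0\x80"` becomes two and an encoded surrogate such as `b"\xed\xa0\x80"` becomes three, since each of those bytes is rejected on its own. Collapsing them into one U+FFFD would need the extra lookahead that the message argues is not worth the code.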