Re: Unicode error handling (was several messages about handling encodings in HTML)

On Mon, 07 Apr 2008 19:49:50 +0200, Øistein E. Andersen  
<html5@øistein.com> wrote:
> Unicode 5.1 properly defines ill-formed subsequences and makes
> it clear(er) that these shall never impede correct interpretation
> of adjacent, well-formed UTF-8 byte sequences.
>
> Unfortunately, however, no guidance is given as to how many
> replacement characters should be emitted for a multi-byte
> ill-formed subsequence (not even that the number should not
> exceed the number of bytes, but this is clearly intended).
> I do realise, of course, that it may be problematic to make
> this a conformance criterion, but it might be useful if a
> future version of the standard could at least provide a
> suggestion for new implementations.

Isn't this comment better aimed at the Unicode guys? I agree that it would
be ideal if, for a given 'charset' and 'byte stream' as input, the output
'character stream' were always identical regardless of which implementation
you pick, but the specification does not seem to have been developed with
that in mind.
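
To make the ambiguity concrete, here is a minimal sketch in Python (chosen
only for illustration; the helper name count_replacements and the sample
bytes are my own assumptions, not anything from the standard). It decodes a
truncated three-byte sequence surrounded by well-formed ASCII and counts the
replacement characters. How many U+FFFD come out depends on the decoder; the
only things that can be relied on are that the neighbouring characters
survive and that the count does not exceed the number of ill-formed bytes.

```python
# Sketch: count the U+FFFD a decoder emits for an ill-formed subsequence.
REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with replacement and count the U+FFFD emitted."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# 0xE2 0x82 is a truncated three-byte sequence (the start of U+20AC),
# placed between two well-formed ASCII bytes.
sample = b"a\xe2\x82b"
decoded = sample.decode("utf-8", errors="replace")

# The adjacent well-formed characters must be preserved either way.
print(repr(decoded))
# One implementation may report 1 here, another 2 (one per byte).
print(count_replacements(sample))
```

Running the same bytes through two different decoders and comparing the
counts is a quick way to see whether they agree.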


-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Received on Wednesday, 9 April 2008 12:04:54 UTC