RE: Unicode error handling (was several messages about handling encodings in HTML) from Brian Smith on 2008-02-29 (public-html@w3.org from February 2008)

From: Brian Smith <brian@briansmith.org>
Date: Fri, 29 Feb 2008 06:00:02 -0800
To: "'HTML WG'" <public-html@w3.org>
Message-ID: <004001c87adb$613c08d0$6401a8c0@T60>

> On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> > >> 
> > >>> Bytes or sequences of bytes in the original byte stream 
> > >>> that could not be converted to Unicode characters must be 
> > >>> converted to U+FFFD REPLACEMENT CHARACTER code points.
> >
> > Unicode 5.0 remains vague on this point. (E.g., definition 
> > D92 defines well-formed and ill-formed UTF-8 byte
> > sequences, but conformance requirement C10 only requires
> > ill-formed sequences to be treated as an error condition and
> > suggests that a one-byte ill-formed 
> > sequence may be either filtered out or replaced by a U+FFFD
> > replacement character.) More generally, character encoding
> > specifications can hardly be expected to define proper error
> > handling, since they are usually not terribly preoccupied
> > with mislabelled data.
> 
> They should define error handling, and are defective if they don't. 
> However, I agree that many specs are defective. This is certainly not 
> limited to character encoding specifications.

Unicode does define the error handling explicitly. An implementation must handle an ill-formed sequence by "signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character." The second option (silently discarding bad data) is a poor choice, but requiring every implementation to perform U+FFFD substitution is too much of a burden. Many deployed UTF-8 decoders do not do substitution, and on some platforms it is not possible to implement a new UTF-8 decoder that is as efficient as the built-in one.
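
As a rough illustration of those three options, here is a minimal sketch in Python. It is only an example: the function names and the sample bytes are made up for this message, and it relies on the standard "strict", "ignore", and "replace" error handlers of Python's built-in UTF-8 codec rather than on anything the Unicode standard itself specifies.

```python
# Hypothetical sample input: 0xFF can never appear in well-formed UTF-8,
# so this is one ill-formed byte between two ASCII letters.
SAMPLE = b"a\xffb"


def decode_signal_error(raw: bytes) -> str:
    """Option 1: signal an error (raises UnicodeDecodeError on ill-formed input)."""
    return raw.decode("utf-8", errors="strict")


def decode_filter_out(raw: bytes) -> str:
    """Option 2: silently drop the ill-formed bytes."""
    return raw.decode("utf-8", errors="ignore")


def decode_with_marker(raw: bytes) -> str:
    """Option 3: represent ill-formed bytes with U+FFFD REPLACEMENT CHARACTER."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_signal_error(SAMPLE)
    except UnicodeDecodeError as exc:
        print("error signaled:", exc.reason)
    print(repr(decode_filter_out(SAMPLE)))   # bad byte removed
    print(repr(decode_with_marker(SAMPLE)))  # bad byte replaced by U+FFFD
```

Note that with the second option the reader of the output cannot tell that anything was lost, which is why I consider it the worst of the three.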

- Brian

Received on Friday, 29 February 2008 14:00:12 UTC