- From: Brian Smith <brian@briansmith.org>
- Date: Fri, 29 Feb 2008 06:00:02 -0800
- To: "'HTML WG'" <public-html@w3.org>
> On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> >
> > >>> Bytes or sequences of bytes in the original byte stream
> > >>> that could not be converted to Unicode characters must be
> > >>> converted to U+FFFD REPLACEMENT CHARACTER code points.
> >
> > Unicode 5.0 remains vague on this point. (E.g., definition
> > D92 defines well-formed and ill-formed UTF-8 byte
> > sequences, but conformance requirement C10 only requires
> > ill-formed sequences to be treated as an error condition and
> > suggests that a one-byte ill-formed
> > sequence may be either filtered out or replaced by a U+FFFD
> > replacement character.) More generally, character encoding
> > specifications can hardly be expected to define proper error
> > handling, since they are usually not terribly preoccupied
> > with mislabelled data.
>
> They should define error handling, and are defective if they don't.
> However, I agree that many specs are defective. This is certainly not
> limited to character encoding specifications.

Unicode does define the error handling explicitly. An implementation
must handle an ill-formed sequence by "signaling an error, filtering
the code unit out, or representing the code unit with a marker such as
U+FFFD replacement character."

The second option (silently discarding bad data) is bad, but requiring
all implementations to perform U+FFFD substitution is too much of a
burden. A lot of deployed UTF-8 decoders do not do substitution, and on
some platforms it is not possible to implement a new UTF-8 decoder
efficiently (as efficiently as the built-in one, at least).

- Brian
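
For illustration, the three options quoted above (signal an error, filter the code unit out, or substitute U+FFFD) can be sketched with Python's built-in UTF-8 decoder. This is only an assumed stand-in: the thread does not name any particular decoder, and the sample bytes are made up for the example.

```python
# Illustration only: Python's built-in decoder, not any decoder discussed in the thread.
data = b"abc\xffdef"  # hypothetical input; 0xFF never occurs in well-formed UTF-8

# Option 1: signal an error.
try:
    data.decode("utf-8", errors="strict")
    signalled = False
except UnicodeDecodeError:
    signalled = True

# Option 2: filter the offending code unit out (bad data is silently dropped).
filtered = data.decode("utf-8", errors="ignore")

# Option 3: represent the offending code unit with U+FFFD REPLACEMENT CHARACTER.
replaced = data.decode("utf-8", errors="replace")

print(signalled, repr(filtered), repr(replaced))
```

With filtering, nothing in the output shows that a byte was lost, which is why that option is the risky one; substitution leaves a visible marker at the point of damage.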
Received on Friday, 29 February 2008 14:00:12 UTC