- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Fri, 29 Feb 2008 20:56:31 +0100
- To: public-html@w3.org

Brian Smith wrote:

> Unicode does define the error handling explicitly. An
> implementation must handle an ill-formed sequence by
> "signaling an error, filtering the code unit out, or
> representing the code unit with a marker such as U+FFFD
> replacement character."

I assume you are quoting from C10 on p. 73 of Unicode 5.0. The quote looks a bit different in context:

    For example, in UTF-8 every code unit of the form 110xxxxx_2 /must/ be
    followed by a code unit of the form 10xxxxxx_2. A sequence such as
    110xxxxx_2 0xxxxxxx_2 is ill-formed and must never be generated. When
    faced with this ill-formed code unit sequence while transforming or
    interpreting text, a conformant process must treat the first code unit
    110xxxxx_2 as an illegally terminated code unit sequence---for example,
    by signalling an error, filtering the code unit out, or representing
    the code unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

Most notably, the term ``code unit'' does not necessarily refer to any code unit that is part of an ill-formed sequence, but at most to a code unit which cannot occur in isolation (and therefore constitutes an ill-formed sequence) and which is not immediately adjacent to other ill-formed sequences. Exactly how to handle other types of ill-formed sequences cannot really be inferred from the example.

My interpretation of the text quoted above is that a ``conformant process'' has the following options:

1) Signal an error.

2) Silently discard the ill-formed sequence.

3) Replace the ill-formed sequence by a number of U+FFFD replacement characters. The example given is compatible with many different ways of determining the exact number of replacement characters, including one for the entire sequence, one per code unit (octet) and one per ``malformed sequence'' as defined in [1].

4) Perform some other kind of error handling.
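
To make the difference between these options concrete, here is a minimal sketch in Python. It is only an illustration, not something taken from the Unicode standard: the function name decode_options, the handler name "fffd-per-octet" and the result keys are made up for this example, and the "replace" result reflects whatever ranges Python's own decoder chooses to report, which is just one of the many counting rules compatible with C10.

```python
import codecs
import re


def _per_octet(err):
    # Hypothetical error handler: emit one U+FFFD for every octet in the
    # range that the decoder reported as undecodable, then resume after it.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return ("\ufffd" * (err.end - err.start), err.end)


# "fffd-per-octet" is a name invented for this sketch.
codecs.register_error("fffd-per-octet", _per_octet)


def decode_options(data):
    """Decode `data` as UTF-8 under several error-handling policies."""
    results = {}
    # Option 1: signal an error.
    try:
        results["signal"] = data.decode("utf-8", "strict")
    except UnicodeDecodeError as exc:
        results["signal"] = "error at octet %d" % exc.start
    # Option 2: silently discard the ill-formed octets.
    results["discard"] = data.decode("utf-8", "ignore")
    # Option 3a: one U+FFFD per error range reported by Python's decoder.
    results["per_decoder_error"] = data.decode("utf-8", "replace")
    # Option 3b: one U+FFFD per ill-formed octet.
    results["per_octet"] = data.decode("utf-8", "fffd-per-octet")
    # Option 3c: one U+FFFD per run of adjacent ill-formed octets.
    # Simplification: a genuine U+FFFD next to such a run is merged too.
    results["per_run"] = re.sub("\ufffd+", "\ufffd", results["per_octet"])
    return results


if __name__ == "__main__":
    # 0xC3 is of the form 110xxxxx, 0x41 ("A") of the form 0xxxxxxx.
    print(decode_options(b"\xc3A"))
    # Truncated three-octet sequence (E2 82 without the final AC).
    print(decode_options(b"\xe2\x82A"))
```

For the sequence from the C10 example (110xxxxx_2 followed by 0xxxxxxx_2), all three replacement variants agree on a single U+FFFD. They only start to differ on longer ill-formed input such as the truncated three-octet sequence, which is exactly the case the example in C10 leaves open.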

[1] <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>

> The second option (silently discarding bad data) is bad,
> but requiring all implementations to do any U-FFFD
> substitution is too much of a burden.

I take your word for it, but perhaps the Unicode standard should at least mention that option(s) 2 (and 4) are better avoided? I would also like to see the number of replacement characters generated under option 3 to be defined somehow. The current text in C10 is rather too vague if it is really meant to do that.

(Further discussion on this topic should probably be moved to a Unicode-specific forum.)

-- 
Øistein E. Andersen

Received on Saturday, 1 March 2008 01:42:14 UTC