RE: Unicode error handling (was several messages about handling encodings in HTML)

Brian Smith wrote:

> Unicode does define the error handling explicitly. An
> implementation must handle an ill-formed sequence by
> "signaling an error, filtering the code unit out, or
> representing the code unit with a marker such as U+FFFD
> replacement character."

I assume you are quoting from C10 on p. 73 of Unicode
5.0.  The quote looks a bit different in context:

    For example, in UTF-8 every code unit of the form
    110xxxxx_2 /must/ be followed by a code unit of
    the form 10xxxxxx_2.  A sequence such as 110xxxxx_2
    0xxxxxxx_2 is ill-formed and must never be
    generated.  When faced with this ill-formed code
    unit sequence while transforming or interpreting
    text, a conformant process must treat the first
    code unit 110xxxxx_2 as an illegally terminated
    code unit sequence---for example, by signalling
    an error, filtering the code unit out, or
    representing the code unit with a marker such
    as U+FFFD REPLACEMENT CHARACTER.

Most notably, ``the code unit'' in this passage does not
refer to just any code unit that is part of an ill-formed
sequence.  At most, it refers to a code unit which cannot
occur in isolation (and therefore constitutes an ill-formed
sequence by itself) and which is not immediately adjacent
to other ill-formed sequences.  How to handle other types
of ill-formed sequences cannot really be inferred from
the example.

My interpretation of the text quoted above is that a
``conformant process'' has the following options:

    1) Signal an error.
    2) Silently discard the ill-formed sequence.
    3) Replace the ill-formed sequence by a number
       of U+FFFD replacement characters.  The example
       given is compatible with many different ways
       of determining the exact number of replacement
       characters, including one for the entire sequence,
       one per code unit (octet) and one per ``malformed
       sequence'' as defined in [1].
    4) Perform some other kind of error handling.
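
As a rough illustration of options 1 to 3, here is a minimal Python
sketch using the error handlers built into Python's UTF-8 codec.  The
choice of Python and the variable names are mine; the input is the
sequence from the C10 example (an octet of the form 110xxxxx followed
by one of the form 0xxxxxxx).

```python
# The ill-formed sequence from the C10 example: a lead byte of the form
# 110xxxxx (0xC3) followed by a byte of the form 0xxxxxxx (0x41, "A").
data = b"\xc3A"

# Option 1: signal an error (the default "strict" handler).
try:
    data.decode("utf-8")
    signalled = False
except UnicodeDecodeError:
    signalled = True

# Option 2: silently filter the offending code unit out.
filtered = data.decode("utf-8", errors="ignore")

# Option 3: represent the offending code unit with U+FFFD.
replaced = data.decode("utf-8", errors="replace")

print(signalled, ascii(filtered), ascii(replaced))
```

In every case the trailing "A" survives; only the illegally
terminated lead byte is affected.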

[1] <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>


> The second option (silently discarding bad data) is bad,
> but requiring all implementations to do any U+FFFD
> substitution is too much of a burden.

I take your word for it, but perhaps the Unicode standard
should at least mention that option 2 (and perhaps option 4)
is better avoided?

I would also like the number of replacement characters
generated under option 3 to be defined somehow.  The
current text in C10 is rather too vague if it is really
meant to do that.
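
To make the ambiguity concrete, here is a small Python sketch that
decodes the same truncated sequence under two counting policies.  The
handler name "fffd-per-unit" and the function replace_per_code_unit
are made up for this example, and the sketch assumes a recent CPython
3, whose decoder (as far as I can tell) reports the truncated prefix
E2 82 as a single error.

```python
import codecs

REPLACEMENT = "\ufffd"

def replace_per_code_unit(exc):
    """Hypothetical handler: one U+FFFD per offending code unit (octet)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.start and exc.end delimit the octets the decoder rejected;
    # decoding resumes at exc.end.
    return (REPLACEMENT * (exc.end - exc.start), exc.end)

# "fffd-per-unit" is a name invented for this sketch.
codecs.register_error("fffd-per-unit", replace_per_code_unit)

# First two octets of the three-octet sequence E2 82 AC (U+20AC),
# cut short by an ASCII "A".
data = b"\xe2\x82A"

# Built-in handler: one U+FFFD per error reported by the decoder.
per_sequence = data.decode("utf-8", errors="replace")
# Custom handler: one U+FFFD per rejected octet.
per_unit = data.decode("utf-8", errors="fffd-per-unit")

print(per_sequence.count(REPLACEMENT), per_unit.count(REPLACEMENT))
```

The two results differ in the number of replacement characters, yet
both seem compatible with the example given in C10.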

(Further discussion on this topic should probably
be moved to a Unicode-specific forum.)

-- 
Øistein E. Andersen

Received on Saturday, 1 March 2008 01:42:14 UTC