RE: Unicode error handling (was several messages about handling encodings in HTML)

Unicode 5.1 properly defines ill-formed subsequences and makes
it clear(er) that these shall never impede correct interpretation
of adjacent, well-formed UTF-8 byte sequences.

Unfortunately, no guidance is given as to how many
replacement characters should be emitted for a multi-byte
ill-formed subsequence (it is not even stated that the number
should not exceed the number of bytes, although this is
clearly intended).
I do realise, of course, that it may be problematic to make
this a conformance criterion, but it might be useful if a
future version of the standard could at least provide a
suggestion for new implementations.

(Relevant quote and link may be found below.)

Øistein E. Andersen

Although a UTF-8 conversion process is required to never
consume well-formed subsequences as part of its error handling
for ill-formed subsequences, such a process is not otherwise
constrained in how it deals with any ill-formed subsequence
itself. An ill-formed subsequence consisting of more than one
code unit could be treated as a single error or as multiple
errors. For example, in processing the UTF-8 code unit sequence
<F0 80 80 41>, the only requirement on a converter is that the
<41> be processed and correctly interpreted as <U+0041>. The
converter could return <U+FFFD, U+0041>, handling <F0 80 80> as
a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling
each byte of <F0 80 80> as a separate error, or could take other
approaches to signalling <F0 80 80> as an ill-formed code unit
subsequence.
                      Unicode 5.1.0
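
To make the two outcomes in the quoted passage concrete, here is a
minimal Python sketch. The names decode_one and decode_lenient and
the per_byte flag are hypothetical and do not come from the standard
or from any library. The "single error" policy is interpreted here
as one U+FFFD per run of consecutive ill-formed bytes, which is only
one possible reading. Under both policies, a well-formed sequence is
never consumed as part of an error, and the number of replacement
characters never exceeds the number of ill-formed bytes.

```python
REPLACEMENT = "\ufffd"


def decode_one(data, i):
    """Return (char, length) if a well-formed UTF-8 sequence starts at i, else None."""
    for n in range(1, 5):  # UTF-8 sequences are 1 to 4 bytes long
        try:
            text = data[i:i + n].decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return text, n
    return None


def decode_lenient(data, per_byte=True):
    """Decode UTF-8, replacing ill-formed bytes with U+FFFD.

    per_byte=True:  one U+FFFD for every ill-formed byte.
    per_byte=False: one U+FFFD for every run of consecutive
                    ill-formed bytes (an assumed reading of
                    "a single error").
    """
    out = []
    i = 0
    in_error = False
    while i < len(data):
        hit = decode_one(data, i)
        if hit is None:
            # Byte i does not start a well-formed sequence.
            if per_byte or not in_error:
                out.append(REPLACEMENT)
            in_error = True
            i += 1
        else:
            char, n = hit
            out.append(char)
            in_error = False
            i += n
    return "".join(out)


sample = bytes([0xF0, 0x80, 0x80, 0x41])
print([hex(ord(c)) for c in decode_lenient(sample, per_byte=True)])
print([hex(ord(c)) for c in decode_lenient(sample, per_byte=False)])
```

For <F0 80 80 41>, the per-byte policy yields <U+FFFD, U+FFFD,
U+FFFD, U+0041> and the per-run policy yields <U+FFFD, U+0041>; in
both cases the trailing <41> is interpreted as <U+0041>.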

Received on Monday, 7 April 2008 19:20:13 UTC