[whatwg] 9.2.2: replacement characters. How many? from Øistein E. Andersen on 2007-06-23 (public-whatwg-archive@w3.org from June 2007)

From: Øistein E. Andersen <html5@xn--istein-9xa.com>
Date: Sat, 23 Jun 2007 02:27:30 +0200
Message-ID: <E1I1tTO-0008J1-4e@node1-6.ouvaton.local>

Ian Hickson wrote:

> On Fri, 3 Nov 2006, Elliotte Harold wrote:
>
>> Section 9.2.2 of the current Web Apps 1.0 draft states:
>> 
>>> Bytes or sequences of bytes in the original byte stream that could not 
>>> be converted to Unicode characters must be converted to U+FFFD 
>>> REPLACEMENT CHARACTER code points.
>> 
>> [This does not specify the exact number of replacement characters.]
>
> I don't really know how to define this.
> I'd like to say that it's up to the encoding specifications
> to define it. Any suggestions?

Unicode 5.0 remains vague on this point. (E.g., definition D92
defines well-formed and ill-formed UTF-8 byte sequences, but
conformance requirement C10 only requires ill-formed sequences
to be treated as an error condition and suggests that a one-byte
ill-formed sequence may be either filtered out or replaced by
a U+FFFD replacement character.) More generally, character
encoding specifications can hardly be expected to define proper
error handling, since they are usually not terribly preoccupied
with mislabelled data.

Henri Sivonen has pointed out that a strict requirement on the
number of replacement characters generated may cause
unnecessary incompatibilities with current browsers and extant
tools.

The current text may nevertheless be too liberal. It would
notably be possible to construct an arbitrarily long Chinese
text in a legacy encoding which -- according to the spec -- could
be replaced by one single U+FFFD replacement character if
incorrectly handled as UTF-8, which might lead the user to
think that the page is completely uninteresting and therefore
move on, whereas a larger number of replacement characters
would have led him to try another encoding. (This is only a
problem, of course, if an implementor chooses to emit the
minimal number of replacement characters sanctioned by the spec.)
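
As a rough illustration of this scenario, here is a sketch that assumes
Python 3 and its built-in "gbk" and "utf-8" codecs. The sample text and
variable names are invented for the example; it shows what one particular
decoder does, not what the spec requires.

```python
# Chinese text stored in a legacy encoding, then wrongly decoded as UTF-8.
text = "中文" * 1000              # stand-in for an arbitrarily long page
raw = text.encode("gbk")          # two bytes per character for this sample

# errors="replace" substitutes U+FFFD for whatever the decoder rejects.
decoded = raw.decode("utf-8", errors="replace")
n_replacements = decoded.count("\ufffd")

print(len(raw), n_replacements)
```

A decoder taking the most liberal reading of the current text could instead
return a single U+FFFD for the whole input, which is the case described above.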

The current upper bound (number of bytes replaced) seems
intuitive and completely harmless.

A meaningful lower bound is less obvious, at least
if we want to give some leeway to different implementations.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
details an approach for UTF-8 that basically emits a replacement
character and removes read bytes from the buffer each time a
minimal malformed byte sequence has been detected. Safari,
Opera and Firefox all mostly follow this, whereas IE7 usually
emits one replacement character per replaced byte. (Interesting
cases include byte sequences that encode noncharacters such as
U+FFFF modulo U+10000, or values exceeding U+10FFFF.)
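
The two behaviours can be compared with a small sketch, assuming CPython 3.
Its built-in "replace" handler emits one U+FFFD per rejected span, which is
close to (but not necessarily identical with) the approach in the test file.
The handler name "replace_each_byte" is hypothetical and only approximates
the one-per-byte behaviour attributed to IE7.

```python
import codecs

def replace_each_byte(err):
    # Hypothetical handler: one U+FFFD for every byte in the rejected span.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return ("\ufffd" * (err.end - err.start), err.end)

codecs.register_error("replace_each_byte", replace_each_byte)

# Truncated three-byte sequence (E4 B8 without its third byte), then 'A'.
data = b"\xe4\xb8A"

per_sequence = data.decode("utf-8", errors="replace")
per_byte = data.decode("utf-8", errors="replace_each_byte")

print(per_sequence.count("\ufffd"), per_byte.count("\ufffd"))
```

The per-byte count is the upper bound mentioned earlier; the per-sequence
count is one candidate for a lower bound.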

It should be relatively simple to define something like this
for any multi-byte encoding, but perhaps less straightforward
for encodings using escape sequences to switch between different
alphabets or other more exotic encodings -- if we have to worry
about those.

-- 
Øistein E. Andersen

Received on Friday, 22 June 2007 17:27:30 UTC