Re: UTC Agenda Item: Recommendations for handling ill-formed sequences

On Friday, 11th April 2008, Mark Davis wrote:
> I don't think it is worth any extra code or processing in UTF-8, for
> example, to determine that you have a pair of valid surrogates so that you
> can emit a single U+FFFD instead of two.

My comments on this should not be taken as arguments either way, but
I think it is worth pointing out that Markus Kuhn's approach gives one
U+FFFD character per surrogate character, i.e., two for a valid surrogate
pair, and I assume this is what John Cowan was referring to as well.
Differentiating between valid and invalid surrogate pairs does indeed
seem unnecessary.

As the code below illustrates (assuming that err(*) outputs one U+FFFD
character), first decoding anything that looks vaguely like a UTF-8
sequence, and only then checking for overlong representations, surrogates
and values above U+10FFFF, yields fewer replacement characters (one for
each complete but invalid sequence) without any additional code.

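  /* err(x) is assumed to output one U+FFFD; out(u) outputs code point u. */
  int c, N = 1, missing = 0;  /* missing: continuation bytes still expected */
  long long u = 0;
  while ((c = getchar()) != EOF) {
    /* n = number of leading 1 bits in the byte; then strip them from c */
    int n = 0; while ((c << n) & 128) n++;
    c = (unsigned char) (c << n) >> n;

    switch (n) {
    case 1:   /* continuation byte */
      if (missing > 0) { u = (u << 6) + c; missing--; }
      else { err(cont_byte); missing = -1; }
      break;
    default:  /* ASCII (n == 0) or lead byte of an n-byte sequence */
      if (missing > 0) err(incomplete);
      u = c; N = n ? n : 1; missing = N-1;
    }

    if (!missing) {
      /* overlong: 2 bytes need u >= 128; N > 2 bytes need more than
         5*N - 4 bits */
      if (N > 1 && (u < 128 || !(u >> (5*N - 4)))) err(overlong);
      else if (0xD800 <= u && u <= 0xDFFF) err(surrogate);
      else if (u > 0x10FFFF) err(transastral);
      else out(u);
    }
  }
  if (missing > 0) err(eof);  /* input ended in the middle of a sequence */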
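
For anyone who wants to check the counts without wiring up err() and out(),
here is a minimal sketch of the same loop reading from a byte buffer instead
of stdin. The function name count_replacements and its decoded parameter are
hypothetical and exist only for this illustration; every err(*) call is
replaced by a counter, and every out(u) call by a second counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the fragment above: returns how many
   U+FFFD characters that loop would emit for buf[0..len-1]. The number of
   code points that would have gone to out() is stored in *decoded. */
static int count_replacements(const unsigned char *buf, size_t len,
                              int *decoded)
{
    int N = 1, missing = 0, errors = 0, good = 0;
    unsigned long long u = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        int c = buf[i], n = 0;
        while ((c << n) & 128) n++;           /* count leading 1 bits */
        c = (unsigned char)(c << n) >> n;     /* keep only the payload bits */

        if (n == 1) {                         /* continuation byte */
            if (missing > 0) { u = (u << 6) + c; missing--; }
            else { errors++; missing = -1; }  /* stray continuation byte */
        } else {                              /* ASCII or lead byte */
            if (missing > 0) errors++;        /* previous sequence cut short */
            u = c; N = n ? n : 1; missing = N - 1;
        }

        if (missing == 0) {                   /* a sequence just completed */
            if (N > 1 && (u < 128 || !(u >> (5 * N - 4))))
                errors++;                     /* overlong */
            else if (0xD800 <= u && u <= 0xDFFF)
                errors++;                     /* surrogate */
            else if (u > 0x10FFFF)
                errors++;                     /* beyond U+10FFFF */
            else
                good++;
        }
    }
    if (missing > 0) errors++;                /* input ended mid-sequence */
    if (decoded) *decoded = good;
    return errors;
}
```

With this sketch, the six bytes ED A0 80 ED B0 80 (a surrogate pair encoded
as two three-byte sequences) give 2, while the overlong C0 80 and the
out-of-range F4 90 80 80 give 1 each.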

-- 
Øistein E. Andersen

Received on Monday, 14 April 2008 09:29:50 UTC