Re: Error message for invalid UTF-8 overlong forms should be improved

Jukka K. Korpela wrote:
 
> "UTF-8 overlong form" is a misnomer.

STD 63 (RFC 3629) uses "overlong UTF-8 sequence" for C0 80,
stating, of course, that this is an error.  The older
RFC 2279 said "invalid" for the same example.

The "overlong" business can be interesting for smart error
handling: C0 80 should be reported as one error, not two,
while C1 3A can be reported as an error followed by a valid
UTF-8 U+003A.

Smart error handling could also minimize the reported
errors for surrogates and code points above plane 16: it
can silently skip all plausible trail bytes (80..BF) in an
invalid sequence starting with C0..FD.
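
A minimal sketch of such error handling in Python, to make
the idea concrete.  The function name scan_utf8 and the
("char", ...) / ("error", ...) report format are made up for
this illustration and are not taken from any validator.  It
assumes that a lead byte absorbs at most as many trail bytes
as it announces, and that a run of stray trail bytes counts
as one error:

```python
def scan_utf8(data: bytes):
    """Return a list of ("char", code_point) and ("error", raw_bytes) items.

    Hypothetical helper: each invalid sequence is reported once,
    together with the plausible trail bytes (80..BF) that follow it.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(("char", b))
            i += 1
            continue
        if b < 0xC0:
            # Stray trail bytes without a lead byte: one error per run
            # (an assumption of this sketch).
            j = i + 1
            while j < n and 0x80 <= data[j] <= 0xBF:
                j += 1
            out.append(("error", data[i:j]))
            i = j
            continue
        # Lead byte: how many trail bytes does it announce?
        if b >= 0xFE:
            need = 0   # FE and FF never occur in UTF-8
        elif b >= 0xFC:
            need = 5   # old 6-byte form, always invalid today
        elif b >= 0xF8:
            need = 4   # old 5-byte form, always invalid today
        elif b >= 0xF0:
            need = 3
        elif b >= 0xE0:
            need = 2
        else:
            need = 1   # C0..DF, including the overlong leads C0 and C1
        # Absorb at most `need` plausible trail bytes.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        seq = data[i:j]
        try:
            # Python's strict decoder rejects overlong forms, surrogates,
            # truncated sequences and code points above U+10FFFF.
            out.append(("char", ord(seq.decode("utf-8"))))
        except UnicodeDecodeError:
            out.append(("error", seq))
        i = j
    return out
```

With this sketch, scan_utf8(b"\xC0\x80") yields a single
error covering both bytes, while scan_utf8(b"\xC1\x3A")
yields an error for C1 followed by the valid U+003A, because
3A is not in the trail byte range and is never skipped.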

>>> The error was: utf8 Illegal overlong form "\xC1\x3A"
 
> No, "overlong form" is not a commonly understood concept

While I disagree, 3A is not a plausible trail byte.  We
don't know what went wrong, and silently ignoring the 3A
could cause spurious error reports.

 Frank

Received on Wednesday, 28 May 2008 12:54:33 UTC