Re: Error message for invalid UTF-8 overlong forms should be improved from Frank Ellermann on 2008-05-28 (www-validator@w3.org from May 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 28 May 2008 14:55:15 +0200
To: www-validator@w3.org
Message-ID: <g1jkkn$afb$1@ger.gmane.org>

Jukka K. Korpela wrote:

> "UTF-8 overlong form" is a misnomer.

STD 63 uses "overlong UTF-8 sequence" for C0 80, of course
stating that this is an error.  The older RFC 2279 said
"invalid" for the same example.

The "overlong" business can be interesting for smart error
handling: C0 80 should be one error, not two, while C1 3A
can be reported as an error followed by a valid UTF-8 U+003A.

Smart error handling could also minimize the reported
errors for surrogates and code points above plane 16: it
can silently skip all plausible trail bytes (80..BF) in an
invalid sequence starting with C0..FD.
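
As a rough sketch of that idea, in Python: this is not the
validator's actual code.  The names scan_utf8_errors and
expected_trail_count and the error labels are made up for
illustration, and the lengths for lead bytes F8..FD follow
the old RFC 2279 five- and six-byte patterns.  Each invalid
sequence yields one (offset, length, reason) entry, and only
plausible trail bytes are swallowed into it.

```python
# Hypothetical sketch: one error per invalid UTF-8 sequence.

def is_trail(byte):
    """True for a plausible trail byte, 80..BF."""
    return 0x80 <= byte <= 0xBF

def expected_trail_count(lead):
    """Trail bytes announced by a lead byte in C0..FD (RFC 2279 patterns)."""
    if lead < 0xE0:
        return 1
    if lead < 0xF0:
        return 2
    if lead < 0xF8:
        return 3
    if lead < 0xFC:
        return 4
    return 5

# Smallest code point that genuinely needs this many trail bytes.
MIN_CODE_POINT = {1: 0x80, 2: 0x800, 3: 0x10000, 4: 0x200000, 5: 0x4000000}

def scan_utf8_errors(data):
    """Return a list of (offset, length, reason), one per invalid sequence."""
    errors = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # plain ASCII
            i += 1
            continue
        if lead < 0xC0 or lead > 0xFD:       # stray trail byte, or FE/FF
            reason = "unexpected trail byte" if is_trail(lead) else "invalid byte"
            errors.append((i, 1, reason))
            i += 1
            continue
        need = expected_trail_count(lead)
        j = i + 1
        # Swallow at most `need` plausible trail bytes, nothing else.
        while j < len(data) and j - i <= need and is_trail(data[j]):
            j += 1
        if j - i - 1 < need:
            errors.append((i, j - i, "missing trail byte"))
        else:
            cp = lead & (0x7F >> (need + 1))  # payload bits of the lead byte
            for byte in data[i + 1:j]:
                cp = (cp << 6) | (byte & 0x3F)
            if cp < MIN_CODE_POINT[need]:
                errors.append((i, j - i, "overlong"))
            elif 0xD800 <= cp <= 0xDFFF:
                errors.append((i, j - i, "surrogate"))
            elif cp > 0x10FFFF:
                errors.append((i, j - i, "above U+10FFFF"))
        i = j                                 # resume after the skipped bytes
    return errors
```

With this sketch C0 80 gives a single "overlong" entry of
length 2, while C1 3A gives one entry of length 1 for the
C1 and leaves the 3A to be read as an ordinary U+003A.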

>>> The error was: utf8 Illegal overlong form "\xC1\x3A"

> No, "overlong form" is not a commonly understood concept

I disagree.  However, 3A is not a plausible trail byte, so
we don't know what went wrong, and silently ignoring the
3A could cause spurious error reports.

 Frank

Received on Wednesday, 28 May 2008 12:54:33 UTC