UTC Agenda Item: Recommendations for handling ill-formed sequences from Mark Davis on 2008-04-11 (public-html@w3.org from April 2008)

From: Mark Davis <mark.davis@icu-project.org>
Date: Fri, 11 Apr 2008 15:38:56 -0700
To: UTC <unicore@Unicode.ORG>, "utc-chair@unicode.org" <utc-chair@unicode.org>, public-html@w3.org
Message-ID: <30b660a20804111538t49de1d09le905f7e775f0f47b@mail.gmail.com>
Please add the following to the doc registry and agenda for the next UTC
meeting. Am also cc'ing public-html@w3.org for comments.

===========

Re: Recommendations for handling ill-formed sequences
To: UTC
Date: April 11
From: Mark Davis

In converting or validating Unicode, there is no requirement that an
ill-formed sequence be replaced by U+FFFD characters; an application can,
for example, throw an exception. However, when replacement is done, we
should at least indicate what the recommended practice is, so that people
can require conformance to that practice for interoperability. (Following
the proposal is an email trail that sparked this proposal.)

Here is a proposal for adding such a recommendation to a future version of
the standard, and to an FAQ in the meantime. (The wording is draft, and
would be refined by the editorial committee.)

When replacing an ill-formed sequence by one or more U+FFFD characters, the
recommended practice is to progress through the sequence as follows, where
at each byte:

   - If the byte cannot start a minimal well-formed code unit subsequence
   (D85a), skip that byte and emit one U+FFFD character.
   - Otherwise, find the longest sequence of bytes that are at the start
   of some minimal well-formed code unit subsequence (D85a), then skip
   them and emit one U+FFFD character.
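
The two steps above can be sketched in Python for the UTF-8 case. This is
only an illustration of the draft wording, not a reference implementation:
the names `lead_info` and `decode_with_replacement` are made up for this
sketch, and the per-lead-byte ranges are assumed from the UTF-8 ranges listed
in this message.

```python
REPLACEMENT = "\uFFFD"


def lead_info(lead):
    """Return (length, lo, hi) for a lead byte, or None if it cannot start
    a well-formed sequence. lo..hi is the range allowed for the second byte.
    (Hypothetical helper; ranges assumed from the table in this message.)"""
    if 0xC2 <= lead <= 0xDF:
        return (2, 0x80, 0xBF)
    if lead == 0xE0:
        return (3, 0xA0, 0xBF)
    if lead == 0xED:
        return (3, 0x80, 0x9F)
    if 0xE1 <= lead <= 0xEF:  # E1..EC and EE..EF
        return (3, 0x80, 0xBF)
    if lead == 0xF0:
        return (4, 0x90, 0xBF)
    if 0xF1 <= lead <= 0xF3:
        return (4, 0x80, 0xBF)
    if lead == 0xF4:
        return (4, 0x80, 0x8F)
    return None  # 80..C1 and F5..FF


def decode_with_replacement(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII is always well-formed
            out.append(chr(lead))
            i += 1
            continue
        info = lead_info(lead)
        if info is None:
            # Step 1: the byte cannot start a sequence; skip it alone.
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = info
        # Step 2: extend over the longest run of bytes that is still a
        # prefix of some well-formed sequence.
        j = i + 1
        while j < i + length and j < n:
            b = data[j]
            ok = (lo <= b <= hi) if j == i + 1 else (0x80 <= b <= 0xBF)
            if not ok:
                break
            j += 1
        if j == i + length:
            out.append(data[i:j].decode("utf-8"))  # complete sequence
        else:
            out.append(REPLACEMENT)  # one U+FFFD for the whole prefix
        i = j
    return "".join(out)
```

Under this sketch a truncated sequence such as E2 82 yields one U+FFFD, while
an encoded surrogate such as ED A0 80 yields three, because ED followed by A0
is not the start of any well-formed sequence and each continuation byte is
then handled on its own.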

For example, in UTF-8 each of the following ill-formed subsequences would be
replaced by a single U+FFFD, given a following byte. The ! means that a byte
is missing (end of the byte sequence) or not within the given range.
Typically this is !80..BF; exceptions are underlined below.

 *Sequences to be replaced by U+FFFD*    *If followed by*

 80..C1                                  *!00..FF*
 C2..DF                                  !80..BF
 E0                                      !A0..BF
 E1..EC                                  !80..BF
 ED                                      !80..9F
 EE..EF                                  !80..BF
 F0                                      !90..BF
 F1..F3                                  !80..BF
 F4                                      !80..8F
 F5..FF                                  *!00..FF*
 E0      A0..BF                          !80..BF
 E1..EC  80..BF                          !80..BF
 ED      80..9F                          !80..BF
 EE..EF  80..BF                          !80..BF
 F0      90..BF                          !80..BF
 F1..F3  80..BF                          !80..BF
 F4      80..8F                          !80..BF
 F0      90..BF  80..BF                  !80..BF
 F1..F3  80..BF  80..BF                  !80..BF
 F4      80..8F  80..BF                  !80..BF

Comment on the above from  Øistein E. Andersen:

... your proposal appears to be similar to what browsers have already
implemented as well as to Markus Kuhn's notion of `malformed sequences'
described in <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.
One notable difference is that overlong sequences as well as UTF-8 sequences
representing surrogates and characters outside Unicode (>10FFFF) will
typically map to several replacement characters according to your proposal,
but to only one in Markus Kuhn's system.  This difference may not be a
problem in practice and your proposal may well be superior, but it might
nevertheless be worthwhile to consider what current implementations do
(Safari is quite close to what Markus Kuhn suggests, and I believe I have
seen browsers do what your proposal suggests for the range >10FFFF) as well
as what seems reasonable and not too cumbersome to specify.  The comments in
this paragraph may also be forwarded as you find appropriate.

===================

>
> -----Original Message-----
> Date/Time:    Fri Apr 11 12:29:13 CDT 2008
> Contact:      < <oa223@cam.ac.uk>html5@xn--istein-9xa.com>
> Name:         Andersen
> Report Type:  Other Question, Problem, or Feedback
> Opt Subject:  Error handling for UTF-8
>
> Dear Sir or Madam,
>
> The editor of HTML5, Ian Hickson, has expressed that he would like Unicode
> to define error handling for UTF-8 in more detail, more specifically that
> any byte stream labelled as UTF-8 unambiguously map to a sequence of Unicode
> characters (assuming that erroneous byte sequences are handled by insertion
> of U+FFFD characters).
> This is not currently (per Unicode 5.1) the case since the number of
> U+FFFD characters is left undefined.
>
> The following quotes are from some of the e-mails sent to
> public-html@w3.org concerning this issue.
>
> Ian Hickson:
>        [The Unicode standards] should define error handling, and are
>        defective if they don't.
>        ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0408.html>
>
> Ian Hickson:
>        The point is that Unicode _doesn't_ define exactly how many bytes
>        form one ill-formed sequence. Unicode doesn't define the error
>        handling in enough detail to get interoperable handling of
>        arbitrary non-conforming byte streams.
>        ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0437.html>
>
> [My comment: Unicode 5.1 _does_ define the concept of an ill-formed
> sequence, but this does not completely solve the issue given that the number
> of replacement characters to emit remains undefined.]
>
> Anne van Kesteren:
>        I agree that it would be ideal if for input 'charset' and 'byte
>        stream', output 'character stream' is always identical regardless
>        of what implementation you pick, but the [Unicode] specification
>        does not seem to be developed with that in mind.
>        ---<http://lists.w3.org/Archives/Public/public-html/2008Apr/0191.html>
>
> Thanks in advance for considering this.  Retroactively modifying
> conformance criteria may not be an attractive option, but a clear suggestion
> for new implementations to follow would also be useful.
>
> Yours faithfully,
> Øistein E. Andersen
>
Received on Friday, 11 April 2008 22:49:41 UTC