- From: Mark Davis <mark.davis@icu-project.org>
- Date: Fri, 11 Apr 2008 15:38:56 -0700
- To: UTC <unicore@Unicode.ORG>, "utc-chair@unicode.org" <utc-chair@unicode.org>, public-html@w3.org
- Message-ID: <30b660a20804111538t49de1d09le905f7e775f0f47b@mail.gmail.com>
Please add the following to the doc registry and agenda for the next UTC meeting. Am also cc'ing public-html@w3.org for comments.

===========

Re: Recommendations for handling ill-formed sequences
To: UTC
Date: April 11
From: Mark Davis

In converting or validating Unicode, there is no requirement that an ill-formed sequence be replaced by U+FFFD characters; an application can, for example, throw an exception. However, when replacement is done, we should at least indicate what the recommended practice is, so that people can require conformance to that practice for interoperability. (Following the proposal is an email trail that sparked this proposal.)

Here is a proposal for adding such a recommendation to a future version of the standard, and to an FAQ in the meantime. (The wording is draft, and would be refined by the editorial committee.)

When replacing an ill-formed sequence by one or more U+FFFD characters, the recommended practice is to progress through the sequence as follows, where at each byte:

- If the byte cannot start a minimal well-formed code unit subsequence (D85a), skip that byte and emit one U+FFFD character.
- Otherwise, find the longest sequence of bytes that are at the start of some minimal well-formed code unit subsequence (D85a), then skip them and emit one U+FFFD character.

For example, in UTF-8 each of the following ill-formed subsequences would be replaced by a single U+FFFD, given a following byte. The ! means that a byte is missing (end of the byte sequence) or not within the given range. Typically this is !80..BF; exceptions are underlined below.

*Sequences to be replaced by U+FFFD*    *If followed by*
80..C1                                  *!00..FF*
C2..DF                                  !80..BF
E0                                      !A0..BF
E1..EC                                  !80..BF
ED                                      !80..9F
EE..EF                                  !80..BF
F0                                      !90..BF
F1..F3                                  !80..BF
F4                                      !80..8F
F5..FF                                  *!00..FF*
E0 A0..BF                               !80..BF
E1..EC 80..BF                           !80..BF
ED 80..9F                               !80..BF
EE..EF 80..BF                           !80..BF
F0 90..BF                               !80..BF
F1..F3 80..BF                           !80..BF
F4 80..8F                               !80..BF
F0 90..BF 80..BF                        !80..BF
F1..F3 80..BF 80..BF                    !80..BF
F4 80..8F 80..BF                        !80..BF

Comment on the above from Øistein E. Andersen:

... your proposal appears to be similar to what browsers have already implemented as well as to Markus Kuhn's notion of `malformed sequences' described in <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>. One notable difference is that overlong sequences as well as UTF-8 sequences representing surrogates and characters outside Unicode (>10FFFF) will typically map to several replacement characters according to your proposal, but to only one in Markus Kuhn's system. This difference may not be a problem in practice and your proposal may well be superior, but it might nevertheless be worthwhile to consider what current implementations do (Safari is quite close to what Markus Kuhn suggests, and I believe I have seen browsers do what your proposal suggests for the range >10FFFF) as well as what seems reasonable and not too cumbersome to specify. The comments in this paragraph may also be forwarded as you find appropriate.
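
The recommended practice and the table above can be sketched in Python roughly as follows. This is an illustration only, not part of the proposal: the function names (sequence_length, second_byte_range, decode_with_replacement) are hypothetical, and the sketch assumes that the ranges in the table are the complete set of prefixes of well-formed UTF-8 sequences and that bytes 00..7F pass through unchanged.

```python
REPLACEMENT = "\uFFFD"


def sequence_length(lead: int) -> int:
    """Length of the well-formed sequence a lead byte can start, or 0 if none."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 80..C1 and F5..FF can never start a well-formed sequence


def second_byte_range(lead: int) -> tuple[int, int]:
    """Allowed range for the second byte; only four lead bytes narrow it."""
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong three-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong four-byte forms
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes values above 10FFFF
    return (0x80, 0xBF)


def decode_with_replacement(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # single-byte (ASCII) sequence
            out.append(chr(lead))
            i += 1
            continue
        length = sequence_length(lead)
        if length == 0:  # first rule: skip this byte, emit one U+FFFD
            out.append(REPLACEMENT)
            i += 1
            continue
        # Second rule: consume the longest prefix of a well-formed sequence.
        low, high = second_byte_range(lead)
        code_point = lead & (0xFF >> (length + 1))  # payload bits of the lead byte
        j = i + 1
        while j < i + length and j < n and low <= data[j] <= high:
            code_point = (code_point << 6) | (data[j] & 0x3F)
            j += 1
            low, high = 0x80, 0xBF  # later trailing bytes use the plain range
        # A complete sequence decodes normally; a truncated prefix becomes
        # exactly one U+FFFD, and decoding resumes at the offending byte.
        out.append(chr(code_point) if j == i + length else REPLACEMENT)
        i = j
    return "".join(out)
```

Under these assumptions, decode_with_replacement(b"\xC0\x80") yields two U+FFFD characters, b"\xED\xA0\x80" (an encoded surrogate) yields three, and b"\xF4\x90\x80\x80" (above 10FFFF) yields four, which matches the observation in the comment above that such inputs map to several replacement characters. A truncated sequence followed by other data, such as b"\xE2\x82A", yields a single U+FFFD followed by "A".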

===================

> -----Original Message-----
> Date/Time: Fri Apr 11 12:29:13 CDT 2008
> Contact: < <oa223@cam.ac.uk>html5@xn--istein-9xa.com>
> Name: Andersen
> Report Type: Other Question, Problem, or Feedback
> Opt Subject: Error handling for UTF-8
>
> Dear Sir or Madam,
>
> The editor of HTML5, Ian Hickson, has expressed that he would like Unicode
> to define error handling for UTF-8 in more detail, more specifically that
> any byte stream labelled as UTF-8 unambiguously map to a sequence of Unicode
> characters (assuming that erroneous byte sequences are handled by insertion
> of U+FFFD characters).
> This is not currently (per Unicode 5.1) the case since the number of
> U+FFFD characters is left undefined.
>
> The following quotes are from some of the e-mails sent to
> public-html@w3.org concerning this issue.
>
> Ian Hickson:
>     [The Unicode standards] should define error handling, and are
>     defective if they don't.
>     ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0408.html>
>
> Ian Hickson:
>     The point is that Unicode _doesn't_ define exactly how many bytes
>     form one ill-formed sequence. Unicode doesn't define the error
>     handling in enough detail to get interoperable handling of arbitrary
>     non-conforming byte streams.
>     ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0437.html>
>
> [My comment: Unicode 5.1 _does_ define the concept of an ill-formed
> sequence, but this does not completely solve the issue given that the number
> of replacement characters to emit remains undefined.]
>
> Anne van Kesteren:
>     I agree that it would be ideal if for input 'charset' and 'byte
>     stream', output 'character stream' is always identical regardless of
>     what implementation you pick, but the [Unicode] specification does
>     not seem to be developed with that in mind.
>     ---<http://lists.w3.org/Archives/Public/public-html/2008Apr/0191.html>
>
> Thanks in advance for considering this. Retroactively modifying
> conformance criteria may not be an attractive option, but a clear suggestion
> for new implementations to follow would also be useful.
>
> Yours faithfully,
> Øistein E. Andersen

Received on Friday, 11 April 2008 22:49:41 UTC