- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Mon, 07 Apr 2008 19:49:50 +0200
- To: public-html@w3.org
Unicode 5.1 properly defines ill-formed subsequences and makes it clear(er) that these shall never impede correct interpretation of adjacent, well-formed UTF-8 byte sequences.

Unfortunately, however, no guidance is given as to how many replacement characters should be emitted for a multi-byte ill-formed subsequence (not even that the number should not exceed the number of bytes, but this is clearly intended). I do realise, of course, that it may be problematic to make this a conformance criterion, but it might be useful if a future version of the standard could at least provide a suggestion for new implementations.

(Relevant quote and link may be found below.)

-- 
Øistein E. Andersen

Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.

For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence.

Unicode 5.1.0 <http://unicode.org/versions/Unicode5.1.0/>
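
The two replacement strategies named in the quoted passage can be illustrated with a minimal Python sketch. It is not taken from the standard: the function name `decode_utf8` and the policy labels `"single"` and `"per_byte"` are hypothetical, and the sketch relies on Python's built-in UTF-8 decoder with the `surrogateescape` error handler to mark each undecodable byte. Note that the `"single"` policy here collapses any unbroken run of ill-formed bytes into one U+FFFD, which is only one possible reading of "a single error".

```python
import re

REPLACEMENT = "\ufffd"

# With errors="surrogateescape", Python maps every undecodable byte 0xNN
# to the lone surrogate U+DCNN, so these code points mark ill-formed bytes.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")
_ESCAPED_RUN = re.compile("[\udc80-\udcff]+")


def decode_utf8(data: bytes, policy: str = "per_byte") -> str:
    """Decode UTF-8, replacing ill-formed bytes according to `policy`.

    "per_byte": one U+FFFD for each ill-formed byte.
    "single":   one U+FFFD for each unbroken run of ill-formed bytes
                (an assumption made for this sketch).
    """
    # Well-formed subsequences are decoded normally and never consumed
    # as part of error handling.
    marked = data.decode("utf-8", errors="surrogateescape")
    if policy == "per_byte":
        return _ESCAPED_BYTE.sub(REPLACEMENT, marked)
    if policy == "single":
        return _ESCAPED_RUN.sub(REPLACEMENT, marked)
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    sample = b"\xf0\x80\x80\x41"  # the <F0 80 80 41> example
    for name in ("single", "per_byte"):
        decoded = decode_utf8(sample, name)
        print(name, ["U+%04X" % ord(ch) for ch in decoded])
```

Under either policy the trailing <41> is still interpreted as U+0041, which is the only behaviour the quoted text actually requires; the policies differ solely in how many replacement characters precede it.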
Received on Monday, 7 April 2008 19:20:13 UTC