- From: Mark Davis <mark.davis@icu-project.org>
- Date: Mon, 3 Mar 2008 13:20:38 -0800
- To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- Message-ID: <30b660a20803031320q2caef186mf6e3c68cf93469cc@mail.gmail.com>
Attached is a message that may not have shown up.

Mark

---------- Forwarded message ----------
From: Richard Ishida <ishida@w3.org>
Date: Mon, Mar 3, 2008 at 10:42 AM
Subject: RE: several messages about handling encodings in HTML
To: Mark Davis <mark.davis@icu-project.org>

Yes, I got that. It was addressed just to me though. Will you send it to public-i18n-core again? I'll look out for it, and let you know if it arrives there.

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/

> -----Original Message-----
> From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis
> Sent: 03 March 2008 18:18
> To: ishida@w3.org
> Subject: Fwd: several messages about handling encodings in HTML
>
> Hmmm. Not sure why this didn't get through. Let me know if you get it.
>
> ---------- Forwarded message ----------
> From: Mark Davis <mark.davis@google.com>
> Date: Thu, Feb 28, 2008 at 6:05 PM
> Subject: Re: several messages about handling encodings in HTML
> To: Ian Hickson <ian@hixie.ch>
> Cc: public-i18n-core@w3.org
>
> Sorry, hit the send button by accident earlier. My comment below.
>
> On Thu, Feb 28, 2008 at 5:21 PM, Ian Hickson <ian@hixie.ch> wrote:
> ...
>
> > > On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> > > > >>
> > > > >>> Bytes or sequences of bytes in the original byte stream that could
> > > > >>> not be converted to Unicode characters must be converted to U+FFFD
> > > > >>> REPLACEMENT CHARACTER code points.
> > > > >>
> > > > >> [This does not specify the exact number of replacement characters.]
> > > > >
> > > > > I don't really know how to define this.
> > > >
> > > > Unicode 5.0 remains vague on this point. (E.g., definition D92 defines
> > > > well-formed and ill-formed UTF-8 byte sequences, but conformance
> > > > requirement C10 only requires ill-formed sequences to be treated as an
> > > > error condition and suggests that a one-byte ill-formed sequence may be
> > > > either filtered out or replaced by a U+FFFD replacement character.) More
> > > > generally, character encoding specifications can hardly be expected to
> > > > define proper error handling, since they are usually not terribly
> > > > preoccupied with mislabelled data.
> > >
> > > They should define error handling, and are defective if they don't.
> > > However, I agree that many specs are defective. This is certainly not
> > > limited to character encoding specifications.
>
> This was discussed in the Unicode consortium, and there is certain
> text introduced in Unicode 5.1. In particular, the way some
> applications implemented error handling, they might "eat into" valid
> subsequent characters. That is now (well, will shortly be) expressly
> forbidden. It was important to get this done, since it can be used in
> security exploits. See http://www.unicode.org/versions/Unicode5.1.0/
>
> However, as far as whether a sequence of erroneous bytes should be
> considered one error or several, that is left to the implementation:
>
> "Although a UTF-8 conversion process is required to never consume
> well-formed subsequences as part of its error handling for ill-formed
> subsequences, such a process is not otherwise constrained in how it
> deals with any ill-formed subsequence itself. An ill-formed
> subsequence consisting of more than one code unit could be treated as
> a single error or as multiple errors. For example, in processing the
> UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a
> converter is that the <41> be processed and correctly interpreted as
> <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80
> 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling
> each byte of <F0 80 80> as a separate error, or could take other
> approaches to signalling <F0 80 80> as an ill-formed code unit
> subsequence."
>
> In a perfect world, the standard would have said whether the <F0 80
> 80> (for example) was one error or many from the beginning. For
> example, a common approach is to consume at least one byte, but then
> stop just before the first byte that could not be added to make a
> valid character. For example, in the sequence <C0 80 C2 E0 80 F0 80
> 80> you'd get the following separate errors (x marking the boundaries
> between the sequences considered as errors): x C0 x 80 x C2 x E0 80 x
> F0 80 80 x.
>
> However, implementations are not consistent in how they currently
> break up the erroneous sequences, and the members of the consortium
> did not feel that it was important to enforce a single approach in
> this area.
>
> --
> Mark
>
> --
> Mark

--
Mark
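
For illustration, the "consume at least one byte, then stop just before the first byte that could not be added" approach described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not text from the Unicode Standard and not the behaviour of any particular converter. The names `expected_length`, `segments` and `decode` are hypothetical. The sketch also assumes that any trailing byte in the range 80..BF may be absorbed up to the length implied by the lead byte, which is what reproduces the grouping "x C0 x 80 x C2 x E0 80 x F0 80 80 x" given in the message. Because only trailing bytes are ever absorbed, a well-formed following character such as <41> is never consumed.

```python
from typing import Iterator, Optional, Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte; 0 if it can never start one."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 80..BF (trailing bytes), C0, C1, F5..FF


def segments(data: bytes) -> Iterator[Tuple[bytes, Optional[str]]]:
    """Yield (chunk, text) pairs; text is None when the chunk is one error."""
    i = 0
    while i < len(data):
        need = expected_length(data[i])
        j = i + 1  # always consume at least one byte
        # Absorb trailing bytes (assumption: generic 80..BF range) up to the
        # implied length, stopping just before any byte that cannot be added.
        while j < len(data) and j - i < need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            # Strict decoding rejects truncated, overlong and surrogate forms.
            text: Optional[str] = chunk.decode("utf-8")
        except UnicodeDecodeError:
            text = None
        yield chunk, text
        i = j


def decode(data: bytes) -> str:
    """Decode, emitting exactly one U+FFFD per erroneous chunk."""
    return "".join(REPLACEMENT if t is None else t for _, t in segments(data))


if __name__ == "__main__":
    sample = bytes([0xC0, 0x80, 0xC2, 0xE0, 0x80, 0xF0, 0x80, 0x80])
    # bytes.hex(sep) needs Python 3.8 or later.
    print(" x ".join(chunk.hex(" ").upper() for chunk, _ in segments(sample)))
    # prints: C0 x 80 x C2 x E0 80 x F0 80 80
```

With this grouping, `decode(b"\xf0\x80\x80A")` yields <U+FFFD, U+0041>, the first of the two results mentioned in the quoted Unicode 5.1 text. Emitting one U+FFFD per byte of each erroneous chunk instead would give the second, <U+FFFD, U+FFFD, U+FFFD, U+0041>.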
Received on Monday, 3 March 2008 21:20:54 UTC