
Fwd: several messages about handling encodings in HTML

From: Mark Davis <mark.davis@icu-project.org>
Date: Mon, 3 Mar 2008 13:20:38 -0800
Message-ID: <30b660a20803031320q2caef186mf6e3c68cf93469cc@mail.gmail.com>
To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Attached is a message that may not have shown up.

Mark

---------- Forwarded message ----------
From: Richard Ishida <ishida@w3.org>
Date: Mon, Mar 3, 2008 at 10:42 AM
Subject: RE: several messages about handling encodings in HTML
To: Mark Davis <mark.davis@icu-project.org>


Yes, I got that. It was addressed just to me though. Will you send it to
public-i18n-core again? I'll look out for it, and let you know if it arrives
there.

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/



> -----Original Message-----
> From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On
> Behalf Of Mark Davis
> Sent: 03 March 2008 18:18
> To: ishida@w3.org
> Subject: Fwd: several messages about handling encodings in HTML
>
> Hmmm. Not sure why this didn't get through. Let me know if you get it.
>
>
> ---------- Forwarded message ----------
> From: Mark Davis <mark.davis@google.com>
> Date: Thu, Feb 28, 2008 at 6:05 PM
> Subject: Re: several messages about handling encodings in HTML
> To: Ian Hickson <ian@hixie.ch>
> Cc: public-i18n-core@w3.org
>
>
> Sorry, hit the send button by accident earlier. My comment below.
>
>
>  > On Thu, Feb 28, 2008 at 5:21 PM, Ian Hickson <ian@hixie.ch> wrote:
>  ...
>
> >  >  On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
>  >  >  > >>
>  >  >  > >>> Bytes or sequences of bytes in the original byte stream that
>  >  >  > >>> could not be converted to Unicode characters must be converted
>  >  >  > >>> to U+FFFD REPLACEMENT CHARACTER code points.
>  >  >  > >>
>  >  >  > >> [This does not specify the exact number of replacement
>  >  >  > >> characters.]
>  >  >  > >
>  >  >  > > I don't really know how to define this.
>  >  >  >
>  >  >  > Unicode 5.0 remains vague on this point. (E.g., definition D92
>  >  >  > defines well-formed and ill-formed UTF-8 byte sequences, but
>  >  >  > conformance requirement C10 only requires ill-formed sequences to
>  >  >  > be treated as an error condition and suggests that a one-byte
>  >  >  > ill-formed sequence may be either filtered out or replaced by a
>  >  >  > U+FFFD replacement character.) More generally, character encoding
>  >  >  > specifications can hardly be expected to define proper error
>  >  >  > handling, since they are usually not terribly preoccupied with
>  >  >  > mislabelled data.
>  >  >
>  >  >  They should define error handling, and are defective if they don't.
>  >  >  However, I agree that many specs are defective. This is certainly
>  >  >  not limited to character encoding specifications.
>
>  This was discussed in the Unicode Consortium, and text addressing it
>  was introduced in Unicode 5.1. In particular, some applications
>  implemented error handling in a way that could "eat into" valid
>  subsequent characters. That is now (well, will shortly be) expressly
>  forbidden. It was important to get this done, since such behavior can
>  be used in security exploits. See
>  http://www.unicode.org/versions/Unicode5.1.0/
>
>  However, as far as whether a sequence of erroneous bytes should be
>  considered one error or several, that is left to the implementation:
>
>  "Although a UTF-8 conversion process is required to never consume
>  well-formed subsequences as part of its error handling for ill-formed
>  subsequences, such a process is not otherwise constrained in how it
>  deals with any ill-formed subsequence itself. An ill-formed
>  subsequence consisting of more than one code unit could be treated as
>  a single error or as multiple errors. For example, in processing the
>  UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a
>  converter is that the <41> be processed and correctly interpreted as
>  <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80
>  80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling
>  each byte of <F0 80 80> as a separate error, or could take other
>  approaches to signalling <F0 80 80> as an ill-formed code unit
>  subsequence."
>
>  In a perfect world, the standard would have said from the beginning
>  whether <F0 80 80> (for example) is one error or many. A common
>  approach is to consume at least one byte, but then stop just before
>  the first byte that could not be added to make a valid character. For
>  example, in the sequence <C0 80 C2 E0 80 F0 80 80> you'd get the
>  following separate errors (x marking the boundaries between the
>  sequences considered as errors): x C0 x 80 x C2 x E0 80 x F0 80 80 x.
>
>  However, implementations are not consistent in how they currently
>  break up the erroneous sequences, and the members of the consortium
>  did not feel that it was important to enforce a single approach in
>  this area.
>
>  --
>  Mark
>
>
>
> --
> Mark
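
The grouping described in the forwarded message above (consume at least
one byte, then stop just before the first byte that cannot extend the
sequence) can be sketched roughly as follows. This is only an
illustration, not text from Unicode 5.1 or from any particular converter.
The names split_utf8, expected_trail_count and decode_with_replacement are
hypothetical, and treating a full-length but invalid sequence (an overlong
form or an encoded surrogate) as a single error is an assumption of this
sketch. Only continuation bytes (80..BF) are ever attached to an error, so
a well-formed subsequence is never consumed.

```python
REPLACEMENT = "\ufffd"


def expected_trail_count(lead: int) -> int:
    """Number of continuation bytes a lead byte calls for, or -1 if the
    byte can never start a UTF-8 sequence (80..BF, C0, C1, F5..FF)."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return -1


def split_utf8(data: bytes):
    """Split data into (chunk, is_error) pairs.

    Each chunk starts with one byte and is extended with continuation
    bytes (80..BF) until the lead byte's expected length is reached or a
    non-continuation byte (or the end of input) is found.
    """
    pieces = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII bytes are always well-formed on their own.
            pieces.append((data[i:i + 1], False))
            i += 1
            continue
        need = expected_trail_count(lead)
        end = i + 1
        if need > 0:
            limit = min(len(data), i + 1 + need)
            while end < limit and 0x80 <= data[end] <= 0xBF:
                end += 1
        chunk = data[i:end]
        try:
            # Strict check: rejects truncated, overlong and surrogate forms.
            chunk.decode("utf-8")
            is_error = False
        except UnicodeDecodeError:
            is_error = True
        pieces.append((chunk, is_error))
        i = end
    return pieces


def decode_with_replacement(data: bytes) -> str:
    """Decode data, emitting one U+FFFD per error chunk (assumed policy)."""
    return "".join(
        REPLACEMENT if is_error else chunk.decode("utf-8")
        for chunk, is_error in split_utf8(data)
    )


if __name__ == "__main__":
    sample = bytes.fromhex("C080C2E080F08080")
    errors = [chunk.hex(" ").upper() for chunk, bad in split_utf8(sample) if bad]
    print(" x ".join(errors))
    print(ascii(decode_with_replacement(bytes.fromhex("F0808041"))))
```

Under this policy <F0 80 80 41> comes out as <U+FFFD, U+0041>. A converter
that reports each byte separately would produce three U+FFFD before the
U+0041 instead; the quoted Unicode text permits both, provided the <41> is
preserved.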




-- 
Mark
Received on Monday, 3 March 2008 21:20:54 GMT
