RE: internationalization/ISO10646 question from Martin Duerst on 2002-12-04 (ietf-charsets@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 05 Dec 2002 02:00:09 +0900
To: Marcin Hanclik <mhanclik@poczta.onet.pl>, ned.freed@mrochek.com
Cc: ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20021205015443.048c8c58@localhost>
Hello Marcin,

I think Ned has said similar things before, but:

For email (SMTP,...),
      Content-Type: text/plain; charset="UTF-16"
is simply illegal because of the CR/LF restrictions.
The same applies to
      Content-Type: text/plain; charset="iso-10646-ucs-2"
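
As a rough illustration of the restriction (a sketch added for clarity, not
something from the thread; the helper name `encodes_crlf_literally` is made up
here), one can check which encodings represent a line break as the literal
octets 0x0D 0x0A:

```python
def encodes_crlf_literally(codec: str) -> bool:
    """Return True if CR LF is encoded as exactly the two octets 0x0D 0x0A."""
    return "\r\n".encode(codec) == b"\r\n"


def hex_octets(data: bytes) -> str:
    """Format bytes as space-separated hex pairs for display."""
    return " ".join(f"{b:02x}" for b in data)


# utf-16-le / utf-16-be are used so that no BOM is prepended by the codec.
for codec in ("us-ascii", "utf-8", "utf-16-le", "utf-16-be"):
    encoded = "\r\n".encode(codec)
    print(f"{codec:10} {hex_octets(encoded):12} literal CRLF: {encodes_crlf_literally(codec)}")
```

In the 16-bit encodings each of CR and LF takes two octets, so the two-octet
sequence 0x0D 0x0A never appears as such, which is the problem for MIME text.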

For HTTP, the situation is a bit different. HTTP uses some
'variant' of MIME, and does not enforce the CR/LF restrictions. So
      Content-Type: text/plain; charset="UTF-16"
or more typically
      Content-Type: text/html; charset="UTF-16"
is legal in HTTP. The same would apply for
      Content-Type: text/plain; charset="iso-10646-ucs-2"
or
      Content-Type: text/html; charset="iso-10646-ucs-2"

Whether a BOM should be accepted or not in that case
depends on the registration of iso-10646-ucs-2.
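
To make the two possible behaviours concrete, here is a small sketch using the
octets quoted further down in this thread (0xFF 0xFE 0x66 0x00 0x65 0x00). The
function `decode_ucs2` and its `accept_bom` flag are hypothetical, and treating
BOM-less data as big-endian is an assumption of this sketch, not something the
registration is claimed to say:

```python
import codecs


def decode_ucs2(data: bytes, accept_bom: bool = True) -> str:
    """Decode 16-bit data, optionally honouring a leading byte order mark.

    Assumption: when no BOM is honoured, the data is read as big-endian.
    Python's utf-16-le/utf-16-be codecs stand in for UCS-2 here.
    """
    if accept_bom and data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if accept_bom and data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


sample = bytes([0xFF, 0xFE, 0x66, 0x00, 0x65, 0x00])

print(repr(decode_ucs2(sample, accept_bom=True)))   # BOM selects little-endian
print(repr(decode_ucs2(sample, accept_bom=False)))  # BOM octets read as data
```

A receiver that honours the BOM recovers the two-letter string; one that does
not reads the same octets as three unrelated code points.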

Regards,    Martin.

At 21:09 02/11/25 +0100, Marcin Hanclik wrote:
>Hi!
>
>Your explanation means that you cannot send UTF-16 encoded text, because it
>cannot preserve CRLF.
>You could not send any Unicode characters (apart from UTF-8) in MIME then!!!
>
>The media type you are writing about is to be used in the form:
>Content-Type: text/utf-16...
>
>and I mean:
>Content-Type: text/plain; charset="UTF-16"
>
>So I understand from your mail that BOM should be accepted when we have:
>Content-Type: text/plain; charset="iso-10646-ucs-2"
>
>
>RGDS/Marcin
> > -----Original Message-----
> > From: ned.freed@mrochek.com [mailto:ned.freed@mrochek.com]
> > Sent: Monday, 25 November, 2002 19:10
> > To: Marcin Hanclik
> > Cc: ned.freed@mrochek.com
> > Subject: RE: internationalization/ISO10646 question
> >
> >
> > > Dear Ned,
> >
> > > thank you very much for your answer.
> > > However, I would like to discuss it.
> >
> > > > -----Original Message-----
> > > > From: ned.freed@mrochek.com [mailto:ned.freed@mrochek.com]
> > > > Sent: Friday, 22 November, 2002 19:52
> > > > To: Marcin Hanclik
> > > > Cc: ietf-charsets@iana.org
> > > > Subject: Re: internationalization/ISO10646 question
> > > >
> > > >
> > > > > Dear Sirs,
> > > >
> > > > > I am writing to you as experts in internationalization and ISO-10646
> > > > > issues.
> > > >
> > > > > I would be very grateful if you could help me with the issue
> > > > > described below.
> > > >
> > > > > Generally, the question concerns the MIME encoding of a text part,
> > > > > particularly the following case:
> > > > > Content-Type: text/plain; charset="iso-10646-ucs-2"
> > > > > Content-Transfer-Encoding: ...
> > > >
> > > > This, I'm afraid, is an illegal combination of elements. Specifically, any
> > > > material with a top level media type of "text" has to represent carriage
> > > > return/line feed as the literal sequence 0x0D 0x0A. iso-10646-ucs-2 clearly
> > > > does not do this, and as such is a charset that's not suited for use with
> > > > MIME text.
> > > >
> > > > This requirement is spelled out in RFC 2046 section 4.1.1.
> > > I think that is not the case. The Content-Transfer-Encoding header has to
> > > take care of CRLF handling.
> > > This is specified in RFC 2046, section 4.1.2.
> >
> > On the contrary, section 4.1.2 in fact reiterates the CRLF requirement in that
> > it discusses how the charsets can be used with other top level types "with the
> > CRLF/line break restriction removed".
> >
> > > I left this parameter empty, but generally it is BASE64 in this case.
> >
> > The restrictions on the text top level type are completely independent of what
> > content-transfer-encoding is used. It is also true that the domains of various
> > content-transfer-encodings are restricted in various ways, including but not
> > limited to the use of CRLFs, but this has nothing to do with the restrictions
> > on the text top level type.
> >
> > > >
> > > > > Data
> > > >
> > > > > Data after decoding: 0xFF 0xFE 0x66 0x00 0x65 0x00
> > > >
> > > > > Outlook Express decodes it to the string "fe". But there are people who say
> > > > > that this is just robustness on the part of Outlook Express and that the
> > > > > string is not properly encoded, because at the time when
> > > > > <charset="iso-10646-ucs-2"> was registered with IANA the byte order mark
> > > > > (BOM) did not exist.
> > > >
> > > > I don't know if there are specific rules for handling revisions to
> > > > iso-10646-ucs-2 or not. I suspect not. However, the general rule is that
> > > > additions to a charset repertoire are expected and allowed. See RFC 2279
> > > > section 3. However, the BOM is something of a special case.
> > > >
> > > This is a good argument to me.
> > > > But given the far more egregious violation going on here I really don't
> > > > think this is particularly important in the overall scheme of things.
> > > >
> > > The above violation is not the case here, I think.
> >
> > I'm sorry, but it most certainly is the case here. Indeed, there would be no
> > point in having the labelling of whether or not a given charset was "suitable
> > for use in MIME text" if it weren't for this restriction.
> >
> > This is a case where the standards are clear, the standards clearly reflect the
> > intent of the group that developed them, and the registration requirements now
> > reflect the restrictions put in place by the standards.
> >
> > You can even see this in action in the registration of things like UTF-16LE.
> > RFC 2781 section A.2 contains the registration for this charset, and among
> > other things it says "Suitable for use in MIME content types under the "text"
> > top-level type: No".
> >
> > Unfortunately iso-10646-ucs-2 was registered before the rules were in place to
> > call this out in registrations, but that doesn't mean it is suitable for use in
> > MIME text.
> >
> > > So the question remains:
> > > can I use <charset="iso-10646-ucs-2"> for the data containing BOM?
> >
> > And the answer remains that for material with a top level content type of text,
> > which is what you said you were dealing with, you cannot use this charset at
> > all. As such, any handling of it is possible, up to and including rejection of
> > the message as invalid.
> >
> > For material that isn't labelled with a top level content type of text I don't
> > think the situation is clear, but the intent has always been to allow additions
> > to charsets subsequent to registration. So I think BOM should be supported in
> > this context.
> >
> >                               Ned
Received on Wednesday, 4 December 2002 12:11:27 UTC