RE: internationalization/ISO10646 question from Marcin Hanclik on 2002-11-25 (ietf-charsets@w3.org from October to December 2002)

From: Marcin Hanclik <mhanclik@poczta.onet.pl>
Date: Mon, 25 Nov 2002 21:09:21 +0100
To: ned.freed@mrochek.com
Cc: ietf-charsets@iana.org
Message-id: <OLENIGGFKBOAIMPONAAJKEEPCDAA.mhanclik@poczta.onet.pl>
Hi!

Your explanation means that you cannot send the UTF-16 encoding either, because
it cannot preserve CRLF.
Then you could not send any Unicode encoding (apart from UTF-8) in MIME!!!

The media type you are writing about is to be used in the form:
Content-Type: text/utf-16...

and I mean:
Content-Type: text/plain; charset="UTF-16"

So I understand from your mail that BOM should be accepted when we have:
Content-Type: text/plain; charset="iso-10646-ucs-2"


RGDS/Marcin
> -----Original Message-----
> From: ned.freed@mrochek.com [mailto:ned.freed@mrochek.com]
> Sent: Monday, 25 November, 2002 19:10
> To: Marcin Hanclik
> Cc: ned.freed@mrochek.com
> Subject: RE: internationalization/ISO10646 question
>
>
> > Dear Ned,
>
> > thank you very much for your answer.
> > However, I would like to discuss it.
>
> > > -----Original Message-----
> > > From: ned.freed@mrochek.com [mailto:ned.freed@mrochek.com]
> > > Sent: Friday, 22 November, 2002 19:52
> > > To: Marcin Hanclik
> > > Cc: ietf-charsets@iana.org
> > > Subject: Re: internationalization/ISO10646 question
> > >
> > >
> > > > Dear Sirs,
> > >
> > > > I am writing to you as experts in internationalization and ISO-10646
> > > > issues.
> > >
> > > > I would be very grateful if you could help me with the issue
> > > > described below.
> > >
> > > > Generally the question refers to MIME encoding of a text part.
> > > > Particularly to the following case:
> > > > Content-Type: text/plain; charset="iso-10646-ucs-2"
> > > > Content-Transfer-Encoding: ...
> > >
> > > This, I'm afraid, is an illegal combination of elements. Specifically, any
> > > material with a top level media type of "text" has to represent carriage
> > > return/line feed as the literal octet sequence 0x0D 0x0A (decimal 13 10).
> > > iso-10646-ucs-2 clearly does not do this, and as such is a charset that's
> > > not suited for use with MIME text.
> > >
> > > This requirement is spelled out in RFC 2046 section 4.1.1.
> > I think it is not the case. Content-Transfer-Encoding header has to take
> > care of CRLF handling.
> > It is specified in RFC2046, 4.1.2.
>
> On the contrary, section 4.1.2 in fact reiterates the CRLF requirement in that
> it discusses how the charsets can be used with other top level types "with the
> CRLF/line break restriction removed".
>
> > I left empty space for this parameter, but generally it is BASE64 in this
> > case.
>
> The restrictions on the text top level type are completely independent of what
> content-transfer-encoding is used. It is also true that the domains of various
> content-transfer-encodings are restricted in various ways, including but not
> limited to the use of CRLFs, but this has nothing to do with the restrictions
> on the text top level type.
>
> > >
> > > > Data
> > >
> > > > Data after decoding: 0xFF 0xFE 0x66 0x00 0x65 0x00
> > >
> > > > Outlook Express decodes it to the "fe" string. But there are people who say
> > > > that this is robustness of Outlook Express and that the string is not
> > > > properly encoded, because at the time when <charset="iso-10646-ucs-2"> was
> > > > specified/assigned with IANA the byte order mark (BOM) did not exist.
> > >
> > > I don't know if there are specific rules for handling revisions to
> > > iso-10646-ucs-2 or not. I suspect not. However, the general rule is that
> > > additions to a charset repertoire are expected and allowed. See RFC 2279
> > > section 3. However, the BOM is something of a special case.
> > >
> > This is a good argument to me.
> > > But given the far more egregious violation going on here I really don't
> > > think this is particularly important in the overall scheme of things.
> > >
> > The above violation is not the case here, I think.
>
> I'm sorry, but it most certainly is the case here. Indeed, there would be no
> point in having the labelling of whether or not a given charset was "suitable
> for use in MIME text" if it weren't for this restriction.
>
> This is a case where the standards are clear, the standards clearly reflect the
> intent of the group that developed them, and the registration requirements now
> reflect the restrictions put in place by the standards.
>
> You can even see this in action in the registration of things like UTF-16LE.
> RFC 2781 section A.2 contains the registration for this charset, and among
> other things it says "Suitable for use in MIME content types under the "text"
> top-level type: No"
>
> Unfortunately iso-10646-ucs-2 was registered before the rules were in place to
> call this out in registrations, but that doesn't mean it is suitable for use in
> MIME text.
>
> > So the question remains:
> > can I use <charset="iso-10646-ucs-2"> for the data containing BOM?
>
> And the answer remains that for material with a top level content type of text,
> which is what you said you were dealing with, you cannot use this charset at
> all. As such, any handling of it is possible, up to and including rejection of
> the message as invalid.
>
> For material that isn't labelled with a top level content type of text I don't
> think the situation is clear, but the intent has always been to allow additions
> to charsets subsequent to registration. So I think BOM should be supported in
> this context.
>
> 				Ned
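
A minimal Python sketch of the two points discussed in this thread, assuming
the six octets quoted above are the complete decoded body. It is only an
illustration and not part of either author's message. It shows that the
leading 0xFF 0xFE acts as a little-endian byte order mark, and that a CRLF
encoded in a 16-bit charset does not appear as the adjacent octets 0x0D 0x0A.
All variable names are hypothetical.

```python
# Octets from the thread: BOM (FF FE), then "f" and "e" in little-endian order.
body = bytes([0xFF, 0xFE, 0x66, 0x00, 0x65, 0x00])

# Python's "utf-16" codec uses the BOM to pick the byte order and drops it.
with_bom_handling = body.decode("utf-16")

# "utf-16-le" assumes the byte order, so the BOM stays in the text as U+FEFF.
without_bom_handling = body.decode("utf-16-le")

# A line break in a 16-bit encoding is 0D 00 0A 00 (LE) or 00 0D 00 0A (BE),
# so the adjacent octets 0D 0A do not occur for this input.
line_le = "a\r\nb".encode("utf-16-le")
line_be = "a\r\nb".encode("utf-16-be")
crlf_literal_present = b"\r\n" in line_le or b"\r\n" in line_be

print(repr(with_bom_handling))
print(repr(without_bom_handling))
print(line_le.hex())
print(crlf_literal_present)
```

The check covers only this sample string; it does not show that the octet pair
0x0D 0x0A can never occur in 16-bit encoded text.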
Received on Tuesday, 3 December 2002 22:43:49 UTC