Re: Registration of new charsets UTF-32, UTF-32BE, UTF32LE from Mark Davis on 2001-05-11 (ietf-charsets@w3.org from April to June 2001)

From: Mark Davis <markdavis34@home.com>
Date: Fri, 11 May 2001 07:31:33 -0700
To: Harald Tveit Alvestrand <harald@alvestrand.no>, Mark Davis <mark.davis@us.ibm.com>
Cc: ned.freed@mrochek.com, ietf-charsets@iana.org
Message-id: <007c01c0da27$13dd16d0$0c680b41@c1340594a>

Thanks for your feedback. I will resubmit them.

Comments:

A. If each charset needs to be in a separate message, then you really ought
to fix http://www.normos.org/ietf/bcp/bcp19.txt. It says:

"5.  Charset Registration Template

     To: ietf-charsets@iana.org
     Subject: Registration of new charset [names]"

with the word "names" in plural. This is misleading.

B. UTF-32 in Unicode, as with UTF-16, could be BOM-less, with the
orientation being determined by a higher-level protocol. The IETF
registration (with good reason!) can impose a further restriction, as it
does with UTF-16, that BOM-less UTF-16 must be BE. I will put such a clause
in the registration.

Mark



----- Original Message -----
From: "Harald Tveit Alvestrand" <harald@alvestrand.no>
To: "Mark Davis" <mark.davis@us.ibm.com>
Cc: <ned.freed@mrochek.com>; <ietf-charsets@iana.org>
Sent: Thursday, May 10, 2001 23:52
Subject: Re: Registration of new charsets UTF-32, UTF-32BE, UTF32LE


> At 14:33 09.05.2001 -0700, Mark Davis wrote:
>
> >Sorry, I missed that. Do you want me to resubmit, or could you just make
> >that change?
>
> Resubmit.
>
> note: each charset should have its own registration form.
>
> BTW, TR19 is technically broken in its definition of UTF-32: it specifies
> that an UTF-8 character stream MAY OR MAY NOT begin with a Byte Order
Mark,
> and that octets can be in any order.
>
> >D36c
> >        (a) UTF-32 is the Unicode Transformation Format that serializes a
> > Unicode code point as a sequence of four bytes, in either big-endian or
> > little-endian format. An initial sequence corresponding to U+FEFF is
> > interpreted as a byte order mark: it is used to distinguish between the
> > two byte orders. The byte order mark is not considered part of the
> > content of    the text. A serialization of Unicode code points into
> > UTF-32 may or may not begin with a byte order mark.
>
> This allows (when taking exquisite care - you only have 4.1 bits that are
> valid in both upper and lower halves of the 32-bit word) the construction
> of octet sequences that are ambiguous.
>
> If either the specification or the registration had said "A serialization
> of Unicode code points into UTF-32 that does not begin with a byte order
> mark MUST be in Big Endian", I would not have protested. But this is,
IMHO,
> just too broken to be registered as a charset.
>
> As written, I OPPOSE the registration of UTF-32.
> (Apologies for having missed it at Unicode standardization time - we saw
it
> coming, and did not catch it in time)
>
>                 Harald
>

Received on Friday, 11 May 2001 12:37:10 UTC