ISO-2022 (was: UTF-7 and java) from Carl W. Brown on 2001-08-28 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Tue, 28 Aug 2001 12:51:01 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGGEKJCIAA.cbrown@xnetinc.com>
Andrea,

I guess my frustration is that I was coding string handling routines for
different forms of Unicode and code page data.  The problem with a stateful
encoding like ISO-2022 is that it is not suited to string manipulation.  If
I have a pointer to anything other than the beginning of the buffer, the data
is useless: the pointer may have skipped over escape sequences and SO/SI
characters, so the bytes at the pointer cannot be interpreted on their own.

So I figure that all I can do is return an error from the string handling
routines if the data is in one of these code pages.  It would, however, be
nice to at least be able to validate the data in the buffer to see whether
it is conformant.  Sure, I can piece through the ISO-2022-JP, ISO-2022-KR
and ISO-2022-CN standards.  However, I also see implementations using
escape sequences for French, German, etc.

I decided that, since the data is not very useful to an application until it
is converted, you can supply language and version information to the
ISO-2022 converter and let it decide which escape sequences to honor.
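
As a rough sketch of that interface, the Python below picks a codec from a
language and a version number.  The table, the version numbering and the
function name are assumptions made for illustration; they do not describe
any particular converter.  As far as I know, Python's standard library has
no ISO-2022-CN codec, so Chinese is left out.

```python
# Hypothetical table: (language, version) -> Python codec name.
CODECS = {
    ("ja", 0): "iso2022_jp",    # RFC 1468
    ("ja", 1): "iso2022_jp_1",  # RFC 2237, adds JIS X 0212
    ("ja", 2): "iso2022_jp_2",  # RFC 1554
    ("ko", 0): "iso2022_kr",    # RFC 1557
}


def convert_iso2022(buf: bytes, language: str, version: int = 0) -> str:
    """Decode an ISO-2022 buffer, honoring only the escape sequences
    that belong to the requested language and version."""
    try:
        codec = CODECS[(language, version)]
    except KeyError:
        raise ValueError(
            f"no ISO-2022 variant for {language!r} version {version}"
        ) from None
    # Strict decoding: anything the chosen codec cannot interpret raises
    # UnicodeDecodeError instead of being passed through.
    return buf.decode(codec)


japanese = convert_iso2022("日本語".encode("iso2022_jp"), "ja")
korean = convert_iso2022("한국어".encode("iso2022_kr"), "ko")
print(japanese, korean)
```

The application only ever sees the converted string, so the string
handling routines never have to deal with the stateful form.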

Carl



> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> Sent: Tuesday, August 28, 2001 12:10 PM
> To: Barry Caplan; www-international@w3.org
> Subject: Re: UTF-7 and java
>
>
> I just want to clear up a few things about mail:
>
> Headers are purportedly restricted to 7-bit (RFC 822, Section 3.1.2).
> This is not always adhered to, especially in the Subject header and
> comments in the To, From, Cc headers.  Many Japanese use mail clients
> which do not follow MIME standards, as I have discovered.  I believe
> this is not uncommon in Asia.
>
> RFC 1468 - Japanese Character Encoding for Internet messages,
> ISO-2022-JP for Japanese emails, covers JIS X 0201 (except no
> half-width katakana) and JIS X 0208.
>
> RFC 1557 - Korean Character Encoding for Internet messages,
> ISO-2022-KR plus EUC-KR; that is, ISO-2022-KR for the body and EUC-KR
> for the headers.  This is an informational RFC.  Of course, in MIME,
> the headers would be formatted using RFC 2047 encoded-words.
>
> RFC 1922 - Chinese Character Encoding for Internet messages,
> ISO-2022-CN.  This is meant to include both the first two planes of
> CNS 11643 (roughly, Traditional Chinese characters) and GB 2312
> (roughly, Simplified Chinese characters) using not just escape
> sequences but also shift states.  It is complex to use, and therefore
> is not often seen.  It, too, is an informational RFC.
>
> RFC 2237 - Japanese Character Encoding for Internet messages,
> ISO-2022-JP-1, similar to ISO-2022-JP but adds a new escape sequence
> which includes JIS X 0212.  It is an informational RFC.  I have never
> encountered this charset.
>
> IMAP folder names are in Modified UTF-7, which is not the same as
> UTF-7.  Yes, it is similar, but in programming, similar doesn't work.
>
> If you want to read an RFC, they are always available at:
>
> http://www.ietf.org/rfc/rfcNNNN.txt
>
> where NNNN is the number of the RFC, and is variable length.  So, for
> example, you can read RFC 822 at http://www.ietf.org/rfc/rfc822.txt
> and RFC 1468 at http://www.ietf.org/rfc/rfc1468.txt .
>
> Andrea
> iPlanet i18n architect
> "The devil is in the details, folks."
>
Received on Tuesday, 28 August 2001 15:50:57 UTC