RE: Comments on draft-yergeau-rfc2279bis-00.txt

Hi Francois,

Please look at RFC 2640 "Internationalization of FTP" (July 1999,
currently at Proposed Standard status), which says:


2.1 International Character Set

   The character set defined for international support of FTP SHALL be
   the Universal Character Set as defined in ISO 10646:1993 as amended.
   This standard incorporates the character sets of many existing
   international, national, and corporate standards. ISO/IEC 10646
   defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
   four byte (31 bit) encoding containing 2**31 code positions divided
   into 128 groups of 256 planes. Each plane consists of 256 rows of 256
   cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
   zero or the Basic Multilingual Plane (BMP).  Currently, no codesets
   have been defined outside of the 2 byte BMP.

   The Unicode standard version 2.0 [UNICODE] is consistent with the
   UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
   includes the repertoire of IS 10646 characters, amendments 1-7 of IS
   10646, and editorial and technical corrigenda.


2.2 Transfer Encoding

   UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
   or UTF-FSS, SHALL be used as a transfer encoding to transmit the
   international character set. UTF-8 is a file safe encoding which
   avoids the use of byte values that have special significance during
   the parsing of pathname character strings. UTF-8 is an 8 bit encoding
   of the characters in the UCS. Some of UTF-8's benefits are that it is
   compatible with 7 bit ASCII, so it doesn't affect programs that give
   special meanings to various ASCII characters; it is immune to
   synchronization errors; its encoding rules allow for easy
   identification; and it has enough space to support a large number of
   character sets.

<...snip...more description of the details and virtues of UTF-8...>


3.2 Servers compliance

   - Servers MUST support the UTF-8 feature in response to the FEAT
     command [RFC2389]. The UTF-8 feature is a line containing the exact
     string "UTF8". This string is not case sensitive, but SHOULD be
     transmitted in upper case. The response to a FEAT command SHOULD
     be:

        C> feat
        S> 211- <any descriptive text>
        S>  ...
        S>  UTF8
        S>  ...
        S> 211 end

   The ellipses indicate placeholders where other features may be
   included, but are NOT REQUIRED. The one space indentation of the
   feature lines is mandatory [RFC2389].


Such an FTP server explicitly negotiates with the FTP client that they
BOTH support UTF-8 for the transfer encoding.  It then becomes the
responsibility of the CLIENT to convert legacy encodings to UTF-8
before the transfer.  The target system will receive and (hopefully)
store the transferred file in UTF-8.
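
As a rough illustration of that negotiation, here is a minimal sketch
(not taken from RFC 2640; the helper names server_supports_utf8 and
to_utf8, and the choice of ISO-8859-1 as the legacy encoding, are
assumptions made up for the example).  A client could scan the FEAT
reply for the UTF8 feature line and transcode before sending:

```python
def server_supports_utf8(feat_reply: str) -> bool:
    """Return True if a multi-line FEAT reply lists the UTF8 feature."""
    for line in feat_reply.splitlines():
        # Feature lines are indented by one space [RFC2389];
        # the 211 status lines are not.
        if line.startswith(" ") and line[1:].strip().upper() == "UTF8":
            return True
    return False


def to_utf8(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Transcode bytes from an assumed legacy encoding to UTF-8."""
    return data.decode(legacy_encoding).encode("utf-8")


# Hypothetical FEAT reply in the shape shown in section 3.2 above.
reply = "211- Features supported\r\n MDTM\r\n UTF8\r\n SIZE\r\n211 end\r\n"
payload = None
if server_supports_utf8(reply):
    payload = to_utf8(b"caf\xe9")  # legacy bytes, assumed ISO-8859-1
```

The comparison is case-insensitive because the RFC text says the
feature string is not case sensitive.  The client still has to know
the legacy encoding by some other means; nothing in the exchange
supplies it.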

Cheers,
- Ira McDonald
  High North Inc


-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com]
Sent: Friday, October 04, 2002 3:53 PM
To: ietf-charsets@iana.org
Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt


Martin Duerst wrote:
> As far as I understand most contributions on the list in the past
> day or so, the standard should discourage the BOM, but it currently
> doesn't.

That much is clear.  It seems there will have to be a draft-03 with some
additional language in that direction.
 

> > > UTF-8 never needs a 'byte-order' signature.
> >
> >This is unfortunately not true, except in the limited realm 
> >of properly internationalized protocols
> 
> As for example IETF protocols.

Errr, some IETF protocols.  I have no way to tell an FTP server the
charset of a file I'm uploading, nor does the server have any way of telling
me the charset of a file I'm downloading.  And even if it had a way (as in
HTTP), the server most probably wouldn't know and would either not tell or
lie.
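
Absent any such metadata, about all a receiver can do is sniff the
content.  A minimal sketch of that (the function name sniff_utf8 is
made up for illustration, and a clean decode is only a heuristic, not
proof of the charset):

```python
import codecs


def sniff_utf8(data: bytes) -> str:
    """Classify bytes as 'utf-8-bom', 'utf-8' (decodes cleanly), or 'unknown'."""
    if data.startswith(codecs.BOM_UTF8):  # the signature bytes EF BB BF
        return "utf-8-bom"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    # Pure ASCII also lands here, since ASCII is valid UTF-8.
    return "utf-8"
```

A file in a legacy 8-bit charset will usually fail the decode and come
back as "unknown", which is exactly the case where a signature or
out-of-band label would have helped.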

-- 
François

Received on Friday, 4 October 2002 17:30:01 UTC