- From: McDonald, Ira <imcdonald@sharplabs.com>
- Date: Fri, 04 Oct 2002 14:28:35 -0700
- To: "'Francois Yergeau'" <FYergeau@alis.com>, ietf-charsets@iana.org
Hi Francois,

Please look at RFC 2640 "Internationalization of FTP" (July 1999, Proposed Std status currently), which says:

"2.1 International Character Set

The character set defined for international support of FTP SHALL be the Universal Character Set as defined in ISO 10646:1993 as amended. This standard incorporates the character sets of many existing international, national, and corporate standards. ISO/IEC 10646 defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a four byte (31 bit) encoding containing 2**31 code positions divided into 128 groups of 256 planes. Each plane consists of 256 rows of 256 cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane zero or the Basic Multilingual Plane (BMP). Currently, no codesets have been defined outside of the 2 byte BMP.

The Unicode standard version 2.0 [UNICODE] is consistent with the UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0 includes the repertoire of IS 10646 characters, amendments 1-7 of IS 10646, and editorial and technical corrigenda.

2.2 Transfer Encoding

UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2 or UTF-FSS, SHALL be used as a transfer encoding to transmit the international character set. UTF-8 is a file safe encoding which avoids the use of byte values that have special significance during the parsing of pathname character strings. UTF-8 is an 8 bit encoding of the characters in the UCS. Some of UTF-8's benefits are that it is compatible with 7 bit ASCII, so it doesn't affect programs that give special meanings to various ASCII characters; it is immune to synchronization errors; its encoding rules allow for easy identification; and it has enough space to support a large number of character sets.

<...snip...more description of the details and virtues of UTF-8...>

3.2 Servers compliance

- Servers MUST support the UTF-8 feature in response to the FEAT command [RFC2389]. The UTF-8 feature is a line containing the exact string "UTF8". This string is not case sensitive, but SHOULD be transmitted in upper case. The response to a FEAT command SHOULD be:

    C> feat
    S> 211- <any descriptive text>
    S>  ...
    S>  UTF8
    S>  ...
    S> 211 end

The ellipses indicate placeholders where other features may be included, but are NOT REQUIRED. The one space indentation of the feature lines is mandatory [RFC2389]."

Such an FTP server explicitly negotiates with the FTP client that they BOTH support UTF-8 for the transfer encoding. It thus becomes the responsibility of the CLIENT to previously convert legacy encodings to UTF-8. The target system will receive and (hopefully) store the transferred file in UTF-8.

Cheers,
- Ira McDonald
  High North Inc

-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com]
Sent: Friday, October 04, 2002 3:53 PM
To: ietf-charsets@iana.org
Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt

Martin Duerst wrote:
> As far as I understand most contributions on the list in the past
> day or so, the standard should discourage the BOM, but it currently
> doesn't.

That much is clear. It seems there will have to be a draft-03 with some additional language in that direction.

> > > UTF-8 never needs a 'byte-order' signature.
> >
> >This is unfortunately not true, except in the limited realm
> >of properly internationalized protocols
>
> As for example IETF protocols.

Errr, some IETF protocols. I have no way to tell an FTP server what is the charset of a file I'm uploading, nor does the server have any way of telling me the charset of a file I'm downloading. And even if it had a way (like in HTTP), the server most probably wouldn't know and would either not tell or lie.

--
François
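
Illustrative note: the FEAT negotiation quoted from RFC 2640 above, and the client-side conversion it implies, can be sketched roughly as follows. This is only a minimal sketch, not taken from either RFC. The function names `server_supports_utf8` and `to_utf8` are hypothetical, and the sketch assumes the client already has the full FEAT reply as text and already knows the legacy charset of its data.

```python
def server_supports_utf8(feat_response: str) -> bool:
    """Return True if a FEAT reply lists the UTF8 feature.

    Hypothetical helper: assumes feat_response holds the complete
    multi-line reply, e.g. "211- ...\r\n UTF8\r\n211 end".
    """
    for line in feat_response.splitlines():
        # Feature lines are indented by a space; the 211 status lines
        # that open and close the reply are not.
        if not line.startswith(" "):
            continue
        # The feature string is not case sensitive, so compare in
        # upper case.
        if line.strip().upper() == "UTF8":
            return True
    return False


def to_utf8(data: bytes, legacy_encoding: str) -> bytes:
    """Transcode bytes from a known legacy charset to UTF-8.

    Hypothetical helper: the caller must already know legacy_encoding;
    nothing in the FTP exchange supplies it.
    """
    return data.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    reply = "211- Features supported\r\n MDTM\r\n UTF8\r\n211 end"
    if server_supports_utf8(reply):
        print(to_utf8(b"Fran\xe7ois", "latin-1"))
```

Note that the conversion step only works because the legacy charset is passed in explicitly, which is the gap François points out: the protocol itself gives neither side a way to state it.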
Received on Friday, 4 October 2002 17:30:01 UTC