- From: Martin Duerst <duerst@w3.org>
- Date: Sat, 05 Oct 2002 14:48:01 +0900
- To: "McDonald, Ira" <imcdonald@sharplabs.com>, "'Francois Yergeau'" <FYergeau@alis.com>, ietf-charsets@iana.org
Hello Ira,
My understanding of RFC 2640 is that it uses UTF-8 for file/path
names, and for language-negotiated messages, but not for the encoding
of the actual file contents. This is what was discussed when work
on this RFC (to which I contributed) was going on, but it can also
be read from the snippets of text extracted below (you won't find
any corresponding text for file contents).
Regards, Martin.
For support of global compatibility it
is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
when exchanging pathnames. Clients and servers are, however, under
no obligation to perform any conversion on the contents of a file for
operations such as STOR or RETR.
The character set used to store files SHALL remain a local decision
and MAY depend on the capability of local operating systems. Prior to
the exchange of pathnames they SHOULD be converted into a ISO/IEC
10646 format and UTF-8 encoded. This approach, while allowing
international exchange of pathnames, will still allow backward
compatibility with older systems because the code set positions for
ASCII characters are identical to the one byte sequence in UTF-8.
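The backward-compatibility argument above can be checked directly. The following is only a minimal sketch in Python; the two pathnames are made up for illustration and do not come from RFC 2640.

```python
# Hypothetical pathnames, chosen only for illustration.
ascii_path = "/pub/readme.txt"
intl_path = "/pub/r\u00e9sum\u00e9.txt"  # contains U+00E9 twice

# An ASCII-only pathname is byte-for-byte identical in ASCII and UTF-8,
# so an older 7-bit peer sees exactly what it always saw.
assert ascii_path.encode("utf-8") == ascii_path.encode("ascii")

# Non-ASCII characters become multi-byte sequences in which every byte
# has the high bit set, so they cannot be mistaken for ASCII bytes
# such as the "/" separator.
encoded = intl_path.encode("utf-8")
non_ascii_bytes = [b for b in encoded if b >= 0x80]
slash_count = encoded.count(b"/")
```

Only the pathname is affected here; nothing in this sketch touches file contents.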
- The 7-bit restriction for pathnames exchanged is dropped.
- Conforming clients and servers MUST support UTF-8 for the transfer
and receipt of pathnames. Clients and servers MAY in addition give
users a choice of specifying interpretation of pathnames in another
encoding. Note that configuring clients and servers to use
character sets / encoding other than UTF-8 is outside of the scope
of this document. While it is recognized that in certain
operational scenarios this may be desirable, this is left as a
quality of implementation and operational issue.
- Pathnames are sequences of bytes. The encoding of names that are
valid UTF-8 sequences is assumed to be UTF-8. The character set of
other names is undefined. Clients and servers, unless otherwise
configured to support a specific native character set, MUST check
for a valid UTF-8 byte sequence to determine if the pathname being
presented is UTF-8.
- To avoid data loss, clients and servers SHOULD use the UTF-8
encoded pathnames when unable to convert them to a usable code set.
- Clients which do not require display of pathnames are under no
obligation to do so. Non-display clients do not need to conform to
requirements associated with display.
- Clients, which are presented UTF-8 pathnames by the server, SHOULD
parse UTF-8 correctly and attempt to display the pathname within
the limitation of the resources available.
- Clients MUST support the FEAT command and recognize the "UTF8"
feature (defined in 3.2 above) to determine if a server supports
UTF-8 encoding.
- Character semantics of other names shall remain undefined. If a
client detects that a server is non UTF-8, it SHOULD change its
display appropriately. How a client implementation handles non
UTF-8 is a quality of implementation issue. It MAY try to assume
some other encoding, give the user a chance to try to assume
something, or save encoding assumptions for a server from one FTP
session to another.
- Many existing clients interpret 8-bit pathnames as being in the
local character set. They MAY continue to do so for pathnames that
are not valid UTF-8.
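Two of the client requirements above, recognizing the "UTF8" feature in a FEAT reply and checking whether a pathname is a valid UTF-8 byte sequence, could be sketched roughly as follows. This is an illustration under assumptions, not text from the RFC: the function names are hypothetical, and the Latin-1 fallback is just one possible "local character set", which the RFC leaves to the implementation.

```python
def server_supports_utf8(feat_reply_lines):
    """Return True if a FEAT reply lists the UTF8 feature.

    Feature lines are indented by one space and the feature name is
    not case sensitive. (Hypothetical helper for this sketch.)
    """
    for line in feat_reply_lines:
        if line.startswith(" ") and line.strip().upper() == "UTF8":
            return True
    return False


def decode_pathname(raw, local_charset="latin-1"):
    """Decode a pathname received as bytes.

    A valid UTF-8 sequence is assumed to be UTF-8. Anything else falls
    back to local_charset, which is an assumption of this sketch; the
    character set of such names is undefined.
    Returns a (text, encoding_used) pair.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(local_charset, errors="replace"), local_charset
```

A client might call `server_supports_utf8` once after login and `decode_pathname` on every name it displays. Neither function is ever applied to the data transferred by STOR or RETR.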
Greetings and responses issued prior to language negotiation SHALL be
in the server's default language. Paragraph 4.5 of [RFC2277] state
that this "default language MUST be understandable by an English-
speaking person". This specification RECOMMENDS that the server
default language be English encoded using ASCII. This text may be
augmented by text from other languages. Once negotiated, server-PI
MUST return server messages and textual part of command responses in
the negotiated language and encoded in UTF-8. Server-PI MAY wish to
re-send previously issued server messages in the newly negotiated
language.
At 14:28 02/10/04 -0700, McDonald, Ira wrote:
>Hi Francois,
>
>Please look at RFC 2640 "Internationalization of FTP" (July 1999,
>Proposed Std status currently), which says:
>
>
>2.1 International Character Set
>
> The character set defined for international support of FTP SHALL be
> the Universal Character Set as defined in ISO 10646:1993 as amended.
> This standard incorporates the character sets of many existing
> international, national, and corporate standards. ISO/IEC 10646
> defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
> four byte (31 bit) encoding containing 2**31 code positions divided
> into 128 groups of 256 planes. Each plane consists of 256 rows of 256
> cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
> zero or the Basic Multilingual Plane (BMP). Currently, no codesets
> have been defined outside of the 2 byte BMP.
>
> The Unicode standard version 2.0 [UNICODE] is consistent with the
> UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
> includes the repertoire of IS 10646 characters, amendments 1-7 of IS
> 10646, and editorial and technical corrigenda.
>
>
>2.2 Transfer Encoding
>
> UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
> or UTF-FSS, SHALL be used as a transfer encoding to transmit the
> international character set. UTF-8 is a file safe encoding which
> avoids the use of byte values that have special significance during
> the parsing of pathname character strings. UTF-8 is an 8 bit encoding
> of the characters in the UCS. Some of UTF-8's benefits are that it is
> compatible with 7 bit ASCII, so it doesn't affect programs that give
> special meanings to various ASCII characters; it is immune to
> synchronization errors; its encoding rules allow for easy
> identification; and it has enough space to support a large number of
> character sets.
>
><...snip...more description of the details and virtues of UTF-8...>
>
>
>3.2 Servers compliance
>
> - Servers MUST support the UTF-8 feature in response to the FEAT
> command [RFC2389]. The UTF-8 feature is a line containing the exact
> string "UTF8". This string is not case sensitive, but SHOULD be
> transmitted in upper case. The response to a FEAT command SHOULD
> be:
>
> C> feat
> S> 211- <any descriptive text>
> S>  ...
> S>  UTF8
> S>  ...
> S> 211 end
>
> The ellipses indicate placeholders where other features may be
> included, but are NOT REQUIRED. The one space indentation of the
> feature lines is mandatory [RFC2389].
>
>
>Such an FTP server explicitly negotiates with the FTP client that they
>BOTH support UTF-8 for the transfer encoding. It thus becomes the
>responsibility of the CLIENT to previously convert legacy encodings
>to UTF-8. The target system will receive and (hopefully) store the
>transferred file in UTF-8.
>
>Cheers,
>- Ira McDonald
> High North Inc
>
>
>-----Original Message-----
>From: Francois Yergeau [mailto:FYergeau@alis.com]
>Sent: Friday, October 04, 2002 3:53 PM
>To: ietf-charsets@iana.org
>Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt
>
>
>Martin Duerst wrote:
> > As far as I understand most contributions on the list in the past
> > day or so, the standard should discourage the BOM, but it currently
> > doesn't.
>
>That much is clear. It seems there will have to be a draft-03 with some
>additional language in that direction.
>
>
> > > > UTF-8 never needs a 'byte-order' signature.
> > >
> > >This is unfortunately not true, except in the limited realm
> > >of properly internationalized protocols
> >
> > As for example IETF protocols.
>
>Errr, some IETF protocols. I have no way to tell an FTP server what is the
>charset of a file I'm uploading, nor does the server have any way of telling
>me the charset of a file I'm downloading. And even if it had a way (like in
>HTTP), the server most probably wouldn't know and would either not tell or
>lie.
>
>--
>François
Received on Saturday, 5 October 2002 02:15:47 UTC