RE: Comments on draft-yergeau-rfc2279bis-00.txt from Martin Duerst on 2002-10-05 (ietf-charsets@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 05 Oct 2002 14:48:01 +0900
To: "McDonald, Ira" <imcdonald@sharplabs.com>, "'Francois Yergeau'" <FYergeau@alis.com>, ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20021005105052.04216378@localhost>
Hello Ira,

My understanding of RFC 2640 is that it uses UTF-8 for file/path
names and for language-negotiated messages, but not for the encoding
of the actual file contents. This is what was discussed while work
on this RFC (to which I contributed) was under way, and it can also
be read from the snippets of text extracted below (you won't find
any corresponding text for file contents).

Regards,    Martin.


    For support of global compatibility it
    is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
    when exchanging pathnames.  Clients and servers are, however, under
    no obligation to perform any conversion on the contents of a file for
    operations such as STOR or RETR.

    The character set used to store files SHALL remain a local decision
    and MAY depend on the capability of local operating systems. Prior to
    the exchange of pathnames they SHOULD be converted into a ISO/IEC
    10646 format and UTF-8 encoded. This approach, while allowing
    international exchange of pathnames, will still allow backward
    compatibility with older systems because the code set positions for
    ASCII characters are identical to the one byte sequence in UTF-8.


    - The 7-bit restriction for pathnames exchanged is dropped.


    - Conforming clients and servers MUST support UTF-8 for the transfer
      and receipt of pathnames. Clients and servers MAY in addition give
      users a choice of specifying interpretation of pathnames in another
      encoding. Note that configuring clients and servers to use
      character sets / encoding other than UTF-8 is outside of the scope
      of this document. While it is recognized that in certain
      operational scenarios this may be desirable, this is left as a
      quality of implementation and operational issue.

    - Pathnames are sequences of bytes.  The encoding of names that are
      valid UTF-8 sequences is assumed to be UTF-8.  The character set of
      other names is undefined. Clients and servers, unless otherwise
      configured to support a specific native character set, MUST check
      for a valid UTF-8 byte sequence to determine if the pathname being
      presented is UTF-8.

    - To avoid data loss, clients and servers SHOULD use the UTF-8
      encoded pathnames when unable to convert them to a usable code set.


    - Clients which do not require display of pathnames are under no
      obligation to do so. Non-display clients do not need to conform to
      requirements associated with display.

    - Clients, which are presented UTF-8 pathnames by the server, SHOULD
      parse UTF-8 correctly and attempt to display the pathname within
      the limitation of the resources available.

    - Clients MUST support the FEAT command and recognize the "UTF8"
      feature (defined in 3.2 above) to determine if a server supports
      UTF-8 encoding.

    - Character semantics of other names shall remain undefined. If a
      client detects that a server is non UTF-8, it SHOULD change its
      display appropriately. How a client implementation handles non
      UTF-8 is a quality of implementation issue. It MAY try to assume
      some other encoding, give the user a chance to try to assume
      something, or save encoding assumptions for a server from one FTP
      session to another.


    - Many existing clients interpret 8-bit pathnames as being in the
      local character set. They MAY continue to do so for pathnames that
      are not valid UTF-8.
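
As a rough illustration of the pathname rules excerpted above (names
that are valid UTF-8 are taken to be UTF-8, the character set of other
names is undefined), here is a minimal Python sketch. The function name
decode_pathname and the Latin-1 fallback are assumptions made for the
example; RFC 2640 leaves the handling of non-UTF-8 names to the
implementation.

```python
from typing import Tuple


def decode_pathname(raw: bytes, fallback: str = "latin-1") -> Tuple[str, bool]:
    """Return (text, is_utf8) for a pathname received as raw bytes.

    Hypothetical helper: the fallback encoding is a local choice,
    not something the RFC prescribes.
    """
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        return raw.decode("utf-8", errors="strict"), True
    except UnicodeDecodeError:
        # Not valid UTF-8: interpret the bytes in a locally chosen
        # character set (assumed here to be Latin-1).
        return raw.decode(fallback, errors="replace"), False


if __name__ == "__main__":
    print(decode_pathname("résumé.txt".encode("utf-8")))
    print(decode_pathname(b"r\xe9sum\xe9.txt"))
```

Note that the check looks only at the pathname bytes exchanged over the
control connection; it says nothing about the contents of the file.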


    Greetings and responses issued prior to language negotiation SHALL be
    in the server's default language. Paragraph 4.5 of [RFC2277] state
    that this "default language MUST be understandable by an English-
    speaking person". This specification RECOMMENDS that the server
    default language be English encoded using ASCII. This text may be
    augmented by text from other languages. Once negotiated, server-PI
    MUST return server messages and textual part of command responses in
    the negotiated language and encoded in UTF-8. Server-PI MAY wish to
    re-send previously issued server messages in the newly negotiated
    language.
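
The client-side FEAT check mentioned in the excerpts (recognizing the
"UTF8" feature line) could be sketched as follows. This is only an
illustration under stated assumptions: the function name
server_supports_utf8 is hypothetical, and the reply is assumed to have
already been read from the control connection as one string.

```python
def server_supports_utf8(feat_reply: str) -> bool:
    """Return True if a multi-line FEAT reply lists the UTF8 feature.

    Hypothetical helper. Feature lines sit between the opening "211-"
    line and the closing "211 " line, each indented by one space.
    """
    lines = feat_reply.splitlines()
    if len(lines) < 3 or not lines[0].startswith("211-"):
        # Not a multi-line 211 reply, so no feature list to inspect.
        return False
    for line in lines[1:-1]:
        # The feature name is not case sensitive.
        if line.startswith(" ") and line[1:].strip().upper() == "UTF8":
            return True
    return False


if __name__ == "__main__":
    reply = "211- Features supported\r\n MDTM\r\n UTF8\r\n SIZE\r\n211 end\r\n"
    print(server_supports_utf8(reply))
```

A positive result tells the client only that pathnames may be exchanged
in UTF-8, not what encoding any stored file uses.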






At 14:28 02/10/04 -0700, McDonald, Ira wrote:
>Hi Francois,
>
>Please look at RFC 2640 "Internationalization of FTP" (July 1999,
>Proposed Std status currently), which says:
>
>
>2.1 International Character Set
>
>    The character set defined for international support of FTP SHALL be
>    the Universal Character Set as defined in ISO 10646:1993 as amended.
>    This standard incorporates the character sets of many existing
>    international, national, and corporate standards. ISO/IEC 10646
>    defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
>    four byte (31 bit) encoding containing 2**31 code positions divided
>    into 128 groups of 256 planes. Each plane consists of 256 rows of 256
>    cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
>    zero or the Basic Multilingual Plane (BMP).  Currently, no codesets
>    have been defined outside of the 2 byte BMP.
>
>    The Unicode standard version 2.0 [UNICODE] is consistent with the
>    UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
>    includes the repertoire of IS 10646 characters, amendments 1-7 of IS
>    10646, and editorial and technical corrigenda.
>
>
>2.2 Transfer Encoding
>
>    UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
>    or UTF-FSS, SHALL be used as a transfer encoding to transmit the
>    international character set. UTF-8 is a file safe encoding which
>    avoids the use of byte values that have special significance during
>    the parsing of pathname character strings. UTF-8 is an 8 bit encoding
>    of the characters in the UCS. Some of UTF-8's benefits are that it is
>    compatible with 7 bit ASCII, so it doesn't affect programs that give
>    special meanings to various ASCII characters; it is immune to
>    synchronization errors; its encoding rules allow for easy
>    identification; and it has enough space to support a large number of
>    character sets.
>
><...snip...more description of the details and virtues of UTF-8...>
>
>
>3.2 Servers compliance
>
>    - Servers MUST support the UTF-8 feature in response to the FEAT
>      command [RFC2389]. The UTF-8 feature is a line containing the exact
>      string "UTF8". This string is not case sensitive, but SHOULD be
>      transmitted in upper case. The response to a FEAT command SHOULD
>      be:
>
>         C> feat
>         S> 211- <any descriptive text>
>         S>  ...
>         S>  UTF8
>         S>  ...
>         S> 211 end
>
>    The ellipses indicate placeholders where other features may be
>    included, but are NOT REQUIRED. The one space indentation of the
>    feature lines is mandatory [RFC2389]."
>
>
>Such an FTP server explicitly negotiates with the FTP client that they
>BOTH support UTF-8 for the transfer encoding.  It thus becomes the
>responsibility of the CLIENT to previously convert legacy encodings
>to UTF-8.  The target system will receive and (hopefully) store the
>transferred file in UTF-8.
>
>Cheers,
>- Ira McDonald
>   High North Inc
>
>
>-----Original Message-----
>From: Francois Yergeau [mailto:FYergeau@alis.com]
>Sent: Friday, October 04, 2002 3:53 PM
>To: ietf-charsets@iana.org
>Subject: RE: Comments on draft-yergeau-rfc2279bis-00.txt
>
>
>Martin Duerst wrote:
> > As far as I understand most contributions on the list in the past
> > day or so, the standard should discourage the BOM, but it currently
> > doesn't.
>
>That much is clear.  It seems there will have to be a draft-03 with some
>additional language in that direction.
>
>
> > > > UTF-8 never needs a 'byte-order' signature.
> > >
> > >This is unfortunately not true, except in the limited realm
> > >of properly internationalized protocols
> >
> > As for example IETF protocols.
>
>Errr, some IETF protocols.  I have no way to tell an FTP server what is the
>charset of a file I'm uploading, nor does the server have any way of telling
>me the charset of a file I'm downloading.  And even if it had a way (like in
>HTTP), the server most probably wouldn't know and would either not tell or
>lie.
>
>--
>François
Received on Saturday, 5 October 2002 02:15:47 UTC