Comment on draft-hoffman-file-uri-03.txt

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 18 Aug 2005 17:53:08 +0900
To: uri@w3.org, Paul Hoffman <phoffman@imc.org>, Ted Hardie <hardie@qualcomm.com>
Cc: Dan Connolly <connolly@w3.org>

Hello Paul, Ted, others,

Here is a comment regarding

This draft is listed as AD Evaluation::AD Followup at

If this comment is late for actual drafting, please consider
it as part of IETF Last Call.

The draft says:

3.4  Character sets and encodings

    Local file systems sometimes use many different encodings for
    representing file names.  For interoperability sake, it would be
    preferable for file: URI libraries to translate the native character
    encoding for file names to and from Unicode.

This is a step in the right direction, but somewhat inaccurate.
I'll list the problems first, and then propose some new text.
There are several problems:

1) Some local file systems indeed use many different encodings for
    representing file names, but on those file systems, transcoding
    filenames to and from Unicode may be very difficult. The typical
    example here is Unix/Linux/... At the OS level, file names are
    byte strings. A user's locale setting (for example, the LANG
    environment variable) defines how these bytes are interpreted as
    characters. Different users' mileage may vary, unless there is a
    convention that is enforced system-wide. (Fortunately, the convention
    of using UTF-8 for filenames is on the rise, in particular for Linux.)

2) "to and from Unicode" is not well defined. UTF-8? UTF-16? UTF-16LE?

3) The above paragraph is written in terms of "file: URI libraries",
    rather than starting from the scheme syntax.
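
To make points 1) and 2) concrete, here is a minimal Python sketch. The
byte string RAW_NAME and the helper names interpret and percent_encode
are hypothetical, chosen only for illustration. It shows that the same
stored bytes yield different characters (or none) depending on the
assumed encoding, and that "Unicode" alone does not determine which
bytes end up percent-encoded.

```python
from urllib.parse import quote

# Hypothetical file name as stored on disk: just bytes, no encoding label.
RAW_NAME = b"\xe9t\xe9.txt"


def interpret(raw, encoding):
    """Return the characters the bytes stand for under `encoding`, or None."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def percent_encode(text, encoding):
    """Percent-encode `text` after serializing it with the given encoding."""
    return quote(text.encode(encoding), safe="")


# Point 1: the interpretation depends on the assumed encoding.
as_latin1 = interpret(RAW_NAME, "latin-1")  # decodes to a readable name
as_utf8 = interpret(RAW_NAME, "utf-8")      # not valid UTF-8, so None

# Point 2: one character, three different percent-encoded forms.
forms = {enc: percent_encode("\u00e9", enc)
         for enc in ("utf-8", "utf-16-le", "utf-16-be")}
```

Under these assumptions, a user whose locale implies Latin-1 sees a
sensible name, while a user whose locale implies UTF-8 sees invalid
data for the very same file.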

Here is proposed replacement text. Any comments welcome!

3.4  Character sets and encodings

    Local file systems use all kinds of specific encodings, and sometimes
    many different encodings, for representing file and directory names.
    For interoperability, it is preferable for file: URIs to use UTF-8
    [STD63] (percent-encoded when necessary) in accordance with Section
    2.5 of [RFC3986] and for compatibility with IRIs [RFC3987].
    Applications creating file: URIs should transcode file and directory
    names to UTF-8. Applications interpreting file: URIs should transcode
    back to the encoding(s) used by the file system. For file systems where
    the encoding used cannot be determined with reasonable reliability,
    the actual byte values used by the file system may have to be directly
    encoded in the file: URI.
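
As a rough illustration of the proposed text, the following Python
sketch handles a single file or directory name. The function names
name_to_segment and segment_to_name and the fs_encoding parameter are
assumptions made for this example; they are not taken from the draft
or from any existing library. Passing None as fs_encoding stands for
"the encoding cannot be determined", in which case the raw bytes are
percent-encoded directly.

```python
from typing import Optional
from urllib.parse import quote, unquote_to_bytes


def name_to_segment(raw: bytes, fs_encoding: Optional[str]) -> str:
    """Creating side: file system bytes -> one file: URI path segment."""
    if fs_encoding is not None:
        try:
            # Transcode from the file system encoding to UTF-8.
            raw = raw.decode(fs_encoding).encode("utf-8")
        except UnicodeDecodeError:
            pass  # Bytes do not fit the encoding: keep them unchanged.
    # Percent-encode everything outside the unreserved set, "/" included.
    return quote(raw, safe="")


def segment_to_name(segment: str, fs_encoding: Optional[str]) -> bytes:
    """Interpreting side: URI path segment -> file system bytes."""
    raw = unquote_to_bytes(segment)
    if fs_encoding is not None:
        try:
            # Transcode from UTF-8 back to the file system encoding.
            raw = raw.decode("utf-8").encode(fs_encoding)
        except UnicodeError:
            pass  # Not UTF-8, or not representable: use the bytes as given.
    return raw
```

For example, on a file system assumed to use Latin-1, the name bytes
E9 74 E9 2E 74 78 74 would become "%C3%A9t%C3%A9.txt", whereas with an
unknown encoding they would become "%E9t%E9.txt".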

I can provide some more text talking about specific systems.

Regards,     Martin. 
Received on Thursday, 18 August 2005 08:53:41 UTC