'file' URI conventions

Graham Klyne wrote:
> I've done some implementation work recently that figures a URI from a 
> filename and vice versa.  For Unix-like systems, I think the correspondence 
> is pretty clear, but for Windows I needed to engage in some guesswork about 
> how to deal with device (drive) names.  Part of my code looks like this:
> [[
>      -- strip off leading '/' from Windows drive name
>      source      = fileuripath (path uri)
>      fileuripath ('/':file@(d:':':more)) | driveLetter d = file
>      fileuripath file = file
>      driveLetter d = d `elem` ['A'..'Z']
> ]]
> That is, on windows systems,  FILE://localhost/D:/dir/file is treated as a 
> reference to file D:\dir\file on the current host system.   But other 
> software I have seen in the past uses '|' in place of the ':'.  I'm not 
> sure what is the current preferred approach.

This topic is worthy of a separate thread, so I'm spinning it off now.

If we could get more implementers to agree on best practices for converting 
OS-specific filesystem paths to URIs *and* vice-versa, it would be a Good 
Thing.

First, a couple of references:

  RFC 1738:
  http://www.ietf.org/rfc/rfc1738
  Summarizes the format of several URL schemes, two of which have been
  made obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto). Also provides
  generic URL syntax and related rules that have been superseded by RFCs
  1808 and 2396. The 'file' scheme defined here is still in effect.

  RFC 1738bis (current draft):
  http://www.ietf.org/internet-drafts/draft-hoffman-rfc1738bis-02.txt
  (I think the date in it is supposed to say April 19, 2004, not 2003,
  since the previous draft was dated October 2003). An attempt to
  preserve and update the URL scheme summaries that have not been made
  obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto).

  RFC 2396bis (current draft):
  http://www.gbiv.com/protocols/uri/rev-2002/rfc2396bis.html
  I think we are all familiar with this one.

Second, the 'file' URI scheme as defined in RFC 1738 leaves much up to the 
interpreter. It might be argued that it has no choice but to leave things 
ambiguous, because IETF RFCs apply only to the Internet, while the 'file' 
scheme is defined as a non-Internet protocol. It might very well be beyond the 
scope of an RFC to mandate how to derive a URI from an OS-specific path and 
vice-versa (I've no idea if this is the case, I'm just saying...)

Third, things are not as straightforward as you suggest, even in Unix.

When converting from any filesystem path to a URI,
questions to consider include:

- For what kind of filesystem / OS is the path?
  - Windows, MS-DOS
  - Unix / POSIX (Linux, FreeBSD, Solaris, Mac OS X, Cygwin, etc.)
  - legacy Mac OS (OS 9 and prior)
  - (etc.)

- If the path's filesystem / OS is not given, what do you do?
  - assume the path is appropriate for the local OS?
  - reject the path?

- If the path's OS is unsupported, what do you do?
  - reject the path?
  - use a default algorithm, like just prepending 'file:' and
    percent-encoding as required?

- Is the path 'absolute'?
  - If it's a UNIX path, whether it starts with "/" is the only
    qualification, I believe.
  - If it's a Windows path, it could be absolute if it matches the
    regular expression ^(\\|[A-Za-z]:) - that is, it either starts
    with "\" or a drivespec (an ASCII-range letter followed by ":").
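
A rough sketch of the two absoluteness tests just described, in Haskell to
match Graham's snippet. The names PathStyle and isAbsolutePath are made up
for illustration. The Windows branch simply mirrors the regular expression
above, so it also accepts 'C:foo', which Windows itself resolves relative to
the current directory on drive C; a stricter test might require a "\" after
the drivespec.

```haskell
import Data.Char (isAlpha, isAscii)

-- | Which OS convention a path is written in (hypothetical type for this sketch).
data PathStyle = UnixStyle | WindowsStyle
  deriving (Eq, Show)

-- | True if the path looks absolute under the given convention.
-- Unix: starts with "/".  Windows: matches ^(\\|[A-Za-z]:).
isAbsolutePath :: PathStyle -> String -> Bool
isAbsolutePath UnixStyle    ('/' : _)     = True
isAbsolutePath WindowsStyle ('\\' : _)    = True
isAbsolutePath WindowsStyle (d : ':' : _) = isAscii d && isAlpha d
isAbsolutePath _            _             = False
```

With this, 'C:\autoexec.bat' and '\autoexec.bat' count as absolute Windows
paths, while 'the\path' and '9:\autoexec.bat' do not.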

- If the path is not absolute, e.g. it looks like 'the/path',
   - reject it?
   - create a relative URI reference? ('the/path')
   - create an RFC 2396bis-compliant, but RFC 1738-offending,
     URI like 'file:the/path'?
   - attempt to make the path absolute by interpreting it to be
     relative to the local host's 'current working directory', if
     such a concept exists in the local OS?
     What if the path is for some other OS?
     And do you make it absolute according to the OS's conventions
     first, or do you do an RFC 2396bis conformant resolution of
     a relative URI reference ('the/path') against the base URI
     that is derived from the current working directory?

- Do you attempt to collapse dot segments (or equivalent) in the
  path or in the resulting URI? Does it depend on whether the path
  or URI is absolute? A reason to collapse dot segments in an
  absolute URI is so that the URI can be suitable for use as a base
  URI for RFC 2396bis conformant resolution.
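
For illustration, collapsing dot segments in an absolute path can be sketched
as below. This is a simplification, not the exact procedure from RFC 2396bis:
it works on a list of already-split segments (the name collapseDots is
hypothetical), silently drops a '..' that would climb above the root, and
does not preserve the trailing slash that a final '.' or '..' would normally
leave behind.

```haskell
-- | Collapse "." and ".." in the segments of an absolute path.
-- A ".." at the root has nothing to remove and is dropped.
collapseDots :: [String] -> [String]
collapseDots = reverse . foldl step []
  where
    -- The accumulator holds the kept segments in reverse order.
    step acc       "."  = acc
    step []        ".." = []
    step (_ : acc) ".." = acc
    step acc       seg  = seg : acc
```

For example, the segments of '/a/b/../c/./d' collapse to those of '/a/c/d'.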

- Is the mapping between segments in the filesystem path and
  segments in the path component of the URI well-defined?
  - On Unix, it should be sufficient to percent-encode all
    non-unreserved characters. Note that '/' may appear *within*
    a segment, though (you can put a slash in a filename), so be
    sure to apply percent-encoding to each segment individually.
  - On Windows,
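
For the Unix case above, a minimal sketch of encoding each segment
individually. It assumes the segments arrive as Unicode strings, that UTF-8
is the basis for percent-encoding, and that the unreserved set is the one in
RFC 2396bis (letters, digits, "-", ".", "_", "~"). All function names here
are hypothetical.

```haskell
import Data.Char (isAlphaNum, isAscii, ord)

-- | Unreserved characters, assuming the RFC 2396bis set.
isUnreserved :: Char -> Bool
isUnreserved c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | UTF-8 bytes of one character, computed by hand to stay within base.
utf8Bytes :: Char -> [Int]
utf8Bytes c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 + n `div` 64, 0x80 + n `mod` 64]
  | n < 0x10000 = [0xE0 + n `div` 4096, cont (n `div` 64), cont n]
  | otherwise   = [0xF0 + n `div` 262144, cont (n `div` 4096), cont (n `div` 64), cont n]
  where
    n      = ord c
    cont m = 0x80 + m `mod` 64   -- continuation byte carrying six bits

-- | "%XX" for a single byte value (0..255).
pctByte :: Int -> String
pctByte b = ['%', hexDigit (b `div` 16), hexDigit (b `mod` 16)]
  where hexDigit i = "0123456789ABCDEF" !! i

-- | Percent-encode one path segment; a '/' inside it becomes %2F.
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c | isUnreserved c = [c]
          | otherwise      = concatMap pctByte (utf8Bytes c)

-- | Build a 'file' URI from the segments of an absolute Unix path.
unixSegmentsToFileUri :: [String] -> String
unixSegmentsToFileUri []   = "file:///"
unixSegmentsToFileUri segs = "file://" ++ concatMap (('/' :) . encodeSegment) segs
```

Because the separators are added only after each segment has been encoded,
a '/' that is part of a segment can never be confused with one that
separates segments.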

- If the path purports to be for a particular OS, but does not match
  that OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows,
  - reject the path?
  - be as lenient as possible, e.g. replace '/' with '\' for Windows?
    What about '9:\autoexec.bat' on Windows (bad drivespec)? acceptable?

- If the path is provided as a sequence of Unicode characters,
  - form the URI by leaving unreserved characters as-is, and percent-
    encoding the rest, using UTF-8 as the basis? (RFC 2396bis default)
  - use some other encoding more appropriate to the path's OS?

- If the path is provided as a sequence of bytes, not Unicode characters,
  with no additional info about encoding,
  - reject it because it can't be decoded to Unicode?
  - assume a default encoding?
    - based on...? How confident can you be about, say, a filesystem
      default encoding? (probably not very)
  - attempt no decode; just form the URI by converting to unreserved 
    characters only those bytes that, when decoded as ASCII, correspond
    to unreserved characters, and percent-encoding the rest of the bytes
    individually?
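
The last option, attempting no decode at all, might look like the sketch
below. It treats the input as raw bytes, keeps only those bytes that are
ASCII unreserved characters (again assuming the RFC 2396bis set), and
percent-encodes every other byte on its own. The function names are
hypothetical.

```haskell
import Data.Word (Word8)

-- | True for bytes that, read as ASCII, are unreserved characters.
isUnreservedByte :: Word8 -> Bool
isUnreservedByte b =
  (b >= 0x41 && b <= 0x5A)                -- 'A'..'Z'
    || (b >= 0x61 && b <= 0x7A)           -- 'a'..'z'
    || (b >= 0x30 && b <= 0x39)           -- '0'..'9'
    || b `elem` [0x2D, 0x2E, 0x5F, 0x7E]  -- '-', '.', '_', '~'

-- | Percent-encode a byte sequence without decoding it first.
pctEncodeBytes :: [Word8] -> String
pctEncodeBytes = concatMap enc
  where
    enc b
      | isUnreservedByte b = [toEnum (fromIntegral b)]
      | otherwise          = ['%', hex (b `div` 16), hex (b `mod` 16)]
    hex n = "0123456789ABCDEF" !! fromIntegral n
```

This also encodes the byte for '/', so it would have to be applied to each
segment separately rather than to the whole path.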
   
- For a Windows path, is it in the form of a local path or a UNC path?
  ("local" may not be the right term)
  - local, absolute, with drivespec:   C:\autoexec.bat
  - local, absolute, no drivespec:     \autoexec.bat
  - local, relative:                   the\path
  - UNC:                               \\host\share\autoexec.bat

- Do you map the UNC host name to the authority component?
  Don't forget to percent-encode.

- Do you leave the UNC share name as the first segment of the path
  component, or..? And don't forget to percent-encode.
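
One possible answer to the two UNC questions above, as a sketch only: put the
host name in the authority component and keep the share name as the first
path segment, percent-encoding both. Whether that is the right mapping is
exactly what is up for discussion. The function names are hypothetical, and
UTF-8 plus the RFC 2396bis unreserved set are assumed for the encoding.

```haskell
import Data.Char (isAlphaNum, isAscii, ord)

-- | Split a string on a separator character, keeping empty pieces.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Percent-encode everything except unreserved characters (UTF-8 assumed).
pctEncode :: String -> String
pctEncode = concatMap enc
  where
    enc c
      | isAscii c && (isAlphaNum c || c `elem` "-._~") = [c]
      | otherwise = concatMap pct (utf8 (ord c))
    pct b = ['%', hex (b `div` 16), hex (b `mod` 16)]
    hex i = "0123456789ABCDEF" !! i
    cont m = 0x80 + m `mod` 64
    utf8 n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 + n `div` 64, cont n]
      | n < 0x10000 = [0xE0 + n `div` 4096, cont (n `div` 64), cont n]
      | otherwise   = [0xF0 + n `div` 262144, cont (n `div` 4096), cont (n `div` 64), cont n]

-- | Map \\host\share\rest to file://host/share/rest.
-- Returns Nothing if the path is not in UNC form or the host is empty.
uncToFileUri :: String -> Maybe String
uncToFileUri ('\\' : '\\' : rest) =
  case splitOn '\\' rest of
    (host : segs)
      | not (null host) ->
          Just ("file://" ++ pctEncode host ++ concatMap (('/' :) . pctEncode) segs)
    _ -> Nothing
uncToFileUri _ = Nothing
```

Under these assumptions '\\host\share\autoexec.bat' becomes
'file://host/share/autoexec.bat', and anything that does not start with two
backslashes is rejected.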

- Networked instances of Windows do weird things like refer to network
  printers like this: '\\http://192.168.0.1/printername', and refer to
  shared drives like this: '\\sharename\$d$\autoexec.bat'. When are these
  conventions used? I saw the former today, and the latter a few years
  back on NT4 systems. Are they documented anywhere, and do you want to
  attempt to deal with them?

- For a Windows path, do you do any case normalization, e.g. in the
  drivespec? ('c:' -> 'C:')

- Windows uses ":" in the drivespec (and nowhere else, currently).
  ":" is a reserved character in a URI, but does not need to be
  percent-encoded in a path segment. Therefore, 'file:///C:/autoexec.bat'
  is acceptable as a URI, and is equivalent to 'file:///C%3A/autoexec.bat'.

  There is a convention of using "|", e.g. 'file:///C|/autoexec.bat', I believe 
  because of the ambiguities that arise when you have situations like 'C:/foo'
  as a relative URL being resolved against, say, 'file:/autoexec.bat' or
  'file:C:/autoexec.bat' and so on - things that appear in the wild and may(?)
  have been canon at one time, but don't play nicely with any relative
  resolution algorithms.

  I haven't much sympathy for "|" and feel it should be deprecated as much
  as possible. Resolvers should continue to accept it and treat it as
  synonymous with a drivespec ":". On that note, though, should they treat
  *all* "|" as ":", or just those that appear to be a drivespec?

  If ":" or "|" ever become legal characters in Windows paths... then what?
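
To make the narrower of the two choices concrete, here is a sketch that
treats "|" as ":" only when it appears to be a drivespec, that is, in
'/X|' at the very start of the URI path with X an ASCII letter. It then
strips the leading '/' much as Graham's snippet does and upper-cases the
drive letter. The function names are hypothetical, and percent-decoding is
deliberately left out to keep the example short.

```haskell
import Data.Char (isAlpha, isAscii, toUpper)

-- | An ASCII letter, the only thing accepted as a drive letter here.
isDriveLetter :: Char -> Bool
isDriveLetter d = isAscii d && isAlpha d

-- | Rewrite a leading "/X|" to "/X:"; every other '|' is left alone.
fixDrivePipe :: String -> String
fixDrivePipe ('/' : d : '|' : rest)
  | isDriveLetter d && take 1 rest `elem` ["", "/"] = '/' : d : ':' : rest
fixDrivePipe path = path

-- | Turn the path component of a 'file' URI into a Windows path.
-- No percent-decoding is attempted in this sketch.
uriPathToWindows :: String -> String
uriPathToWindows path = case fixDrivePipe path of
  ('/' : d : ':' : rest) | isDriveLetter d -> toUpper d : ':' : map slash rest
  other                                    -> map slash other
  where
    slash '/' = '\\'
    slash c   = c
```

So '/c|/autoexec.bat' comes out as 'C:\autoexec.bat', while a "|" further
along, as in '/C|/a|b', is kept as-is.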

- Empty segments in the path: collapse them? Depends on OS?
  (gets tricky round-tripping on Windows with UNC paths.. I'd have to
   experiment again to give you some good examples though. I decided not to
   worry about it too much).

That's all for now, and that's only touching on the conversion *to* a URI,
for just two OSes. The conversion from a URI to an OS path is even worse...

-Mike

Received on Tuesday, 13 July 2004 16:24:02 UTC