- From: Mike Brown <mike@skew.org>
- Date: Tue, 13 Jul 2004 06:04:08 -0600 (MDT)
- To: uri@w3.org
Graham Klyne wrote:
> I've done some implementation work recently that figures a URI from a
> filename and vice versa. For Unix-like systems, I think the correspondence
> is pretty clear, but for Windows I needed to engage in some guesswork about
> how to deal with device (drive) names. Part of my code looks like this:
> [[
>     -- strip off leading '/' from Windows drive name
>     source = fileuripath (path uri)
>     fileuripath ('/':file@(d:':':more)) | driveLetter d = file
>     fileuripath file = file
>     driveLetter d = d `elem` ['A'..'Z']
> ]]
> That is, on windows systems, FILE://localhost/D:/dir/file is treated as a
> reference to file D:\dir\file on the current host system. But other
> software I have seen in the past uses '|' in place of the ':'. I'm not
> sure what is the current preferred approach.

This topic is worthy of a separate thread, so I'm spinning it off now.
If we could get more implementers to agree on best practices for converting
OS-specific filesystem paths to URIs *and* vice-versa, it would be a Good
Thing.

First, a couple of references:

RFC 1738:
http://www.ietf.org/rfc/rfc1738
  Summarizes the format of several URL schemes, two of which have been
  made obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto). Also provides
  generic URL syntax and related rules that have been superseded by
  RFCs 1808 and 2396. The 'file' scheme defined here is still in effect.

RFC 1738bis (current draft):
http://www.ietf.org/internet-drafts/draft-hoffman-rfc1738bis-02.txt
(I think the date in it is supposed to say April 19, 2004, not 2003,
since the previous draft was dated October 2003).
  An attempt to preserve and update the URL scheme summaries that have
  not been made obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto).

RFC 2396bis (current draft):
http://www.gbiv.com/protocols/uri/rev-2002/rfc2396bis.html
  I think we are all familiar with this one.

Second, the 'file' URI scheme as defined in RFC 1738 leaves much up to the
interpreter.
It might be argued that it has no choice but to leave things ambiguous,
because IETF RFCs apply only to the Internet, while the 'file' scheme is
defined as a non-Internet protocol. It might very well be beyond the scope
of an RFC to mandate how to derive a URI from an OS-specific path and
vice-versa (I've no idea if this is the case, I'm just saying...)

Third, things are not as straightforward as you suggest, even in Unix.

When converting from any filesystem path to a URI, questions to consider
include:

  - For what kind of filesystem / OS is the path?
    - Windows, MS-DOS
    - Unix / POSIX (Linux, FreeBSD, Solaris, Mac OS X, Cygwin, etc.)
    - legacy Mac OS (OS 9 and prior)
    - (etc.)

  - If the path's filesystem / OS is not given, what do you do?
    - assume the path is appropriate for the local OS?
    - reject the path?

  - If the path's OS is unsupported, what do you do?
    - reject the path?
    - use a default algorithm, like just prepending 'file:' and
      percent-encoding as required?

  - Is the path 'absolute'?
    - If it's a UNIX path, whether it starts with "/" is the only
      qualification, I believe.
    - If it's a Windows path, it could be absolute if it matches the
      regular expression ^(\\|[A-Za-z]:) - that is, it either starts
      with "\" or a drivespec (an ASCII-range letter followed by ":").

  - If the path is not absolute, e.g. it looks like 'the/path',
    - reject it?
    - create a relative URI reference? ('the/path')
    - create an RFC 2396bis-compliant, but RFC 1738-offending, URI like
      'file:the/path'?
    - attempt to make the path absolute by interpreting it to be relative
      to the local host's 'current working directory', if such a concept
      exists in the local OS? What if the path is for some other OS?
      And do you make it absolute according to the OS's conventions first,
      or do you do an RFC 2396bis conformant resolution of a relative URI
      reference ('the/path') against the base URI that is derived from
      the current working directory?

  - Do you attempt to collapse dot segments (or equivalent) in the path or
    in the resulting URI? Does it depend on whether the path or URI is
    absolute? A reason to collapse dot segments in an absolute URI is so
    that the URI can be suitable for use as a base URI for RFC 2396bis
    conformant resolution.

  - Is the mapping between segments in the filesystem path and segments in
    the path component of the URI well-defined?
    - On Unix, it should be sufficient to percent-encode all non-unreserved
      characters. Note that '/' may appear *within* a segment, though
      (you can put a slash in a filename), so be sure to apply
      percent-encoding to each segment individually.
    - On Windows,

  - If the path purports to be for a particular OS, but does not match that
    OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows,
    - reject the path?
    - be as lenient as possible, e.g. replace '/' with '\' for Windows?
      What about '9:\autoexec.bat' on Windows (bad drivespec)? acceptable?

  - If the path is provided as a sequence of Unicode characters,
    - form the URI by leaving unreserved characters as-is, and
      percent-encoding the rest, using UTF-8 as the basis? (RFC 2396bis
      default)
    - use some other encoding more appropriate to the path's OS?

  - If the path is provided as a sequence of bytes, not Unicode characters,
    with no additional info about encoding,
    - reject it because it can't be decoded to Unicode?
    - assume a default encoding?
      - based on...? How confident can you be about, say, a filesystem
        default encoding? (probably not very)
    - attempt no decode; just form the URI by converting to unreserved
      characters only those bytes that, when decoded as ASCII, correspond
      to unreserved characters, and percent-encoding the rest of the bytes
      individually?

  - For a Windows path, is it in the form of a local path or a UNC path?
    ("local" may not be the right term)
    - local, absolute, with drivespec: C:\autoexec.bat
    - local, absolute, no drivespec: \autoexec.bat
    - local, relative: the\path
    - UNC: \\host\share\autoexec.bat
      - Do you map the UNC host name to the authority component?
        Don't forget to percent-encode.
      - Do you leave the UNC share name as the first segment of the path
        component, or..? And don't forget to percent-encode.
      - Networked instances of Windows do weird things like refer to
        network printers like this: '\\http://192.168.0.1/printername',
        and refer to shared drives like this:
        '\\sharename\$d$\autoexec.bat'. When are these conventions used?
        I saw the former today, and the latter a few years back on NT4
        systems. Are they documented anywhere, and do you want to attempt
        to deal with them?

  - For a Windows path, do you do any case normalization, e.g. in the
    drivespec? ('c:' -> 'C:')

  - Windows uses ":" in the drivespec (and nowhere else, currently).
    ":" is a reserved character in a URI, but does not need to be
    percent-encoded in a path segment. Therefore, 'file:///C:/autoexec.bat'
    is acceptable as a URI, and is equivalent to
    'file:///C%3A/autoexec.bat'. There is a convention of using "|", e.g.
    'file:///C|/autoexec.bat', I believe because of the ambiguities that
    arise when you have situations like 'C:/foo' as a relative URL being
    resolved against, say, 'file:/autoexec.bat' or 'file:C:/autoexec.bat'
    and so on - things that appear in the wild and may(?) have been canon
    at one time, but don't play nicely with any relative resolution
    algorithms. I haven't much sympathy for "|" and feel it should be
    deprecated as much as possible. Resolvers should continue to accept it
    and treat it as synonymous with a drivespec ":". On that note, though,
    should they treat *all* "|" as ":", or just those that appear to be a
    drivespec? If ":" or "|" ever become legal characters in Windows
    paths... then what?

  - Empty segments in the path: collapse them? Depends on OS?
    (gets tricky round-tripping on Windows with UNC paths.. I'd have to
    experiment again to give you some good examples though. I decided not
    to worry about it too much).

That's all for now, and that's only touching on the conversion *to* a URI,
for just two OSes. The conversion from a URI to an OS path is even worse...

-Mike
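
To make a few of the questions above concrete, here is a minimal Python
sketch of one possible set of answers. It is only an illustration, not a
recommended or agreed-upon algorithm. All function names are hypothetical,
and the choices are assumptions: paths arrive as Unicode strings and are
percent-encoded as UTF-8, '/' in a Windows path is leniently treated as
'\', the drive letter is upper-cased, a UNC host goes to the authority
component with the share name as the first path segment, relative paths
are rejected, and empty or dot segments are left alone.

```python
import re
from urllib.parse import quote

# Drivespec: an ASCII-range letter followed by ":".
_DRIVESPEC = re.compile(r'^[A-Za-z]:')


def _encode_segments(segments):
    # Percent-encode each segment individually (UTF-8 basis), leaving
    # only unreserved characters as-is, then join with "/".
    return '/'.join(quote(seg, safe='') for seg in segments)


def posix_path_to_file_uri(path):
    """Hypothetical helper: absolute Unix/POSIX path -> 'file' URI."""
    if not path.startswith('/'):
        raise ValueError('relative path rejected: %r' % path)
    # The leading empty segment yields the leading "/" of the URI path.
    return 'file://' + _encode_segments(path.split('/'))


def windows_path_to_file_uri(path):
    """Hypothetical helper: absolute Windows path -> 'file' URI."""
    # Assumption: be lenient and accept '/' as a separator.
    path = path.replace('/', '\\')
    if path.startswith('\\\\'):
        # UNC: \\host\share\... ; host -> authority, share -> 1st segment.
        host, _, rest = path[2:].partition('\\')
        if not host:
            raise ValueError('UNC path without host: %r' % path)
        return ('file://' + quote(host, safe='') + '/'
                + _encode_segments(rest.split('\\')))
    if _DRIVESPEC.match(path):
        if path[2:3] != '\\':
            # e.g. 'C:foo' has a drivespec but is relative to that drive.
            raise ValueError('drive-relative path rejected: %r' % path)
        drive = path[0].upper() + ':'  # assumption: normalize 'c:' -> 'C:'
        return 'file:///' + drive + '/' + _encode_segments(path[3:].split('\\'))
    if path.startswith('\\'):
        # Absolute, no drivespec.
        return 'file:///' + _encode_segments(path[1:].split('\\'))
    raise ValueError('relative path rejected: %r' % path)


def normalize_drive_pipe(uri):
    # Treat "|" as ":" only where it appears to be a drivespec.
    return re.sub(r'^(file:///[A-Za-z])\|(?=/|$)', r'\1:', uri,
                  flags=re.IGNORECASE)


print(windows_path_to_file_uri('C:\\autoexec.bat'))
print(windows_path_to_file_uri('\\\\host\\share\\autoexec.bat'))
print(posix_path_to_file_uri('/etc/passwd'))
print(normalize_drive_pipe('file:///C|/autoexec.bat'))
```

Every branch here picks one option where the list above offers several, so
a different implementation could reasonably reject 'C:/autoexec.bat', keep
the drive letter's case, or resolve relative paths against the current
working directory instead of rejecting them.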
Received on Tuesday, 13 July 2004 16:24:02 UTC