[whatwg/fetch] Should file path <-> file URL handling be standardized? (#1338)

I've been trying to work out a good algorithm for this, looking at the 3 major browsers for guidance and compatibility (I would like any file URLs I create to work when given to a browser or other application). As far as I can tell, they all seem to have slightly different behaviour, especially when it comes to quirky edge-cases such as POSIX paths with invalid UTF-8, or Windows paths with invalid UTF-16, and the test-suites for each seem to be quite bare, without coverage for many edge-cases or negative tests.

I should preface this by saying that I'm not terribly familiar with any of these codebases. I've been digging through them in an attempt to understand how they might interpret file URLs I create, or how I should interpret file URLs they create, but they are enormous, complex projects (as I'm sure most readers will know all too well); it's certainly possible that I have some of these details wrong, and I would appreciate any corrections from those who are more familiar with them. Also, I'm not trying to call anybody out or criticise any of these projects - paths are just horrible; it's not surprising that there is some divergent behaviour when it comes to handling them.

This seems to be a feature that most modern browsers have to support, and I think there is value in ensuring that they all produce the same URLs for the same file paths, the same file paths for the same URLs, and that other applications can produce and consume file URLs in a way that is compatible with whichever browser the user prefers.

## Quick recap of file paths

File paths on POSIX-y systems are semi-arbitrary bytes. The ASCII forward-slash (`/`, 0x2F) is reserved as a path separator, the ASCII nul (0x00) is typically reserved to mark the end of the byte-string, and components of one or two ASCII periods (`.`, 0x2E) are references to the current or parent directory, respectively (I spent a while sweating over this, because periods are not typically listed as reserved characters, but it seems [the Linux kernel interprets them](https://github.com/torvalds/linux/blob/303392fd5c160822bf778270b28ec5ea50cab2b4/fs/namei.c#L2235), so I guess that's fine?). Otherwise, there are no restrictions on what a file or directory name may contain; they're just bytes, and maybe it's not even correct to interpret them as "text" at all.

Some systems, such as Apple's Darwin-based OSes, guarantee that file and directory names are valid UTF-8 text; the system just won't allow you to create a file whose name can't be represented in UTF-8. The filesystem may perform some Unicode normalization - so if you create a file using a certain sequence of bytes and list the directory, you may not see a file with the same sequence of bytes, but a Unicode-aware comparison function will confirm that a file of the given name exists. 
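To make the normalization point concrete, here is a small sketch using Python's standard `unicodedata` module. It does not model any particular filesystem (which normalization form a given filesystem uses is not assumed here); the file name and the `same_name` helper are made up for illustration.

```python
import unicodedata

# "e with acute" as one precomposed code point (NFC form)...
name_nfc = "caf\u00e9.txt"
# ...and as "e" followed by a combining acute accent (NFD form).
name_nfd = unicodedata.normalize("NFD", name_nfc)

# The UTF-8 byte sequences differ, so a byte-wise comparison fails.
bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")
bytes_differ = bytes_nfc != bytes_nfd


def same_name(a: str, b: str) -> bool:
    """Hypothetical Unicode-aware comparison: normalize both sides first."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A file URL built from one byte sequence and a directory listing returning the other would percent-encode differently, even though they name the same file.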

On Windows, file paths are UTF-16. There are more reserved characters, such as backslashes (`\`, 0x5C), question-marks and colons (`?`, 0x3F, and `:`, 0x3A, respectively), but it's better than plain POSIX because you at least have an encoding and the system tells us it's okay to interpret file names as text. That being said, Windows does not actually enforce that file names are _well-formed_ UTF-16, so they can contain things like unpaired surrogates, which are forbidden from being transcoded to UTF-8.
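As a rough illustration of why unpaired surrogates are awkward, the sketch below uses Python's built-in codecs as a stand-in for a strict and a lenient transcoder. The file name and both helper functions are hypothetical; none of this is taken from a browser.

```python
# A lone high surrogate (U+D800), as could appear in an ill-formed
# UTF-16 file name on Windows.
name = "bad\ud800name"


def to_utf8_strict(s: str):
    """Return UTF-8 bytes, or None if s contains an unpaired surrogate."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


def to_utf8_lenient(s: str) -> bytes:
    """Encode surrogates as if they were ordinary code points.

    The result looks like UTF-8 but is not well-formed UTF-8.
    """
    return s.encode("utf-8", errors="surrogatepass")


def is_valid_utf8(b: bytes) -> bool:
    """Check whether a strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A strict converter has to reject the name outright, while a lenient one produces bytes that a strict decoder on the receiving side will then refuse.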

## Chromium

The main routines appear to be `net/base/FilePathToFileURL` and `net/base/FileURLToFilePath` (implemented [here](https://github.com/chromium/chromium/blob/87ca5380f1d05f7218edd5659ed857b16a0bb2d0/net/base/filename_util.cc#L29)). There are a [handful](https://github.com/chromium/chromium/blob/c7071ba573676de2b7f81163a39d5cdac135d125/net/base/filename_util_unittest.cc#L173) of tests for Windows- and POSIX-style paths.

There are also lots of gaps - weird paths/URLs with repeated slashes don't appear to have much test coverage, and IPv6 addresses are not transliterated for Windows UNC paths (UNC paths apparently can't contain IPv6 addresses, so a file URL like `file://[2001:db8::]/foo` should be turned into the UNC path `\\2001-db8--.ipv6-literal.net\foo` - the platform recognizes this domain and resolves it locally).
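The IPv6 transliteration mentioned above can be sketched as follows. This is only my reading of the `ipv6-literal.net` convention, not code from any browser; the function names are hypothetical, and zone identifiers are deliberately not handled.

```python
import ipaddress


def ipv6_host_to_unc_host(host: str) -> str:
    """Turn a bracketed IPv6 URL host into an ipv6-literal.net name.

    Hosts that are not bracketed IPv6 literals are returned unchanged.
    """
    if not (host.startswith("[") and host.endswith("]")):
        return host
    literal = host[1:-1]
    # Raises ValueError if the text between the brackets is not valid IPv6.
    ipaddress.IPv6Address(literal)
    # Colons are not allowed in UNC host names, so they become dashes.
    return literal.replace(":", "-") + ".ipv6-literal.net"


def file_url_parts_to_unc(host: str, path: str) -> str:
    """Build a UNC path from an already-parsed file URL host and path."""
    return "\\\\" + ipv6_host_to_unc_host(host) + path.replace("/", "\\")
```

For example, `file_url_parts_to_unc("[2001:db8::]", "/foo")` gives the UNC path from the example above.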

As for non-UTF-8 path components, Chromium's `FilePath` stores its path in the platform's native format - either an `std::string` for POSIX-y systems or `std::wstring` for Windows. It includes an [`AsUTF8Unsafe()`](https://github.com/chromium/chromium/blob/c7071ba573676de2b7f81163a39d5cdac135d125/base/files/file_path.h#L397) method which, on those POSIX systems where UTF-8 can't be assumed, invokes the standard C multi-byte-to-wide-char conversion routines followed by a wide-char-to-UTF-8 conversion. This assumes the byte string is some locale-dependent text, rather than arbitrary bytes. `FilePathToFileURL` calls this method, so the URLs it produces always decode as valid UTF-8 file and directory names, but the transcoding is not reversed by `FileURLToFilePath`. That said, if the file URL does contain invalid UTF-8, `FileURLToFilePath` will preserve it in the POSIX path. On Windows, the system's wide-char-to-UTF-8 conversion is used (rather than the C version), so it's unclear what ill-formed UTF-16, such as unpaired surrogates, will produce. I don't _think_ it's tested?

Additionally, Chromium does some basic normalization, such as collapsing repeated slashes.
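For reference, collapsing repeated slashes is about as simple as normalization gets. The helper below is a hypothetical sketch, not Chromium's implementation, and it ignores the cases where a leading double slash is significant (such as UNC-style paths).

```python
import re


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more "/" with a single "/"."""
    return re.sub(r"/{2,}", "/", path)
```

So `collapse_slashes("/usr//local///bin")` yields `/usr/local/bin`.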

## WebKit

The main routines appear to be [`URL::fileURLWithFileSystemPath`](https://github.com/WebKit/WebKit/blob/1e5a9791388c8e05e165a8c9168f280440301fe3/Source/WTF/wtf/URL.cpp#L1035) and [`URL::fileSystemPath`](https://github.com/WebKit/WebKit/blob/1e5a9791388c8e05e165a8c9168f280440301fe3/Source/WTF/wtf/URL.cpp#L234). 

Reiterating that I'm not familiar with any of these codebases: while I could find tests for other parts of the WTF API, I couldn't find any tests for this particular functionality. I'm also not able to find a low-level "file path" type in WebKit. There seems to be an assumption that all paths are strings and can be interpreted as text, and WebKit's string type appears to support both Latin-1 and UTF-16-encoded strings. I may have this wrong, so I'd appreciate any corrections.

Nonetheless, these routines clearly reject non-local file URLs, and simply percent-encode a handful of special characters (`?`, `#`, and non-ASCII bytes) into the URL's path component. The percent-encoding routine will also get the contents of the given path as UTF-8 using the default, lenient conversion mode, which encodes unpaired surrogates from ill-formed UTF-16 as UTF-8.

The reverse conversion process will percent-decode the path, then call `fileSystemRepresentation` on Windows only. This function calls the system `WideCharToMultiByte`, converting wide characters to the system's active ANSI code-page. This isn't really advisable, since the ANSI code-page in general cannot represent every Unicode character, so the conversion may be lossy. Additionally, as with Chromium, the transcoding to UTF-8 is not reversed, so the resulting bytes would not be equal to the original path.
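To show what "lossy" means here, the sketch below uses Python's `cp1252` codec as a stand-in for a conversion to an ANSI code-page. Windows-1252 is just an example code-page I picked; the real result depends on the system's active code-page and on the flags passed to `WideCharToMultiByte`, and `to_ansi` is a hypothetical helper.

```python
def to_ansi(name: str, code_page: str = "cp1252") -> bytes:
    """Encode to a single-byte code page, substituting "?" for anything
    the code page cannot represent (a lossy conversion)."""
    return name.encode(code_page, errors="replace")


# Representable in Windows-1252: survives the round trip.
latin_name = "caf\u00e9.txt"
latin_ok = to_ansi(latin_name).decode("cp1252") == latin_name

# Three CJK characters: not representable, so each becomes "?".
cjk_name = "\u65e5\u672c\u8a9e.txt"
cjk_ansi = to_ansi(cjk_name)
cjk_ok = cjk_ansi.decode("cp1252") == cjk_name
```

Once the name has collapsed to question marks, there is no way to recover the original path from the converted bytes.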

## Firefox (assuming servo/rust-url is used?)

Rust-url has [`from_file_path`](https://github.com/servo/rust-url/blob/d673c4d5e22b3a8ac91b7f52faa45dc32a275f75/url/src/lib.rs#L2303) and [`to_file_path`](https://github.com/servo/rust-url/blob/d673c4d5e22b3a8ac91b7f52faa45dc32a275f75/url/src/lib.rs#L2457) methods. These methods work in terms of a native `Path` type which wraps an `OsString` - this string differs from Rust's standard string types in that it may not contain valid Unicode text. There are a handful of tests in [this file](https://github.com/servo/rust-url/blob/d673c4d5e22b3a8ac91b7f52faa45dc32a275f75/url/tests/unit.rs#L79), but again, they don't seem to be very comprehensive. Perhaps there are others which I can't find.

On POSIX systems, these bytes are percent-encoded directly into the URL's path segment, without prior conversion to UTF-8. Similarly, the reverse process decodes file and directory names as raw bytes into an `OsString`.
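The byte-preserving approach is easy to sketch with Python's `urllib.parse`. This is not rust-url's code, and the exact set of characters it escapes may differ from Python's; the two function names are hypothetical, and only absolute paths with an empty host are handled.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def posix_path_to_file_url(path: bytes) -> str:
    """Percent-encode raw path bytes without assuming any text encoding."""
    if not path.startswith(b"/"):
        raise ValueError("expected an absolute POSIX path")
    # quote_from_bytes leaves "/" unescaped by default (safe="/").
    return "file://" + quote_from_bytes(path)


def file_url_to_posix_path(url: str) -> bytes:
    """Reverse the conversion, returning the original bytes."""
    prefix = "file://"
    if not url.startswith(prefix + "/"):
        raise ValueError("expected a file URL with an empty host")
    return unquote_to_bytes(url[len(prefix):])


# 0xE9 and 0xFF are not valid UTF-8 here, but they still round-trip.
raw = b"/tmp/caf\xe9 \xff.txt"
url = posix_path_to_file_url(raw)
```

Because nothing is transcoded, `file_url_to_posix_path(url)` returns exactly the bytes that went in, which is the property the other implementations lose.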

On Windows, if the path/`OsString` contains invalid Unicode text, creating a file URL will fail. Similarly, the reverse process will fail if the URL contains encoded invalid UTF-8.

## Summary

There seem to be quite a few approaches to handling file paths. For POSIX-style paths, both Chrome and WebKit seem to destroy the original path information when it is not UTF-8, whilst rust-url will preserve these paths exactly. Windows paths are _supposed_ to be Unicode by definition, but when invalid Unicode is encountered, browsers run the gamut of behaviour from "whatever the system wants" to allowing or rejecting. Again, my personal opinion is that rust-url's behaviour is the most reasonable, as these kinds of file names should be exceedingly rare (and hopefully one day Microsoft will patch Windows to just disallow them, as Apple's platforms disallow invalid UTF-8).

As a developer of non-browser applications, I would prefer if browsers documented and standardised their behaviour, ideally based on what rust-url does. It might be worth incorporating some basic path normalization, such as collapsing repeated slashes, so that the resulting URLs are easy to work with and display nicely. 

There are other outstanding issues, such as [preserving `localhost`](https://github.com/whatwg/url/issues/618), and how to represent Windows `\\?\foo` paths. Not only is `?` not a valid hostname (and it [can't be percent-encoded](https://github.com/whatwg/url/issues/599)), but these paths should apparently be sent to the OS directly without any interpretation at all - i.e. you could actually have a file or directory name called `.` or `..` using these paths! It's unclear how to represent those in a file URL conforming to this standard, since it will always interpret `.` and `..` path components - maybe you'd have to percent-encode the entire thing, path separators and all, as one giant path component?

Thoughts? Ideas? Corrections?

Reply to this email directly or view it on GitHub:
https://github.com/whatwg/fetch/issues/1338

Received on Wednesday, 20 October 2021 11:14:54 UTC