[whatwg] Drag-and-drop folders/files support with directory structure using DirectoryEntry from Glenn Maynard on 2011-11-16 (public-whatwg-archive@w3.org from November 2011)

From: Glenn Maynard <glenn@zewt.org>
Date: Wed, 16 Nov 2011 13:21:08 -0500
Message-ID: <CABirCh8jm_hXh_ALQ9YfwN1enmQiy2Bv5O8s_u0pmVXee8x8SQ@mail.gmail.com>
On Wed, Nov 16, 2011 at 3:42 AM, Jonas Sicking <jonas at sicking.cc> wrote:

> > That requires a full directory traversal in advance to find all of the
> > files, though; the tree could be very large.
>
> You need to do that anyway to implement the .files attribute, no?
>

.files shouldn't recursively include all files inside directories.  (If you
actually select tens of thousands of files and drag them, then yes, but in
most cases when you have that many files, they're split into directories
and you don't normally drag them individually.)


On Wed, Nov 16, 2011 at 9:59 AM, Kinuko Yasuda <kinuko at chromium.org> wrote:

>  The unsandboxed storage and actual data doesn't belong to origin, but
> the 'origin-specific' concept can be applied to the filesystem
> namespace.
>
> I haven't thought about workers cases deeply yet, but am thinking that
> we should prohibit access to the dropped folders from the other pages
> than the one that received the drop event.


Access to a file should just be limited by whoever has an Entry object
pointing at it.  The Entry object is essentially a token granting access to
its associated file(s).



> As for the entry URLs I'm planning to make the URLs to the dropped entries
> and the filesystem namespace (that only contains the dropped files) expire
> when the page goes away, hoping this would largely simplify the lifetime
> and security issues.
>

I don't think it's possible to do this correctly, because URLs created with
toURL have no equivalent to revokeObjectURL.  A long-running page has no
way to avoid "leaking" these references until the page exits.  Adding a
revoke method for toURL would essentially turn it into URL.createObjectURL.

Needing to revoke URLs when dealing with worker communication also makes it
very hard for users to get it right.  For example, suppose a Window sends a
toURL-generated URL to a Worker.  How do you ensure that the URL is revoked
after the worker has received it and finished converting it back to an
Entry?  The Worker might be killed (eg. due to CPU quotas) at any time,
making avoiding resource leaks very hard.

These are just the usual problems with manual resource management, which
should be avoided if at all possible.  We already have a mechanism that
cleanly avoids all of this, with structured clone and File.
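
To make the leak concrete, here is a toy model, not browser code.
EntryToken and UrlRegistry are hypothetical names invented for this sketch;
the only point is that string URLs need a table that pins each target until
somebody remembers to revoke it, while an object token needs no table.

```typescript
// Toy model: EntryToken and UrlRegistry are hypothetical, not browser APIs.
class EntryToken {
  constructor(readonly path: string) {}
}

// URL strings need a table that pins every target until revoke() is called.
class UrlRegistry {
  private live = new Map<string, EntryToken>();
  private nextId = 0;

  create(token: EntryToken): string {
    const url = "toy-entry:" + this.nextId++;
    this.live.set(url, token);
    return url;
  }
  resolve(url: string): EntryToken | undefined {
    return this.live.get(url);
  }
  revoke(url: string): void {
    this.live.delete(url);
  }
  pinned(): number {
    return this.live.size;
  }
}

const toyRegistry = new UrlRegistry();
const toyUrl = toyRegistry.create(new EntryToken("/dropped/photos"));

// The "worker" converts the URL back to a token, then is killed before it
// can ask anyone to revoke. The registry still pins the entry.
const workerToken = toyRegistry.resolve(toyUrl);
const pinnedAfterWorkerDied = toyRegistry.pinned();

// Only an explicit revoke releases it. With a plain object token there is
// no table at all, and dropping the last reference is enough.
toyRegistry.revoke(toyUrl);
const pinnedAfterRevoke = toyRegistry.pinned();
```

In the model, pinnedAfterWorkerDied stays nonzero until something calls
revoke, which is exactly the step a killed Worker can never perform.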

> > Off-hand, the main issue that directly affects reading is that most
> > non-Windows filesystems can store filenames which can't be represented
> > by a DOMString, such as invalid codepoints (most commonly mismatched
> > encodings).
>
> How do they appear in File.name in existing .files approach?
>

I don't have a Linux browser to check.  I'm guessing it won't inform us
much here, since .files never had to worry about general file access.

> A naive solution in filesystem approach would be silently ignoring
> such files (probably bad) or having in-memory path mapping (would be
> slightly better).  For limited read-only drag-and-drop cases we
> wouldn't need to think about remapping and the mapping could just go
> away when the page goes away, so hopefully implementing such mapping
> wouldn't be that hard.
>

There are probably some cases that we'll just have to accept will never
work perfectly, and design with that in mind.

To take a common case, suppose a script does the following, a standard
method for (relatively) safe file overwriting (the needed flush operations
don't exist here):

1. Create a file with the name filename + ".new".
2. Write the new file contents to the file.
3. Rename filename + ".new" to filename, overwriting the original file.

This is a useful case: it's real-world--I've done this countless times--and
it's a case where unrepresentable filenames affect both reading and
writing, plus the auxiliary operation of renaming.
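
For reference, the three steps above look roughly like this.  This is a
sketch using Node's fs module rather than the FileSystem API under
discussion, purely because Node has the flush call (fsyncSync) that the
API lacks; safeOverwrite and the scratch file names are made up for
illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Replace the contents of `filename` by way of a temporary sibling file.
function safeOverwrite(filename: string, contents: string): void {
  const tmp = filename + ".new";
  // 1. Create the temporary file (truncating any stale leftover).
  const fd = fs.openSync(tmp, "w");
  try {
    // 2. Write the new contents, then flush them to disk.
    fs.writeFileSync(fd, contents);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
  // 3. Rename over the original, replacing it.
  fs.renameSync(tmp, filename);
}

// Usage: overwrite a file in a scratch directory.
const scratchDir = fs.mkdtempSync(path.join(os.tmpdir(), "overwrite-"));
const targetFile = path.join(scratchDir, "data.txt");
fs.writeFileSync(targetFile, "old");
safeOverwrite(targetFile, "new");
const finalContents = fs.readFileSync(targetFile, "utf8");
```

Note that every step takes a filename as a string, which is why an
unrepresentable name gets in the way of all three.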

I suppose the mapping approach could work here.  Associate with each
DirectoryEntry a mapping from the invalid filenames it contains to generated
filenames.  Then, if the invalid filename is "X", and the DOMString mapping
is "MAPPING1", then this would first create the literal filename
"MAPPING1.new", followed by renaming it to the original "invalid" filename
"X".

(In particular, though, I think it should not be possible to create *new*
garbage filenames on people's systems, that didn't exist to begin with.
That is, it should map to the filenames that really exist, not just apply
string escaping.)
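
A minimal sketch of such a mapping, assuming on-disk names arrive as raw
bytes and that "invalid" means "not valid UTF-8".  DirectoryNameMap, expose
and resolve are hypothetical names for this sketch, not anything from the
spec.

```typescript
// Hypothetical per-directory mapping between raw on-disk names (bytes) and
// names a script can see as DOMStrings.
class DirectoryNameMap {
  private rawByShown = new Map<string, Uint8Array>();
  private shownByRaw = new Map<string, string>();
  private counter = 0;

  // Returns the script-visible name for a name that exists on disk,
  // generating "MAPPING1", "MAPPING2", ... when the bytes aren't valid UTF-8.
  expose(raw: Uint8Array): string {
    const key = Array.from(raw).join(",");
    const known = this.shownByRaw.get(key);
    if (known !== undefined) return known;
    let shown: string;
    try {
      shown = new TextDecoder("utf-8", { fatal: true }).decode(raw);
    } catch {
      shown = "MAPPING" + ++this.counter;
    }
    this.shownByRaw.set(key, shown);
    this.rawByShown.set(shown, raw);
    return shown;
  }

  // Maps a script-supplied name back to on-disk bytes.  Names that were
  // never exposed are taken literally, so no new garbage names can appear.
  resolve(shown: string): Uint8Array {
    return this.rawByShown.get(shown) ?? new TextEncoder().encode(shown);
  }
}

// Usage: "f", an invalid byte, "o" stands in for the invalid filename "X".
const nameMap = new DirectoryNameMap();
const invalidRaw = new Uint8Array([0x66, 0xff, 0x6f]);
const shownName = nameMap.expose(invalidRaw);            // generated name
const tempBytes = nameMap.resolve(shownName + ".new");   // literal bytes
const renameTargetBytes = nameMap.resolve(shownName);    // original bytes
```

The sketch ignores the case where a real file is literally named
"MAPPING1"; a real implementation would have to pick generated names that
can't collide.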

This is complex, though, and leads to new questions, like how long the
mappings last if the underlying file is deleted.  As a data point, note
that most Windows applications are unable to access files whose filenames
can't be represented in the current ANSI codepage.  That is, if you're on a
US English system, you can't access filenames with Japanese in them.
(Unicode applications can, but tons of applications in Windows aren't
Unicode; Windows has never made it simple to support Unicode.)  If users
find that reasonable, it might not be worth all this for the even rarer
case of illegal codepoints in Linux.

> Yup, writing side would have tougher issues, and that's why I started
> this proposal only with read-only scenarios.  (I agree that it'd be
> good to give another thought about unsandboxed writing cases though)
>

For what it's worth, I think the only sane approach here is to make an
isolated break from attempting to make everything interoperable, and to
allow the platform's limitations to be visible.  (That is, fail file
creation if the path is too deep or the filename too long for the platform;
succeed with file creation even if it would fail on a different platform,
and so on.)  I think this is just inherent to allowing this sort of access
to real filesystems, and trying to avoid it just causes other, stranger
problems.

(For example, if you prevent creating filenames in Linux which are illegal
in Windows, then things get strange if an "illegal" filename already exists
on a filesystem where it's not actually disallowed.)
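
As a sketch of what "visible platform limitations" means in practice: the
check consults only the rules of the platform the file is being created
on.  The rules here are deliberately simplified assumptions (real
filesystems have more, and limits vary by filesystem), and creationError
is a made-up helper.

```typescript
type Platform = "windows" | "linux";

// Returns a reason the name can't be created on `platform`, or null if it
// can.  Simplified rules, assumed for illustration only.
function creationError(fileName: string, platform: Platform): string | null {
  if (platform === "windows") {
    // Reserved characters and control characters; length in UTF-16 units.
    if (/[<>:"\/\\|?*\x00-\x1f]/.test(fileName)) return "illegal character";
    if (fileName.length > 255) return "name too long";
  } else {
    // Only "/" and NUL are excluded; length is counted in bytes.
    if (/[\/\x00]/.test(fileName)) return "illegal character";
    if (new TextEncoder().encode(fileName).length > 255) return "name too long";
  }
  return null;
}

// The same name succeeds on one platform and fails on the other, and the
// script simply sees whichever answer applies where it is running.
const draftName = "notes: draft?.txt";
const onLinux = creationError(draftName, "linux");
const onWindows = creationError(draftName, "windows");
```

There's no cross-platform intersection of rules here, so a name that
already exists on a Linux disk never becomes "illegal" after the fact.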



On Wed, Nov 16, 2011 at 12:01 PM, Eric U <ericu at google.com> wrote:

>  While the URL format for non-sandboxed files has yet to be worked out,
> I think we need toURL to work no matter where the file comes from.
> It's already the case that an Entry can expire if the underlying file
> is deleted or moved;


But there's no revocation mechanism for toURL URLs.

Also, if toURL URLs to non-sandboxed storage expire with the context they
were created in (which they would have to, I think), we lose a whole category
of use cases covered by structured clone: the ability to persist an access
token.  For example, the spec allows storing a File within a History
state.  That allows history navigation to restore its state properly: if
the user opened a local picture into an image viewer app, navigating
through history can correctly show the files in older history states, and
even restore correctly through browser restarts and session restores.  The
same should apply to Entry and DirectoryEntry.

(Nobody implements this yet, as far as I know, but I hope it'll happen
eventually.  It's a limitation today, and it'll become a more annoying one
as local file access mechanisms like this one are fleshed out.)

Also, if non-sandboxed toURL URLs are same-origin only, then that also
loses functionality that structured cloning allows: using Web Messaging to
pass an access token to a page with a different origin.  (This is much
safer than allowing cross-origin use of the URLs, since it's far easier to
accidentally expose a URL string than to accidentally transfer an object.)

File API has already solved all of this by using structured clone.  I think
it makes a lot of sense to follow its lead.

-- 
Glenn Maynard
Received on Wednesday, 16 November 2011 10:21:08 UTC