Re: File API: Blob and underlying file changes. from Dmitry Titov on 2010-01-21 (public-webapps@w3.org from January to March 2010)

From: Dmitry Titov <dimich@chromium.org>
Date: Thu, 21 Jan 2010 13:18:20 -0800
To: Jonas Sicking <jonas@sicking.cc>
Cc: Eric Uhrhane <ericu@google.com>, Jian Li <jianli@chromium.org>, Darin Fisher <darin@chromium.org>, Chris Prince <cprince@google.com>, arun@mozilla.com, Web Applications Working Group WG <public-webapps@w3.org>
Message-ID: <28040fc61001211318w5d9c097ake80c422af1a689d9@mail.gmail.com>
I think the 'snapshotting' discussed above does not imply an actual copy of
data, sync or async. The proposal seems to be to 'snapshot' enough
information (in the case of a file on disk, the modification time is enough) so
that later read operations can fail reliably if the Blob is out of sync with
underlying storage. Making copies of large video files will probably never
be a feasible option, both for size/time reasons and because of the potentially
quite complicated lifetime of such copies... We might provide a separate API for
file manipulation that can be used to make temporary copies of files in
cases where it is a good idea, and that could be used in conjunction with
Blob API perhaps, but it seems to be a separate functionality. It is also
interesting to think of Blobs backed by some other objects, Canvas for
example.

Perhaps 'snapshotting' is not an ideal name, but I think the discussion above
means it as "capture the state of the underlying object so the data can be
read in the future", but w/o a guarantee that the read operation will
actually succeed, since there cannot be a guarantee that the underlying object
is still there.
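
To make the idea concrete, here is a rough sketch of a snapshot that captures
only the modification time and fails later reads if the file has changed. It
is not a proposed API: FakeFile, SnapshotBlob, fromFile and isCurrent are
names made up for illustration, and the "file" is just an in-memory object.
Only snapshotTime echoes a name from the thread.

```typescript
// Hypothetical in-memory stand-in for a file on disk (not a real browser API).
interface FakeFile {
  data: Uint8Array;
  modifiedAt: number; // modification time in ms
}

// A Blob-like snapshot: records a byte range plus the modification time
// seen at creation, and never copies the data up front.
class SnapshotBlob {
  private readonly file: FakeFile;
  private readonly start: number;
  private readonly end: number;
  readonly snapshotTime: number;

  constructor(file: FakeFile, start: number, end: number, snapshotTime: number) {
    this.file = file;
    this.start = start;
    this.end = end;
    this.snapshotTime = snapshotTime;
  }

  // Capture the current size and modification time of the file.
  static fromFile(file: FakeFile): SnapshotBlob {
    return new SnapshotBlob(file, 0, file.data.length, file.modifiedAt);
  }

  // Cached value; does not touch the underlying file.
  get size(): number {
    return this.end - this.start;
  }

  // Slices inherit the captured state without re-checking the file.
  slice(start: number, length: number): SnapshotBlob {
    const s = Math.min(this.start + Math.max(start, 0), this.end);
    const e = Math.min(s + Math.max(length, 0), this.end);
    return new SnapshotBlob(this.file, s, e, this.snapshotTime);
  }

  // True while the file still has the modification time seen at snapshot.
  isCurrent(): boolean {
    return this.file.modifiedAt === this.snapshotTime;
  }

  // The mismatch is only detected here, at read time.
  async read(): Promise<Uint8Array> {
    if (!this.isCurrent()) {
      throw new Error("underlying file changed since snapshot");
    }
    return this.file.data.slice(this.start, this.end);
  }
}
```

With this shape, size and slice() stay cheap and synchronous, and a changed
file shows up as a rejected read() rather than as mixed data.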

On Thu, Jan 21, 2010 at 12:49 PM, Jonas Sicking <jonas@sicking.cc> wrote:

> One thing to remember here is that if we require snapshotting, that
> will mean paying potentially very high costs every time the
> snapshotting operation is used. Potentially copying hundreds of
> megabytes of data (think video).
>
> But if we don't require snapshotting, things will only break if the
> user takes the action to modify a file after giving the page access to
> it.
>
> Also, in general snapshotting is something that UAs can experiment
> with without requiring changes to the spec. Even though File.slice is
> a synchronous function, the UA can implement snapshotting without
> using synchronous IO. The UA could simply do an asynchronous file copy
> in the background. If any read operations are performed on the slice
> those could simply be stalled until the copy is finished since reads
> are always asynchronous.
>
> / Jonas
>
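
For comparison, the copy-in-the-background variant described just above could
look roughly like the sketch below. SourceFile and CopiedSlice are
hypothetical names, the "file" is an in-memory object, and a microtask stands
in for real asynchronous file IO. The sketch does not address a modification
that lands between slice creation and the end of the copy.

```typescript
// Hypothetical stand-in for file contents held by the UA (not a real browser API).
interface SourceFile {
  data: Uint8Array;
}

class CopiedSlice {
  private readonly copy: Promise<Uint8Array>;

  constructor(file: SourceFile, start: number, end: number) {
    // Returns immediately; the copy itself runs later as a microtask,
    // standing in for an asynchronous background file copy.
    this.copy = Promise.resolve().then(() => file.data.slice(start, end));
  }

  // Reads wait on the copy, so they are stalled until it has finished.
  read(): Promise<Uint8Array> {
    return this.copy;
  }
}
```

Once the copy has completed, later changes to the source no longer affect
reads, which is exactly where the cost of copying the data comes from.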
> On Thu, Jan 21, 2010 at 11:22 AM, Eric Uhrhane <ericu@google.com> wrote:
> > On Thu, Jan 21, 2010 at 11:15 AM, Jian Li <jianli@chromium.org> wrote:
> >> Treating blobs as snapshots sounds like a reasonable approach and it will
> >> make the life of the chunked upload and other scenarios easier. Now the
> >> problem is: how do we get the blob (snapshot) out of the file?
> >> 1) We can still keep the current relationship between File and Blob. When we
> >> slice a file by calling File.slice, a new blob that captures the current
> >> file size and modification time is returned. The following Blob operations,
> >> like slice, will simply inherit the cached size and modification time. When
> >> we access the underlying file data in XHR.send() or FileReader, the
> >> modification time will be verified and an exception could be thrown.
> >
> > This would require File.slice to do synchronous file IO, whereas
> > Blob.slice doesn't do that.
> >
> >> 2) We can remove the inheritance of Blob from File and introduce
> >> File.getAsBlob() as dimich suggested. This seems to be more elegant.
> >> However, it requires changing the File API spec a lot.
> >>
> >> On Wed, Jan 20, 2010 at 3:44 PM, Eric Uhrhane <ericu@google.com> wrote:
> >>>
> >>> On Wed, Jan 20, 2010 at 3:23 PM, Dmitry Titov <dimich@chromium.org> wrote:
> >>> > On Wed, Jan 20, 2010 at 2:30 PM, Eric Uhrhane <ericu@google.com> wrote:
> >>> >>
> >>> >> I think it could.  Here's a third option:
> >>> >> Make all blobs, file-based or not, just as async as the blobs in
> >>> >> option 2.  They never do sync IO, but could potentially fail future
> >>> >> read operations if their metadata is out of date [e.g. reading beyond
> >>> >> EOF].  However, expose the modification time on File via an async
> >>> >> method and allow the user to pass it in to a read call to enforce
> >>> >> "fail if changed since this time".  This keeps all file accesses
> >>> >> async, but still allows for chunked uploads without mixing files
> >>> >> accidentally.  If we allow users to refresh the modification time
> >>> >> asynchronously, it also allows for adding a file to a form, changing
> >>> >> the file on disk, and then uploading the new file.  The user would
> >>> >> look up the mod time when starting the upload, rather than when the
> >>> >> file's selected.
> >>> >
> >>> > It would be great to avoid sync file I/O on calls like Blob.size. They
> >>> > would simply return cached value. Actual mismatch would be detected during
> >>> > actual read operation.
> >>> > However then I'm not sure how to keep File derived from Blob, since:
> >>> > 1) Currently, in FF and WebKit File.fileSize is a sync I/O that returns
> >>> > current file size. The current spec says File is derived from Blob and
> >>> > Blob has Blob.size property that is likely going to co-exist with
> >>> > File.fileSize for a while, for compat reasons. It's weird for file.size and
> >>> > file.fileSize to return different things.
> >>>
> >>> True, but we'd probably want to deprecate file.fileSize anyway and
> >>> then get rid of it, since it's synchronous.
> >>>
> >>> > 2) Currently, xhr.send(file) does not fail and sends the version of the
> >>> > file that exists around the time the xhr.send(file) call was issued. Since
> >>> > File is also a Blob, xhr.send(blob) would behave the same, which means if we
> >>> > want to preserve this behavior the Blob can not fail async read operation if
> >>> > file has changed.
> >>> > There is a contradiction here. One way to resolve it would be to break
> >>> > "File is Blob" and to be able to "capture the File as Blob" by having
> >>> > file.getAsBlob(). The latter would make a snapshot of the state of the
> >>> > file, to be able to fail subsequent async read operations if the file has
> >>> > been changed.
> >>> > I've asked a few people around in a non-scientific poll and it seems
> >>> > developers expect Blob to be a 'snapshot', reflecting the state of the
> >>> > file (or Canvas if we get Canvas.getBlob(...)) at the moment of Blob
> >>> > creation.
> >>> > Since it's obviously bad to actually copy data, it seems acceptable to
> >>> > capture enough information (like mod time) so the read operations later
> >>> > can fail if underlying storage has been changed. It feels really strange if
> >>> > reading the Blob can yield some data from one version of a file (or
> >>> > Canvas) mixed with some data from newer version, without any indication that
> >>> > this is happening.
> >>> > All that means there is an option 3:
> >>> > 3. Treat all Blobs as 'snapshots' that refer to the range of underlying
> >>> > data at the moment of creation of the Blob. Blobs produced further by
> >>> > Blob.slice() operation inherit the captured state w/o actually verifying
> >>> > it against 'live' underlying objects like files. All Blobs can be 'read'
> >>> > (or 'sent') via operations that can fail if the underlying content has
> >>> > changed.
> >>> > Optionally, expose snapshotTime property and perhaps "read if not
> >>> > changed since" parameter to read operations. Do not derive File from Blob,
> >>> > rather have File.getAsBlob() that produces a Blob which is a snapshot of the
> >>> > file at the moment of call. The advantage here is that it removes need for
> >>> > sync operations from Blob and provides mechanism to ensure the changing
> >>> > underlying storage is detectable. The disadvantage is a bit more
> >>> > complexity and bigger change to File spec.
> >>>
> >>> That sounds good to me.  If we're treating blobs as snapshots, I
> >>> retract my suggestion of the read-if-not-changed-since parameter.  All
> >>> reads after the data has changed should fail.  If you want to do a
> >>> chunked upload, don't snapshot your file into a blob until you're
> >>> ready to start.  Once you've done that, just slice off parts of the
> >>> blob, not the file.
> >>
> >>
> >
>
Received on Thursday, 21 January 2010 21:18:49 UTC