Re: File API: Blob and underlying file changes. from Jonas Sicking on 2010-01-21 (public-webapps@w3.org from January to March 2010)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 21 Jan 2010 12:49:28 -0800
To: Eric Uhrhane <ericu@google.com>
Cc: Jian Li <jianli@chromium.org>, Dmitry Titov <dimich@chromium.org>, Darin Fisher <darin@chromium.org>, Chris Prince <cprince@google.com>, arun@mozilla.com, Web Applications Working Group WG <public-webapps@w3.org>
Message-ID: <63df84f1001211249h14f6ca4bxe762a8cacfe43a88@mail.gmail.com>
One thing to remember here is that if we require snapshotting, that
will mean paying potentially very high costs every time the
snapshotting operation is used, potentially copying hundreds of
megabytes of data (think video).

But if we don't require snapshotting, things will only break if the
user takes the action to modify a file after giving the page access to
it.

Also, in general, snapshotting is something that UAs can experiment
with without requiring changes to the spec. Even though File.slice is
a synchronous function, the UA can implement snapshotting without
using synchronous IO. The UA could simply do an asynchronous file copy
in the background. If any read operations are performed on the slice,
those could simply be stalled until the copy is finished, since reads
are always asynchronous.
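
To make the idea above concrete, here is a rough TypeScript sketch of a
slice whose reads are stalled until a background copy completes. It is
only a toy model under stated assumptions: LazySnapshot, sliceFile and
finishCopy are hypothetical names (not part of the File API draft), the
"file" is an in-memory array, and finishCopy is called by hand to
simulate the moment the background copy finishes.

```typescript
// Hypothetical model: none of these names come from the File API draft.
type ReadCallback = (data: number[]) => void;

class LazySnapshot {
  private copy: number[] | null = null;      // set once the background copy is done
  private pending: Array<() => void> = [];   // reads stalled while the copy is in flight

  constructor(private readonly start: number, private readonly end: number) {}

  // Simulates completion of the background copy of the sliced range.
  finishCopy(source: number[]): void {
    this.copy = source.slice(this.start, this.end); // Array.slice makes a real copy
    const stalled = this.pending;
    this.pending = [];
    for (const run of stalled) run();               // release the stalled reads
  }

  // Reads are callback-based, so they can be deferred until the copy exists.
  read(callback: ReadCallback): void {
    const deliver = () => callback((this.copy as number[]).slice());
    if (this.copy === null) {
      this.pending.push(deliver);
    } else {
      deliver();
    }
  }
}

// The slice call itself returns immediately and does no IO.
function sliceFile(start: number, end: number): LazySnapshot {
  return new LazySnapshot(start, end);
}

const file = [1, 2, 3, 4, 5];
const snap = sliceFile(1, 4);
const results: string[] = [];

snap.read((d) => results.push(d.join(",")));  // stalled: copy not finished yet
snap.finishCopy(file);                        // stalled read is delivered here
file[1] = 99;                                 // later change to the "file"
snap.read((d) => results.push(d.join(",")));  // served from the copy
```

Note that this sketch says nothing about a file that changes while the
copy is still in progress; a real UA would have to decide what to do in
that window.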

/ Jonas

On Thu, Jan 21, 2010 at 11:22 AM, Eric Uhrhane <ericu@google.com> wrote:
> On Thu, Jan 21, 2010 at 11:15 AM, Jian Li <jianli@chromium.org> wrote:
>> Treating blobs as snapshots sounds like a reasonable approach and it will
>> make the life of the chunked upload and other scenarios easier. Now the
>> problem is: how do we get the blob (snapshot) out of the file?
>> 1) We can still keep the current relationship between File and Blob. When we
>> slice a file by calling File.slice, a new blob that captures the current
>> file size and modification time is returned. The following Blob operations,
>> like slice, will simply inherit the cached size and modification time. When
>> we access the underlying file data in XHR.send() or FileReader, the
>> modification time will be verified and an exception could be thrown.
>
> This would require File.slice to do synchronous file IO, whereas
> Blob.slice doesn't do that.
>
>> 2) We can remove the inheritance of Blob from File and introduce
>> File.getAsBlob() as dimich suggested. This seems to be more elegant.
>> However, it requires changing the File API spec a lot.
>>
>> On Wed, Jan 20, 2010 at 3:44 PM, Eric Uhrhane <ericu@google.com> wrote:
>>>
>>> On Wed, Jan 20, 2010 at 3:23 PM, Dmitry Titov <dimich@chromium.org> wrote:
>>> > On Wed, Jan 20, 2010 at 2:30 PM, Eric Uhrhane <ericu@google.com> wrote:
>>> >>
>>> >> I think it could.  Here's a third option:
>>> >> Make all blobs, file-based or not, just as async as the blobs in
>>> >> option 2.  They never do sync IO, but could potentially fail future
>>> >> read operations if their metadata is out of date [e.g. reading beyond
>>> >> EOF].  However, expose the modification time on File via an async
>>> >> method and allow the user to pass it in to a read call to enforce
>>> >> "fail if changed since this time".  This keeps all file accesses
>>> >> async, but still allows for chunked uploads without mixing files
>>> >> accidentally.  If we allow users to refresh the modification time
>>> >> asynchronously, it also allows for adding a file to a form, changing
>>> >> the file on disk, and then uploading the new file.  The user would
>>> >> look up the mod time when starting the upload, rather than when the
>>> >> file's selected.
>>> >
>>> > It would be great to avoid sync file I/O on calls like Blob.size. They
>>> > would simply return cached value. Actual mismatch would be detected
>>> > during actual read operation.
>>> > However then I'm not sure how to keep File derived from Blob, since:
>>> > 1) Currently, in FF and WebKit File.fileSize is a sync I/O that returns
>>> > current file size. The current spec says File is derived from Blob and
>>> > Blob has Blob.size property that is likely going to co-exist with
>>> > File.fileSize for a while, for compat reasons. It's weird for file.size
>>> > and file.fileSize to return different things.
>>>
>>> True, but we'd probably want to deprecate file.fileSize anyway and
>>> then get rid of it, since it's synchronous.
>>>
>>> > 2) Currently, xhr.send(file) does not fail and sends the version of the
>>> > file that exists around the time the xhr.send(file) call was issued.
>>> > Since File is also a Blob, xhr.send(blob) would behave the same, which
>>> > means if we want to preserve this behavior the Blob cannot fail an async
>>> > read operation if the file has changed.
>>> > There is a contradiction here. One way to resolve it would be to break
>>> > "File is Blob" and to be able to "capture the File as Blob" by having
>>> > file.getAsBlob(). The latter would make a snapshot of the state of the
>>> > file, to be able to fail subsequent async read operations if the file
>>> > has been changed.
>>> > I've asked a few people around in a non-scientific poll and it seems
>>> > developers expect Blob to be a 'snapshot', reflecting the state of the
>>> > file (or Canvas if we get Canvas.getBlob(...)) at the moment of Blob
>>> > creation.
>>> > Since it's obviously bad to actually copy data, it seems acceptable to
>>> > capture enough information (like mod time) so the read operations later
>>> > can fail if underlying storage has been changed. It feels really strange
>>> > if reading the Blob can yield some data from one version of a file (or
>>> > Canvas) mixed with some data from newer version, without any indication
>>> > that this is happening.
>>> > All that means there is an option 3:
>>> > 3. Treat all Blobs as 'snapshots' that refer to the range of underlying
>>> > data at the moment of creation of the Blob. Blobs produced further by
>>> > Blob.slice() operation inherit the captured state w/o actually verifying
>>> > it against 'live' underlying objects like files. All Blobs can be 'read'
>>> > (or 'sent') via operations that can fail if the underlying content has
>>> > changed. Optionally, expose snapshotTime property and perhaps "read if
>>> > not changed since" parameter to read operations. Do not derive File from
>>> > Blob, rather have File.getAsBlob() that produces a Blob which is a
>>> > snapshot of the file at the moment of call. The advantage here is that
>>> > it removes need for sync operations from Blob and provides mechanism to
>>> > ensure the changing underlying storage is detectable. The disadvantage
>>> > is a bit more complexity and bigger change to File spec.
>>>
>>> That sounds good to me.  If we're treating blobs as snapshots, I
>>> retract my suggestion of the read-if-not-changed-since parameter.  All
>>> reads after the data has changed should fail.  If you want to do a
>>> chunked upload, don't snapshot your file into a blob until you're
>>> ready to start.  Once you've done that, just slice off parts of the
>>> blob, not the file.
>>
>>
>
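
For comparison, the metadata-only snapshot discussed in the quoted thread
(option 3: capture the modification time, let slices inherit it, and fail
reads once the file has changed) might look roughly like the TypeScript
sketch below. Everything here is an assumption made for illustration:
FakeFile, SnapshotBlob and snapshotFile are hypothetical stand-ins (the
last one loosely modelled on the proposed File.getAsBlob()), a counter
replaces the real modification time, and reads are synchronous only to
keep the example short, whereas the thread assumes asynchronous reads.

```typescript
// Hypothetical in-memory stand-in for a file on disk.
class FakeFile {
  modifiedAt = 0;
  constructor(public bytes: number[]) {}

  overwrite(bytes: number[]): void {
    this.bytes = bytes;
    this.modifiedAt += 1; // stand-in for a changed modification time
  }
}

// A blob that remembers the file state it was created from, without copying data.
class SnapshotBlob {
  readonly size: number; // cached at creation, never re-read from the file

  constructor(
    private readonly file: FakeFile,
    private readonly start: number,
    private readonly end: number,
    private readonly snapshotTime: number
  ) {
    this.size = end - start;
  }

  // Slices inherit the captured state and do not look at the live file.
  slice(start: number, end: number): SnapshotBlob {
    const from = Math.min(this.start + start, this.end);
    const to = Math.min(this.start + end, this.end);
    return new SnapshotBlob(this.file, from, to, this.snapshotTime);
  }

  // The mismatch is only detected when the data is actually read.
  read(): number[] {
    if (this.file.modifiedAt !== this.snapshotTime) {
      throw new Error("underlying file changed since snapshot");
    }
    return this.file.bytes.slice(this.start, this.end);
  }
}

// Hypothetical analogue of the proposed File.getAsBlob().
function snapshotFile(file: FakeFile): SnapshotBlob {
  return new SnapshotBlob(file, 0, file.bytes.length, file.modifiedAt);
}

// Chunked upload: snapshot once, then slice chunks off the blob, not the file.
const disk = new FakeFile([10, 20, 30, 40]);
const blob = snapshotFile(disk);
const firstChunk = blob.slice(0, 2).read();

disk.overwrite([1, 2, 3, 4]); // the user edits the file mid-upload

let secondChunkFailed = false;
try {
  blob.slice(2, 4).read();
} catch (e) {
  secondChunkFailed = true; // chunks from two file versions are never mixed
}
```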
Received on Thursday, 21 January 2010 20:50:21 UTC