Re: File API: Blob and underlying file changes.

I'm new to this list and to all the W3C work, so I might be completely
wrong. That said, here is my take.

Dmitry posed a simple question: should a file's blob be kept in sync
with the file's content on disk, or not? He did not get a "yes" or "no"
answer but instead triggered a thread of nearly 30 posts which, as I
see it, shows a certain lack of definition so far.
This is what I think, after having read only the draft and this thread:

** The "mutating blob"
The idea of keeping the disk file in sync with its working version,
the "mutating blob", is too risky and impractical. IMO doing so will
raise a lot of issues while solving none. What is the scenario that
calls for such a feature? I can't see any, but I can certainly see
lots of scenarios where data stability is desirable.
For example, a disk file holding the data of an active relational
database. The scenario is uploading a big file while possibly many
concurrent applications introduce changes anywhere in the file, every
few seconds. I know that this example is contrived, but there might be
many others with similar characteristics, albeit not so clear and
dramatic. In this scenario the UA might be completely busy trying to
keep up with the changes, as if under a DoS attack.
Another requirement for a database file is that it has to be
consistent, so sending a slice of one version lumped together with a
slice of a later version is unacceptable.
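To make the consistency point concrete, here is a rough sketch of the
kind of chunked upload I have in mind. It uses slice() with (start,
end) semantics; the chunk size and the uploadChunk() transport are
placeholders of mine, not anything from the draft.

    // Sketch of a chunked upload; uploadChunk() stands in for whatever
    // XHR-based transport the application uses.
    const CHUNK_SIZE = 1024 * 1024; // 1 MiB per request, an arbitrary choice

    async function uploadInChunks(
      file: Blob,
      uploadChunk: (part: Blob, offset: number) => Promise<void>
    ): Promise<void> {
      for (let offset = 0; offset < file.size; offset += CHUNK_SIZE) {
        const part = file.slice(offset, offset + CHUNK_SIZE);
        // With a "mutating blob", the bytes read here could belong to a
        // newer version of the file than the chunks already sent, so the
        // server would receive an inconsistent mixture of versions.
        await uploadChunk(part, offset);
      }
    }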
If, and only if, there is an unavoidable requirement for such a
feature, then I strongly suggest that the API specify a flag informing
the application that the original file changed during the operation,
without doing anything else.
Let the developer decide if she wants to take any action, instead of
trying to solve for her, in advance, a problem that might not exist.
In one post Dmitry says that he found out that "developers expect Blob
to be a 'snapshot'". This is the way to go: talking with developers
and also with software architects who already solved issues like this
years ago.
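Purely as illustration, and with names that are invented here rather
than taken from any draft, such a flag could be surfaced like this:

    // Hypothetical shape for the flag suggested above; readWithChangeFlag()
    // and sourceModified are made-up names, not part of the File API draft.
    interface ReadResult {
      data: ArrayBuffer;       // the snapshot that was actually read
      sourceModified: boolean; // true if the on-disk file changed meanwhile
    }
    declare function readWithChangeFlag(blob: Blob): Promise<ReadResult>;

    async function example(blob: Blob): Promise<void> {
      const result = await readWithChangeFlag(blob);
      if (result.sourceModified) {
        // The developer, not the UA, decides what to do here:
        // warn the user, re-read the file, abort the upload, etc.
      }
    }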

** Locking
What's wrong with file locking? Maybe it was discussed in prior
threads I didn't read, because it seems to have already been discarded.
But locking is the universally accepted solution in multitasking
operating systems. The API should lock the files to prevent them from
being written by other applications, whether for a short while or for
a long time.
It is a must, to make the read atomic (atomicity is not merely
desirable but a must). The steps, sketched in code after this list,
would be:
    1 the UA SHOULD lock the file (a mandatory lock preventing writes
by other apps) and open it
    . 1a the file refuses to be locked
    .. 1a1 the operation fails with a "file is locked" error
    .. 1a2 the use case fails
    2 the UA uses the file
    3 the UA unlocks the file by issuing a close method
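A minimal sketch of that flow; lockAndOpen() and LockedFile are names
I am inventing here for illustration, not part of any draft.

    // Hypothetical locking flow mirroring the steps above.
    interface LockedFile {
      readAll(): Promise<ArrayBuffer>;                            // whole file
      read(offset: number, length: number): Promise<ArrayBuffer>; // partial read
      close(): void;                                              // releases the lock
    }
    declare function lockAndOpen(file: File): Promise<LockedFile>;

    async function readAtomically(file: File): Promise<ArrayBuffer> {
      // Step 1: lock and open. If the file refuses the lock (1a), this
      // rejects with a "file is locked" error (1a1) and the use case
      // fails (1a2).
      const locked = await lockAndOpen(file);
      try {
        return await locked.readAll(); // step 2: use the file
      } finally {
        locked.close();                // step 3: unlock by closing
      }
    }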
For small files this does not make a difference. But what happens if
the file is huge? In this case leave the problem to the developer, the
one who knows the environment and the particular requirements.
For example, the developer could choose to swiftly copy the file into
a blob and close it to release the brief lock if it is a busy file
(database ...), or keep it locked during a lengthy transfer if the
file content is static (video, backup ...); both choices are sketched
below.
It is not possible to solve all the developer's issues at this point;
we can only provide tools, the simpler the better, for the developers
to leverage.
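Continuing the same invented lockAndOpen() sketch, and with send() as
a placeholder transport, the two choices would look roughly like this:

    // Busy file (e.g. a live database): copy swiftly, then release the lock.
    async function copyThenRelease(file: File): Promise<ArrayBuffer> {
      const locked = await lockAndOpen(file);
      try {
        return await locked.readAll(); // lock held only while copying to memory
      } finally {
        locked.close();
      }
    }

    // Static file (video, backup ...): hold the lock for the whole transfer.
    async function streamWhileLocked(
      file: File,
      send: (chunk: ArrayBuffer) => Promise<void>,
      chunkSize = 1024 * 1024
    ): Promise<void> {
      const locked = await lockAndOpen(file);
      try {
        for (let offset = 0; offset < file.size; offset += chunkSize) {
          await send(await locked.read(offset, chunkSize)); // lock held throughout
        }
      } finally {
        locked.close();
      }
    }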
For very special cases there might be an option locking="no" to open a
file while allowing other applications to change it.
Intuitively I perceive this as a security hole. Such a file could
become a communication channel between the computer's contents and the
web. A trojan could repeatedly write information into the file for the
UA to send to the bad guy's server, for example by installing a
listener in the OS that detects when the user selects a file.
As I see it, when the user allows the UA to grab a file she means
"what the file contains right now", and we MUST not deceive her.

** Avoid involving technology limitations in the design
The File API is sort of an impedance adapter between the latency of
Internet connections and the speed of disk drives (disks or whatever,
think of the future).
As such, it must be able to handle any speed difference. In the future
the difference might even change its sign.
Also, the API must consider that what today is regarded as "big" might
be "regular" in the future and "small" after a while. For example,
making a memory copy of a 300 MB file is possible today but was not
when computers, even mainframes, sported only a few MB of RAM.
The "virtual memory" that most OSs have is an existing implementation
of a in-memory file backed by disk storage. This issue is already
solved, since the seventies. A program, like the UA, can pump lots of
data into RAM and the OS will use the disk to store the bytes in case
of a shortage in real RAM. This way computers, like PCs, appear to
have twice as much RAM as they have physically installed, at the cost
of some performance loss that is completely compatible with Internet
latency. Many PCs built today have 2 trhru 4GB of real RAM, so they
appear to have 4 to 8GB providing lots of headroom to manage somehow
big files. It the files  to handle were bigger, then it's the
developer's responsibility to manage the issue, for example telling
the user not to upload a file.
It should not be a "making a copy" vs. "using the original data"
issue. These are different scenarios, and which to use should be up to
the developer, who knows the requirements and the environment of her
application.
In the original RFC 1867 specification, if the user uploaded a very
big file she had to wait for ages for the whole file to upload before
getting an error message. The File API comes to the rescue, allowing
the UA to report the problem before the upload starts.
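As an illustration, a minimal sketch of that early check, assuming an
arbitrary 2 GB limit and an input element with id "upload":

    // Reject an oversize file on the client, before any bytes go out.
    const MAX_UPLOAD_BYTES = 2 * 1024 * 1024 * 1024; // arbitrary example limit

    const input = document.querySelector<HTMLInputElement>("#upload")!;
    input.addEventListener("change", () => {
      const file = input.files?.[0];
      if (file && file.size > MAX_UPLOAD_BYTES) {
        alert("This file is too large to upload."); // fail before uploading
      }
    });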

** The scenario I'm working in
My scenario for the blob functionality is related to image uploading.
The UI gets a bunch of files containing images of various sizes.
Somehow the client (UA) resizes them to fit some web application
limits before the upload like limiting width to 800px and lowering
quality so the size is below 100K, and to do so it stores each image
in a blob.
The user is looking at the image in the UI during this process and she
does not expect it to change due to local file action. If she wants a
new version of some image then she "reloads" them. For example she
modifies the image colors and saves a new version.
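For illustration only, a rough sketch of that client-side resize using
today's createImageBitmap() and canvas.toBlob() calls; the 0.7 quality
factor is a guess, and a real application would iterate until the
result falls below 100 KB.

    // Scale an image file to at most 800 px wide and re-encode it as JPEG.
    async function resizeToBlob(
      file: File,
      maxWidth = 800,
      quality = 0.7
    ): Promise<Blob> {
      const bitmap = await createImageBitmap(file);
      const scale = Math.min(1, maxWidth / bitmap.width);
      const canvas = document.createElement("canvas");
      canvas.width = Math.round(bitmap.width * scale);
      canvas.height = Math.round(bitmap.height * scale);
      canvas.getContext("2d")!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
      return new Promise((resolve, reject) =>
        canvas.toBlob(
          (blob) => (blob ? resolve(blob) : reject(new Error("encoding failed"))),
          "image/jpeg",
          quality
        )
      );
    }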

** This is the only chance to make changes to the API
Changing the API should not be a binding constraint at this stage.
Joshua Bloch, widely regarded as an authority on API design, says that
"Public APIs are forever - one chance to get it right" in this
context: http://lcsd05.cs.tamu.edu/slides/keynote.pdf
If it is possible to change it for the better, it must be done before
it's too late. Otherwise millions of developers in the future will
lose parts of their lives struggling against definitions, like the DOM
differences that literally swallowed entire lives in terms of time
spent doing avoidable work. Bloch publicly shares his knowledge about
API design; the documents can be found by searching for "Joshua Bloch
API design". I apologize if this information is too obvious for some;
I bring it here because I consider it valuable to keep in mind, at
least for me.
For example, Bloch encourages basing the design on user needs and use
cases, and I was unable to find any such formal document (I recognize
that I spent limited time on my search). He says "Gather
Requirements–with a Healthy Degree of Skepticism", meaning that the
users MAY propose solutions but that the last word MUST come from a
knowledgeable professional. In this thread I didn't see references to
users, save for two postings by Dmitry.

In short:
- trash the mutating blob,
- lock the file when opening it (by default) and release it upon close,
- let the developer decide between copying the file into memory and
slowly reading it to feed the upload,
- let's not design around technological limitations, and
- let's get it right at the first and only attempt.

Respectfully,
--
Juan Lanus
Globant

Received on Saturday, 23 January 2010 19:05:38 UTC