Re: Detailed review of 4.11. Client-side persistent storage from Ian Hickson on 2007-12-11 (public-html@w3.org from December 2007)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 11 Dec 2007 02:03:43 +0000 (UTC)
To: Mihai Sucan <mihai.sucan@gmail.com>
Cc: public-html <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0712110147050.7107@hixie.dreamhostps.com>
(This e-mail also has some replies to e-mails on the DB API.)

On Mon, 17 Sep 2007, Mihai Sucan wrote:
> 
> 1. In section 4.11.3. "The StorageItem interface" [2], I would suggest:

This section is now gone.


> a) the StorageItem objects should also have two read-only attributes: 
> dateCreated and dateModified, as Date objects (or UNIX timestamps).
> 
> One use is that some web applications might disregard/purge values 
> which are too old. Currently, one would need to store two separate items 
> for this kind of tracking. I'm not asking for a dateExpires, like 
> cookies have.
> 
> Also, I'm thinking UAs which will implement persistent storage will 
> obviously internally save the dateCreated and dateModified values - 
> they'll use these two to automatically purge items which are too old 
> (such that the UA doesn't slow down too much, performance issues, and 
> privacy issues). Basically, I only want these two values exposed to the 
> web applications as well.
> 
> It doesn't really make sense to leave this out of the spec. There are 
> tons of cases where timestamps are used: files and folders on 
> filesystems have the created and/or modified date as metadata, 
> databases, tables in relational databases (like MySQL) have created and 
> modified date as metadata, emails, etc.

I think if people want to have this information, they should use the SQL 
API. Do you think this is acceptable?
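
As a minimal sketch of the bookkeeping an author would then do themselves: 
the record shape below (value, created, modified) stands in for columns of 
a SQL table, and is modelled with a Map only so that it runs anywhere. The 
names put and purgeOlderThan are hypothetical, not part of any API.

```javascript
// Hypothetical record store keeping created/modified times next to each
// value, as one might keep them in columns of a SQL table.
function put(store, key, value, now) {
  const existing = store.get(key);
  store.set(key, {
    value,
    created: existing ? existing.created : now, // keep the first timestamp
    modified: now,
  });
}

// Remove every record last modified before `cutoff`; returns removed keys.
function purgeOlderThan(store, cutoff) {
  const removed = [];
  for (const [key, record] of store) {
    if (record.modified < cutoff) {
      store.delete(key);
      removed.push(key);
    }
  }
  return removed;
}

const store = new Map();
put(store, "prefs", "compact", 1000);
put(store, "draft", "hello", 2000);
put(store, "draft", "hello again", 9000); // updates modified, keeps created

const purged = purgeOlderThan(store, 5000);
console.log(purged);             // keys whose last change predates 5000
console.log(store.get("draft")); // survives: modified at 9000
```

Timestamps are passed in explicitly here to keep the sketch deterministic; 
a real application would presumably use Date.now().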


> b) the StorageItem object could also have an attribute defining lastURL: 
> the absolute URL of the last page (without any query parameters) which 
> modified the value of the object.
> 
> This is just an idea - I don't consider this a requirement (as the above 
> one). It would be a nice feature.
> 
> But then ... both of the suggestions above enable even more tracking - 
> privacy concerns. Maybe enable these attributes only for secure pages?

Since we've removed shared access, and StorageItems in general, this is 
rather moot at this point. :-)


> c) Also a question: the storage event is defined just as a notification 
> which tells the potential listeners that the storage for the domain has 
> been modified. Why wasn't the storage event defined as a notification 
> which tells exactly what changed? As in, include the StorageItem object 
> itself as well. Would that be a security/privacy concern? It shouldn't 
> be: the scripts can access the StorageItem, anyway.

The concern is that there may be a lot of changes (especially with session 
storage). I'd be interested in hearing from authors who wish to use this 
API, though. What do you envisage doing with it that may require detailed 
notifications of changes across windows?


> Currently, say two web applications would need to share *several* 
> StorageItem objects. If application A changes something of interest for 
> application B, then the listener within page B would have to search 
> through the list of StorageItem objects of the domain where application 
> A resides. Only the domain is known, given the "domain" attribute 
> defined within the storage event. Also, checking what was changed is 
> even harder given there's no dateModified attribute defined for 
> StorageItem objects. If performance is an issue for both applications, 
> they would have to use cross-site messaging to notify each other about 
> the specific changes. That shouldn't be needed for simple storage 
> updates - only for complex communication between two (or more) 
> applications. Cross-site messaging would also add a lot more complexity, 
> because the involved application must have their "communication 
> protocol" defined.

I agree that it's suboptimal, but I don't want to make the API complex 
unless it's truly needed.
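
To make the cost concrete, here is a rough sketch of what a listener has 
to do when the event only says "something changed": keep a snapshot and 
compare it with the current contents. The snapshot shape (a plain object 
of key/value strings) and the name diffSnapshots are assumptions made for 
illustration.

```javascript
// Own-property check that is safe for keys such as "toString".
const has = (object, key) =>
  Object.prototype.hasOwnProperty.call(object, key);

// Compare two plain-object snapshots of a storage area and report which
// keys were added, changed, or removed between them.
function diffSnapshots(before, after) {
  const added = [];
  const changed = [];
  const removed = [];
  for (const key of Object.keys(after)) {
    if (!has(before, key)) added.push(key);
    else if (before[key] !== after[key]) changed.push(key);
  }
  for (const key of Object.keys(before)) {
    if (!has(after, key)) removed.push(key);
  }
  return { added, changed, removed };
}

const before = { theme: "dark", draft: "hello", lang: "en" };
const after = { theme: "light", draft: "hello", volume: "7" };
const result = diffSnapshots(before, after);
console.log(result);
```

The scan is linear in the number of items, which is the overhead being 
weighed against a more detailed event.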


> 2. In section 4.11.5. "The globalStorage attribute" [3], the definition 
> of the namedItem() method [4] has a typo:
> 
> "The namedItem(domain) method tries to *returns* a Storage object 
> associated with the given domain, according to the rules that follow."
> 
> Correction: return.

This is now gone.


> 3. In section 4.11.7.1. "Disk space" [5]:
> 
> "If the storage area space limit is reached during a setItem() call, the 
> user agent should raise an exception."
> 
> This is too ambiguous. This can cause inconsistencies between 
> implementations.
> 
> I'd recommend defining that as a MUST, including which specific 
> exception to be raised.

Done.
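
For authors, a guaranteed exception means a failed write can be handled 
with an ordinary try/catch. A minimal sketch follows; the storage object 
is a hypothetical in-memory stand-in with a character quota, not the 
spec's Storage interface, and the error name "QuotaExceededError" is an 
assumption made for the example.

```javascript
// Hypothetical in-memory stand-in for a storage area with a quota counted
// in characters. Not the spec's Storage interface.
function createTinyStorage(quotaChars) {
  const items = new Map();
  let used = 0;
  return {
    setItem(key, value) {
      const previous = items.has(key) ? key.length + items.get(key).length : 0;
      const next = key.length + value.length;
      if (used - previous + next > quotaChars) {
        const error = new Error("storage quota exceeded");
        error.name = "QuotaExceededError"; // assumed name, for illustration
        throw error;
      }
      items.set(key, value);
      used = used - previous + next;
    },
    getItem(key) {
      return items.has(key) ? items.get(key) : null;
    },
  };
}

// Returns true if the value was stored, false if the quota was hit.
function trySetItem(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (error) {
    if (error.name === "QuotaExceededError") return false;
    throw error; // anything else is unexpected, so let it propagate
  }
}

const storage = createTinyStorage(10);
const first = trySetItem(storage, "a", "1234");     // uses 5 of 10
const second = trySetItem(storage, "b", "1234567"); // would need 8 more
console.log(first, second, storage.getItem("b"));
```

A failed write leaves the earlier contents untouched, so the caller can 
free space and retry or tell the user.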


> How are scripts supposed to work when the "disk quota is full"? That 
> should be defined in the spec.

How do you mean?


> An idea would be to have a new boolean attribute for the Storage object: 
> isWritable. This would be false when "disk quota is full", or true 
> otherwise.

Typically the disk quota is never actually exactly full. e.g. the usage 
could be at 995 bytes, the quota at 1000 bytes, so adding an ASCII string 
of 4 characters could work but adding a Chinese string of 4 characters 
could fail (e.g. if the system used UTF-8).
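
The arithmetic can be checked with a short sketch. It assumes the 
hypothetical numbers above (995 bytes used, a 1000-byte quota) and a 
system that counts UTF-8 bytes; TextEncoder, as found in current browsers 
and Node.js, is used only to measure the encoded length.

```javascript
// Hypothetical numbers from the example: 995 bytes used, 1000-byte quota.
const QUOTA_BYTES = 1000;
const usedBytes = 995;

// Byte length of a string when encoded as UTF-8.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

// True if storing `text` would stay within the quota.
function fits(text) {
  return usedBytes + utf8Length(text) <= QUOTA_BYTES;
}

const ascii = "abcd";      // 4 characters, 1 byte each in UTF-8
const chinese = "中文字符"; // 4 characters, 3 bytes each in UTF-8

console.log(utf8Length(ascii), fits(ascii));     // 4 true
console.log(utf8Length(chinese), fits(chinese)); // 12 false
```

So a boolean "is writable" flag cannot be answered without knowing what is 
about to be written.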


> 4. In section 4.11.8.1. "User tracking" [6], source code HTML comment:
> 
> "<!-- XXX should there be an explicit way for sites to state when
>     data should expire? as in
>     globalStorage['example.com'].expireData(365); ? -->"
> 
> I did think of this feature, while reading through the spec. I don't 
> think this is a high priority feature. It would be nice, but define it 
> such that only scripts running from example.com can use the expireData() 
> method on the Storage object. If scripts on any other domain 
> (sub.example.com or "com") try to call 
> globalStorage['example.com'].expireData(), raise a security exception.

I've left this out for now.


> 5. In section 4.11.8.4. "Cross-protocol and cross-port attacks" [7]:
> 
> "Big Issue: What about if someone is able to get a server up on a port, 
> and can then send people to that URI? They could steal all the data with 
> no further interaction. How about putting the port number at the end of 
> the string being compared? (Implicitly.)"
> 
> I strongly recommend putting the port number at the end of the string 
> being compared. My recommendation is not based only on security-related 
> concerns, but also practical concerns.
> 
> It's very wrong to assume the same application runs on a different port, 
> on the same domain. It's obviously a different web application.
> 
> Web developers (including me) commonly host multiple web 
> sites/applications on the same server, on varying port numbers. It would 
> be very confusing and annoying to have the same persistent storage 
> across different ports.
> 
> The current definition of the persistent storage is completely 
> eliminating the use of port numbers - which is very wrong.

This is now moot with the use of same-origin restrictions only.


> 6. Personally I find the overall storage idea very good. However, I also 
> find it far too "liberal" - regarding security.
> 
> Here's what I suggest, something maybe simple, yet, this is something I 
> would personally use, in many cases:
> 
> Define a third argument for the setItem() method of the Storage object. 
> Name it "private", of boolean type. If the author sets this optional 
> argument to true, then the StorageItem object is flagged as private.
> 
> StorageItem objects flagged as private will *only* be available to 
> scripts on the *same* domain (same origin), not on any subdomains, not 
> on higher-level domains. For example: if a script on "music.example.com" 
> creates a private StorageItem object named "myTest", other scripts on 
> the same domain will be able to read it. Yet, scripts which run on 
> "beta.music.example.com" or "example.com" will raise a security 
> exception if they try to read/write the "myTest" StorageItem object.

I've made this "private" mode the only mode.
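
In practice that means scheme, host and port must all match, which also 
covers the port question from point 5. A minimal sketch of the comparison, 
assuming the URL class available in current browsers and Node.js; 
sameOrigin is a hypothetical helper, not something the spec defines.

```javascript
// Two URLs share an origin when scheme, host and port all match.
// URL.origin normalises default ports (e.g. :80 for http) away.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

const base = "http://music.example.com/player";

console.log(sameOrigin(base, "http://music.example.com/settings"));   // true
console.log(sameOrigin(base, "http://music.example.com:80/settings")); // true
console.log(sameOrigin(base, "http://beta.music.example.com/"));      // false
console.log(sameOrigin(base, "http://example.com/"));                 // false
console.log(sameOrigin(base, "http://music.example.com:8080/"));      // false
console.log(sameOrigin(base, "https://music.example.com/"));          // false
```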


On Fri, 12 Oct 2007, Mihai Sucan wrote:
> > 
> > We need to work out what our story is with sessionStorage and 
> > globalStorage, by the way. Having both them and the SQL storage API 
> > seems like overkill and bloat.
> 
> Yes and no.
> 
> Yes, because you can have all your global and even session storage 
> within a single sqlStorage, and vice-versa. The only fundamental 
> difference is that SQL is better suited for large amounts of work with 
> data - better data management and storage altogether. 
> globalStorage/sessionStorage are quite often preferred in Web 
> applications where there's no need for large amounts of data - only for 
> some application preferences, customizations, etc.
> 
> No, because, obviously, both are useful and... dare I say, needed.

With the new simplified globalStorage, I agree.


> Could they be combined? Probably yes.
> 
> Here's my proposal/idea/suggestion:
> 
> Change the specification, and base all the storage, on a single unified 
> approach. Let all the storage data be in SQL databases.
> 
> Keep the current definition of the client-side database storage API, 
> which allows developers to use SQL databases, and such. This is the "raw" 
> access to all the storage of a domain.
> 
> However, to satisfy the needs and use-cases for globalStorage and 
> sessionStorage, define two basic tables within a single database. Let 
> the database name be the 'storage' string. Let the first table be 
> 'global'. Let the second table be 'session'. Now, define the format of 
> the two tables such that Web authors have the same features/capabilities 
> as they are now defined for globalStorage and sessionStorage APIs 
> (key/value columns, and the domain column for globalStorage).
> 
> Next, redefine the globalStorage and sessionStorage APIs to be just 
> "shorthands" to access the two SQL tables from the 'storage' database. 
> Define which SQL queries are automatically generated.

I don't want to make sessionStorage and globalStorage asynchronous, 
though, and it would be confusing if there were two ways to get to the 
same data that had radically different underlying models. I think it makes 
sense to keep the two distinct as it allows UAs to optimise them 
differently as appropriate for the kind of API they expose.


> I'd recommend to add some examples for executeSql().

Yeah we need examples throughout. Known issue.


> Binary data, as in images, executables, videos, etc. can be inserted 
> into MySQL databases (and into other SQL databases, obviously). They 
> provide several field types which can handle hundreds of megabytes, even 
> gigabytes, of binary data. See BLOBs [1]
> 
> I was asking, if future UAs should allow inserting binary data into 
> tables.
> 
> 1. I want to build a Web application which allows the user to do word 
> processing offline.
> 
> 2. I, as a Web developer, find it ideal to store all the documents 
> within the SQL storage.
> 
> 3. For the moment I have some concerns:
> 
> a) How can I allow the user to "upload" files (images, videos, sounds, 
> archives, etc) into his/her documents without actually uploading the 
> files to my server? *Offline* Web application.

There are plans afoot to add APIs to HTMLInputElement type=file to handle 
this. They are currently stalled on waiting for the forms task force to 
complete so that we don't go down a path that the W3C later decides is the 
wrong path.


> b) How do I insert all this binary data into the SQL storage? Can I have
> something like:
> executeSql('INSERT INTO `myfiles` (`name`, `data`) VALUES (?,?)',
> my_file_name, my_file_data)

I don't believe we currently have a good solution for such binary data, 
but this will likely change in time (it depends a bit on ES4).


> c) There are other concerns as well: can JavaScript engines handle 
> variables that have several megabytes of such data? Isn't that too 
> memory-intensive? Or... could we have special FileObjects which don't 
> actually have the files loaded into memory, but which can be passed to 
> SQL queries?
>
> d) Once the user goes online, how can I upload the files to the server? 
> Without actually reading the entire file from the SQL storage into 
> memory. Again, probably such fields should be some JS SqlBlobObjects 
> which don't actually contain the entire blob, but they can be passed to 
> input type=file for uploading to remote servers.
>
> e) How about streaming the binary data to the remote server with the 
> network connection API?

We'll probably have to address these when we work on forms.


(I'll reply to this e-mail again when I work on the Database part.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 11 December 2007 02:03:55 UTC