Multipart or TAR archive/package support for all APIs (Performance and scalability)

There has been some talk about supporting packages/archives in web APIs.

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-July/021586.html
http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0460.html

--------------

Why?

The main purpose is performance: opening several connections carries
overhead. While this could potentially be solved using HTTP pipelining,
there are several advantages to working with packages in single requests.

- HTTP pipelining has various bugs in several servers and proxies.
Therefore, it's disabled by default in most(?) current browsers and several
proxies. To become usable, it would need several specification changes and
updates to servers, proxies and browsers alike.

- Even if HTTP pipelining worked as expected, Keep-Alive requires that
servers keep connections open for a certain timeout period. That can be
detrimental to high-performance servers. The usual workaround is to set the
timeout so low that the connection may time out during page load, making it
worse than no pipelining.

- By packaging small files as a single unit, you can gzip the entire package
using Content-Encoding. That can have major bandwidth benefits compared to
gzipping each file individually. (.tar.gz vs .gz.tar)

- High-performance servers can easily handle packaged data. It's quicker to
read a large file as a single consecutive read than to make lots of lookups
and seeks to find lots of small files on disk.

- Clients can cache the package as a single unit, giving clients the same
boost on disk seeks, if a simple caching mechanism is used.

- If it's ubiquitous, it's easier for authors to package and deploy widgets
and client-side tools as single files.
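
To get a rough feel for the gzip point above, here is a small sketch that
compresses the same data both ways. It assumes Node.js with its built-in
zlib module; the twenty small JSON "files" are made up for illustration, and
a real package would also carry per-file headers, which are left out here.

```javascript
// Sketch: gzip many small files individually vs. as one package.
// Assumes Node.js (built-in zlib); the file contents are invented.
const zlib = require("zlib");

// Twenty small, similar "files", as is typical for widget resources.
const files = [];
for (let i = 0; i < 20; i++) {
  files.push(Buffer.from(
    '{"id": ' + i + ', "type": "widget", "title": "Example widget", ' +
    '"description": "A small client-side tool", "enabled": true}\n'
  ));
}

// ".gz.tar" style: each file is compressed on its own, sizes are summed.
const individually = files
  .map((f) => zlib.gzipSync(f).length)
  .reduce((a, b) => a + b, 0);

// ".tar.gz" style: files are concatenated first, then compressed once.
const packaged = zlib.gzipSync(Buffer.concat(files)).length;

console.log("gzipped individually:", individually, "bytes");
console.log("gzipped as one unit: ", packaged, "bytes");
```

The gap comes from two things: every gzip stream has its own fixed header
and trailer, and a single stream can reuse strings it has already seen in
earlier files.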

--------------

How?

My suggestion would be to define the meaning of the URI fragment for a
certain multipart type. The fragment identifier denotes a certain file
within the package, e.g. http://domain/archive#filename. This is similar to
the use of fragments for character and line positions in text/plain
(rfc5147 <http://tools.ietf.org/html/rfc5147>),
anchors in text/html (rfc2854 <http://www.ietf.org/rfc/rfc2854.txt>), etc.

The idea is that you could reference a single file within an archive in any
other web API. The UA would download the archive and load the file when it
reaches a file with said identifier within that archive.
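
A minimal sketch of that lookup, to make the idea concrete. The function
names (splitArchiveUrl, findMember) and the entry objects are hypothetical,
not part of any existing API; the generator only simulates entries arriving
over the network in order.

```javascript
// Hypothetical helper: split "archive#member" into its two parts.
function splitArchiveUrl(url) {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) {
    return { archiveUrl: url, member: null };
  }
  return {
    archiveUrl: url.slice(0, hashIndex),
    member: decodeURIComponent(url.slice(hashIndex + 1)),
  };
}

// Walk entries in arrival order and stop at the first match, so the UA
// does not have to wait for the rest of the archive.
function findMember(entries, member) {
  for (const entry of entries) {
    if (entry.name === member) return entry;
  }
  return null;
}

// Simulated download: count how many entries were actually read.
const ref = splitArchiveUrl("http://domain/archive#filename");
let seen = 0;
function* arriving() {
  for (const name of ["other", "filename", "never-read"]) {
    seen++;
    yield { name: name };
  }
}
const hit = findMember(arriving(), ref.member);
console.log(ref.archiveUrl, hit.name, "entries read:", seen);
```

Only the archive part would go to the server; the fragment stays on the
client, as with any other fragment.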

The packaging format could be any existing format: application/tar (using
filenames), multipart/form-data (using the name attribute in
Content-Disposition part-header) or multipart/related (using Content-ID
part-header). But it's probably good to settle on one.
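
For the tar option, finding a file by name is fairly little work. The
sketch below is only an illustration under simplifying assumptions: it reads
just the name (offset 0, 100 bytes) and the octal size (offset 124, 12
bytes) from each 512-byte header and ignores checksums, long names and
entry types. The tarEntry and concat helpers are made up to build test
data; they leave the checksum empty, so real tar tools would reject their
output.

```javascript
const BLOCK = 512;
const decoder = new TextDecoder();
const encoder = new TextEncoder();

// Read a NUL-terminated field from a tar header.
function readField(bytes, offset, length) {
  const field = bytes.subarray(offset, offset + length);
  const end = field.indexOf(0);
  return decoder.decode(end === -1 ? field : field.subarray(0, end));
}

// Return the contents of the entry called `name`, or null if absent.
function findInTar(tarBytes, name) {
  let offset = 0;
  while (offset + BLOCK <= tarBytes.length) {
    const header = tarBytes.subarray(offset, offset + BLOCK);
    const entryName = readField(header, 0, 100);
    if (entryName === "") break; // an all-zero block ends the archive
    const size = parseInt(readField(header, 124, 12).trim(), 8);
    const dataStart = offset + BLOCK;
    if (entryName === name) {
      return tarBytes.subarray(dataStart, dataStart + size);
    }
    // Entry data is padded up to the next 512-byte boundary.
    offset = dataStart + Math.ceil(size / BLOCK) * BLOCK;
  }
  return null;
}

// Illustration only: header with just name and size filled in, plus data.
function tarEntry(name, text) {
  const data = encoder.encode(text);
  const out = new Uint8Array(BLOCK + Math.ceil(data.length / BLOCK) * BLOCK);
  out.set(encoder.encode(name), 0);
  out.set(encoder.encode(data.length.toString(8).padStart(11, "0")), 124);
  out.set(data, BLOCK);
  return out;
}

// Join several byte arrays into one.
function concat(chunks) {
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
  let pos = 0;
  for (const c of chunks) {
    out.set(c, pos);
    pos += c.length;
  }
  return out;
}

const archive = concat([
  tarEntry("file.js", "alert('hi');"),
  tarEntry("file.xml", "<root/>"),
  new Uint8Array(2 * BLOCK), // end-of-archive marker
]);
console.log(decoder.decode(findInTar(archive, "file.xml")));
```

The multipart formats would need a similar loop over boundaries, matching
on the name attribute or Content-ID instead of the header name field.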

The identifier fragment can itself have an additional fragment when the
inner mime type defines a special usage: <a
href="archive#file.html#anchorname"> or any other place where you need a
fragment to define behavior (SVG, XBL, etc). Note that the generic URI
syntax (rfc3986 <http://tools.ietf.org/html/rfc3986>) does not list # among
the characters allowed inside a fragment, so multiple # would have to be
explicitly permitted for these types.

Does it break any other existing specs or implementations?
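
As a rough illustration of the double-fragment idea, here is one way a UA
might split such a reference. It assumes the first # separates the archive
from the file and the second # separates the file from its inner fragment;
the function name is made up.

```javascript
// Hypothetical helper: split "archive#member#inner" into three parts.
function splitNestedFragment(url) {
  const first = url.indexOf("#");
  if (first === -1) {
    return { archiveUrl: url, member: null, innerFragment: null };
  }
  const archiveUrl = url.slice(0, first);
  const rest = url.slice(first + 1);
  const second = rest.indexOf("#");
  if (second === -1) {
    return { archiveUrl: archiveUrl, member: rest, innerFragment: null };
  }
  return {
    archiveUrl: archiveUrl,
    member: rest.slice(0, second),
    innerFragment: rest.slice(second + 1),
  };
}

console.log(splitNestedFragment("archive#file.html#anchorname"));
```

One consequence of splitting on the first two # characters is that file
names inside the archive could not themselves contain an unescaped #.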

--------------

Compatibility?

Additionally, you could add a new attribute to HTML5 (and an equivalent in
CSS) for archive URLs. That way, compatible UAs can use the package if
supported, and otherwise fall back to regular files. Perhaps you could use
media types with nested MIME types: <audio src="archive#audiofile"
type="multipart/related; fragmenttype=audio/ogg"></audio>

Example usage:

<img src="file.jpg" msrc="archive.tar#file.jpg" />

selector {
  background-image: url(file.jpg);
  background-image: murl(archive.tar#file.jpg);
}

<script src="file.js" msrc="archive.tar#file.js" type="text/javascript"></script>

var img = new Image();
img.msrc = "archive.tar#file.png";

var xhr = new XMLHttpRequest();
xhr.open("GET", "archive.tar#file.xml", true);
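
The fallback rule behind these examples could be sketched as follows. This
is only an assumption about how a UA might choose; chooseSource and the
supportsArchives flag are hypothetical, and msrc is the attribute proposed
above, not an existing one.

```javascript
// Hypothetical: pick the archive reference when the UA supports it,
// otherwise fall back to the regular file.
function chooseSource(attrs, supportsArchives) {
  if (supportsArchives && attrs.msrc) {
    return attrs.msrc;
  }
  return attrs.src;
}

const imgAttrs = { src: "file.jpg", msrc: "archive.tar#file.jpg" };
console.log(chooseSource(imgAttrs, true));
console.log(chooseSource(imgAttrs, false));
```

A UA that does not know msrc never looks at it, so existing pages keep
working unchanged.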


-----------------

The appeal of this suggestion is that it is rather easy to specify.
It's a minor tweak that would open up many possibilities using existing
tools. It may not be so minor for implementations, though. I'd love to hear
other suggestions on how best to address this issue.

Received on Tuesday, 4 August 2009 19:31:18 UTC