Re: [whatwg] Archive API - proposal

On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking <jonas@sicking.cc> wrote:

> Though I still think that we should support reading out specific files
> using a filename as a key. I think a common use-case for ArchiveReader
> is going to be web developers wanting to download a set of resources
> from their own website and wanting to use a .zip file as a way to get
> compression and packaging. In that case they can easily either ensure
> to stick with ASCII filenames, or encode the names in UTF8.
>

That's what this was for:

    // For convenience, add "getter File? (DOMString name)" to FileList, to
    // find a file by name.  This is equivalent to iterating through files[]
    // and comparing .name.  If no match is found, return null.  This could
    // be a function instead of a getter.
    var example_file2 = zipFile.files["file.txt"];
    if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; }

I suppose a named getter isn't a great idea--you might have a filename
"length"--so a "zipFile.files.find('file.txt')" function is probably better.
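A find() function like that could be specified as nothing more than the linear scan described above. A minimal sketch, using plain objects with a .name property to stand in for File entries (a real FileList isn't constructible from script):

```javascript
// Sketch of the proposed find(): walk the list and compare .name,
// returning null when nothing matches.
function findByName(files, name) {
  for (let i = 0; i < files.length; i++) {
    if (files[i].name === name) return files[i];
  }
  return null;
}

const files = [{ name: "a.txt" }, { name: "file.txt" }];
findByName(files, "file.txt");    // the second entry
findByName(files, "missing.txt"); // null
```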

> By allowing them to download a .zip file, they can also store that
> .zip in compressed form in IndexedDB or the FileSystem API in order to
> use less space on the user's device. (Additionally many times IO gets
> faster by using .zip files because the time saved in doing less IO is
> larger than the time spent decompressing. Obviously very dependent on
> what data is being stored).
>

There's also the question of when decompression happens--you don't want to
decompress the whole thing in advance if you can avoid it, since if the
user isn't doing random access you can stream the decompression--but that's
just QoI, of course.

One way we could support this would be to have a method which allows
> getting a list of meta-data about each entry. Probably together with
> the File object itself. So we could return an array of objects like:
>
> [ {
>     rawName: <UInt8Array>,
>     file: <File object>,
>     crc32: <UInt8Array>
>   },
>   {
>     rawName: <UInt8Array>,
>     file: <File object>,
>     crc32: <UInt8Array>
>   },
>   ...
> ]
>
> That way we can also leave out the crc from archive types that doesn't
> support it.
>

This means exposing two objects per file.  I'd prefer a single
File-subclass object per file, with any extra metadata put on the subclass.
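To make the shape concrete, here's a sketch of what one object per entry might look like. The class name and fields are illustrative, not proposed spec text, and a plain class stands in for an actual File subclass:

```javascript
// Hypothetical single-object shape: the File fields plus the archive
// metadata, instead of a separate { rawName, file, crc32 } wrapper.
class ZipEntry {
  constructor(name, rawName, crc32) {
    this.name = name;       // decoded filename, as on File.name
    this.rawName = rawName; // Uint8Array of the undecoded filename bytes
    this.crc32 = crc32;     // number, or null for formats without a CRC
  }
}

const entry = new ZipEntry("file.txt", new Uint8Array([0x66]), 0x1234);
```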

>
> This is definitely an interesting idea. The current API is designed
> around doing the IO when each individual operation is done. You are
> proposing to do all IO up front which allows all operations to be
> synchronous.
>
> I suspect that doing the IO "lazily" can provide better performance
> for some types of operations, such as only wanting to extract a single
> resource from an archive. But maybe the difference wouldn't be that
> big in most cases.
>

I'd expect the I/O savings to be negligible, since ZIP has a central
directory at the end, allowing the whole thing to be read very quickly.
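Concretely, the end-of-central-directory (EOCD) record sits in the file's tail and is found by scanning backwards for its signature, so a reader only touches the last few kilobytes. A minimal sketch over a Uint8Array, ignoring Zip64:

```javascript
// Locate the EOCD record (signature 0x06054b50, little-endian on disk)
// by scanning backwards. The record is at least 22 bytes; a trailing
// comment field can push it further from the end.
function findEOCD(bytes) {
  for (let i = bytes.length - 22; i >= 0; i--) {
    if (bytes[i] === 0x50 && bytes[i + 1] === 0x4b &&
        bytes[i + 2] === 0x05 && bytes[i + 3] === 0x06) {
      return i;
    }
  }
  return -1; // not a ZIP, or truncated
}
```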

I hope creating an array of File objects (even thousands of them) isn't too
expensive.  Even if it is, though, this could be refactored to still give a
synchronous interface: store the file directory natively (in a non-File,
non-GC'd way), and allow looking up and iterating that list in a way that
only instantiates one File object at a time.  (This would lose the FileList
API compatibility with <input type=file>, though, which I think is a nice
plus.)

> But I like this approach a lot if we can make it work. The main thing
> I'd be worried about, apart from the IO performance above, is if we
> can make it work for a larger set of archive formats. Like, can we
> make it work for .tar and .tar.gz? I think we couldn't but we would
> need to verify.
>

It wouldn't handle it very well, but the original API wouldn't, either.  In
both cases, the only way to find filenames in a TAR--whether it's to search
for one or to construct a list--is to scan through the whole file (and
decompress it all, for .tgz).  Simply retrieving a list of filenames from a
large .tgz would thrash the user's disk and chew CPU.

I don't think there's much use in supporting .tar, anyway.  Even if you
want true streaming (which would be a different API anyway, since we're
reading from a Blob here), ZIP can do that too, by using the local file
headers instead of the central directory.
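For illustration, streaming a ZIP front-to-back means reading each local file header in sequence; the header records the filename and extra-field lengths, which tell you where the entry's data begins. A sketch that parses just those fields, skipping Zip64 and the data-descriptor case:

```javascript
// Each entry begins with a local file header (signature 0x04034b50).
// Offsets 26 and 28 hold the little-endian filename and extra-field
// lengths; the name starts at offset 30, the data right after it.
function parseLocalHeader(bytes, off) {
  const u16 = (o) => bytes[o] | (bytes[o + 1] << 8);
  if (!(bytes[off] === 0x50 && bytes[off + 1] === 0x4b &&
        bytes[off + 2] === 0x03 && bytes[off + 3] === 0x04)) {
    return null; // not a local header (e.g. we've hit the central directory)
  }
  const nameLen = u16(off + 26);
  const extraLen = u16(off + 28);
  return {
    nameStart: off + 30,
    dataStart: off + 30 + nameLen + extraLen,
  };
}
```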

-- 
Glenn Maynard

Received on Thursday, 16 August 2012 04:39:24 UTC