- From: Glenn Maynard <glenn@zewt.org>
- Date: Wed, 15 Aug 2012 23:38:56 -0500
- To: Jonas Sicking <jonas@sicking.cc>
- Cc: whatwg@whatwg.org, Andrea Marchesini <baku@mozilla.com>
On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking <jonas@sicking.cc> wrote:

> Though I still think that we should support reading out specific files
> using a filename as a key. I think a common use-case for ArchiveReader
> is going to be web developers wanting to download a set of resources
> from their own website and wanting to use a .zip file as a way to get
> compression and packaging. In that case they can easily either ensure
> to stick with ASCII filenames, or encode the names in UTF8.

That's what this was for:

// For convenience, add "getter File? (DOMString name)" to FileList, to find
// a file by name. This is equivalent to iterating through files[] and
// comparing .name. If no match is found, return null. This could be a
// function instead of a getter.
var example_file2 = zipFile.files["file.txt"];
if (example_file2 == null) {
    console.error("file.txt not found in ZIP");
    return;
}

I suppose a named getter isn't a great idea--you might have a filename
"length"--so a "zipFile.files.find('file.txt')" function is probably better.

> By allowing them to download a .zip file, they can also store that
> .zip in compressed form in IndexedDB or the FileSystem API in order to
> use less space on the user's device. (Additionally many times IO gets
> faster by using .zip files because the time saved in doing less IO is
> larger than the time spent decompressing. Obviously very dependent on
> what data is being stored.)

There's also the question of when decompression happens--you don't want to
decompress the whole thing in advance if you can avoid it, since if the
user isn't doing random access you can stream the decompression--but
that's just QoI, of course.

> One way we could support this would be to have a method which allows
> getting a list of meta-data about each entry. Probably together with
> the File object itself. So we could return an array of objects like:
>
> [ {
>   rawName: <Uint8Array>,
>   file: <File object>,
>   crc32: <Uint8Array>
> },
> {
>   rawName: <Uint8Array>,
>   file: <File object>,
>   crc32: <Uint8Array>
> },
> ... ]
>
> That way we can also leave out the crc from archive types that don't
> support it.

This means exposing two objects per file. I'd prefer a single File-subclass
object per file, with any extra metadata put on the subclass.

> This is definitely an interesting idea. The current API is designed
> around doing the IO when each individual operation is done. You are
> proposing to do all IO up front which allows all operations to be
> synchronous.
>
> I suspect that doing the IO "lazily" can provide better performance
> for some types of operations, such as only wanting to extract a single
> resource from an archive. But maybe the difference wouldn't be that
> big in most cases.

I'd expect the I/O savings to be negligible, since ZIP has a central
directory at the end, allowing the whole thing to be read very quickly.

I hope creating an array of File objects (even thousands of them) isn't too
expensive. Even if it is, though, this could be refactored to still give a
synchronous interface: store the file directory natively (in a non-File,
non-GC'd way), and allow looking up and iterating that list in a way that
only instantiates one File object at a time. (This would lose the FileList
API compatibility with <input type=file>, though, which I think is a nice
plus.)
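To give a sense of how little IO the directory read involves, here's a
minimal sketch (names mine, no error handling, Zip64 ignored) of locating
the central directory with nothing but FileReader and DataView:

function readCentralDirectoryInfo(blob, callback) {
    // The End of Central Directory record is 22 bytes, plus an optional
    // trailing comment of up to 65535 bytes, so at most this much of the
    // file's tail ever needs to be read.
    var tailSize = Math.min(blob.size, 22 + 65535);
    var reader = new FileReader();
    reader.onload = function() {
        var view = new DataView(reader.result);
        // Scan backwards for the EOCD signature, 0x06054b50 (little-endian).
        for (var i = view.byteLength - 22; i >= 0; i--) {
            if (view.getUint32(i, true) === 0x06054b50) {
                callback({
                    entryCount: view.getUint16(i + 10, true),
                    directorySize: view.getUint32(i + 12, true),
                    directoryOffset: view.getUint32(i + 16, true)
                });
                return;
            }
        }
        callback(null); // no signature: not a ZIP, or a damaged one
    };
    reader.readAsArrayBuffer(blob.slice(blob.size - tailSize));
}

From there, a single read of directorySize bytes at directoryOffset yields
every filename in the archive, no matter how large the file data is.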
> But I like this approach a lot if we can make it work. The main thing
> I'd be worried about, apart from the IO performance above, is if we
> can make it work for a larger set of archive formats. Like, can we
> make it work for .tar and .tar.gz? I think we couldn't but we would
> need to verify.

It wouldn't handle it very well, but the original API wouldn't, either. In
both cases, the only way to find filenames in a TAR--whether it's to search
for one or to construct a list--is to scan through the whole file (and
decompress it all, for .tgz). Simply retrieving a list of filenames from a
large .tgz would thrash the user's disk and chew CPU.

I don't think there's much use in supporting .tar, anyway. Even if you want
true streaming (which would be a different API anyway, since we're reading
from a Blob here), ZIP can do that too, by using the local file headers
instead of the central directory.
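To illustrate the contrast, a sketch (names mine; GNU/pax extensions and
most error handling omitted) of what merely listing the names in an
uncompressed .tar Blob entails--one 512-byte header read per file, seeking
across the entire archive:

function listTarNames(blob, callback, names, offset) {
    names = names || [];
    offset = offset || 0;
    if (offset + 512 > blob.size) { callback(names); return; }
    var reader = new FileReader();
    reader.onload = function() {
        var header = new Uint8Array(reader.result);
        // The archive ends in zeroed blocks; a name never starts with NUL.
        if (header[0] === 0) { callback(names); return; }
        // Name: bytes 0-99, NUL-padded; size: bytes 124-135, octal text.
        var name = String.fromCharCode.apply(null, header.subarray(0, 100))
            .replace(/\0[\s\S]*$/, "");
        var size = parseInt(String.fromCharCode.apply(null,
            header.subarray(124, 136)), 8);
        names.push(name);
        // Skip the file data, padded up to the next 512-byte boundary.
        listTarNames(blob, callback, names,
            offset + 512 + Math.ceil(size / 512) * 512);
    };
    reader.readAsArrayBuffer(blob.slice(offset, offset + 512));
}

And that's the cheap case: for .tar.gz, every one of those header reads
also requires decompressing all of the data in front of it.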
--
Glenn Maynard

Received on Thursday, 16 August 2012 04:39:24 UTC