archive module

Hello

Christian Grün (BaseX) and I have been working on a new version of a ZIP module. This module is more generic than the existing proposal in three respects:

- Its interface allows for arbitrary archive formats and compression algorithms (e.g. ZIP, or tar with gzip or bzip2). Hence, we have changed the name of the module to "archive".

- An archive is represented as an xs:base64Binary item (instead of a resource identified by a URI, as it was before).

- Entries can only be extracted as either text or binary. In contrast to the existing proposal, the conversion to HTML or XDM is left to the consumer.
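
The second and third points can be illustrated with a rough sketch, written in Python rather than XQuery so that it does not suggest any particular function signatures for the module. The helper names create_archive, extract_binary and extract_text are made up for this example and are not the functions of either implementation. The archive is held as an in-memory value, Base64-encoded the way an xs:base64Binary item would serialize, and entries come back as plain bytes or text only:

```python
import base64
import io
import zipfile


def create_archive(entries):
    """Build a ZIP from a {name: bytes} mapping and return it as Base64 text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def extract_binary(archive_b64, name):
    """Return the raw bytes of one entry."""
    raw = base64.b64decode(archive_b64)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.read(name)


def extract_text(archive_b64, name, encoding="utf-8"):
    """Return one entry decoded as text; no XML or HTML parsing happens here."""
    return extract_binary(archive_b64, name).decode(encoding)


# The archive is a value that is passed around, not a resource behind a URI.
archive_value = create_archive({
    "doc.xml": b"<doc>hi</doc>",
    "logo.bin": b"\x00\x01\x02",
})
text = extract_text(archive_value, "doc.xml")
blob = extract_binary(archive_value, "logo.bin")
```

Turning the extracted text into a document node is then the caller's job, which matches the split described in the third point.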

As a proof of concept, we have already implemented the module in BaseX and Zorba. You can find the respective modules at:

Zorba:
http://bazaar.launchpad.net/~zorba-coders/zorba/zorba_zip_module/files/head:/src/archive_module.xq and
http://bazaar.launchpad.net/~zorba-coders/zorba/zorba_zip_module/files/head:/src/archive.xsd

BaseX:
http://docs.basex.org/wiki/Archive_Module

There are still a couple of questions that need to be answered and it would be great to get your opinion:

We would like to make support for the ZIP archive format with the compression algorithms STORE and DEFLATE mandatory. All other formats or compression algorithms will probably have to be implementation-dependent. For example, Zorba's implementation is based on libarchive and allows for creating compressed tar archives. BaseX's implementation is in Java and doesn't allow for creating a tar archive, but it provides a way to compress just a single entry with gzip. Does this make sense?
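
For readers less familiar with the two ZIP methods, here is a small sketch of the difference, again in Python and only as an illustration (it uses Python's standard zipfile and gzip modules, not either of the proof-of-concept implementations; the entry names and payload are invented). STORE writes an entry unchanged, DEFLATE compresses it, and both can be mixed within one archive. The last line shows gzip applied to a single piece of data without any archive container:

```python
import gzip
import io
import zipfile

payload = b"abc" * 1000  # 3000 bytes of highly repetitive data

# Write one entry per method into the same in-memory ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    archive.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

# Read the per-entry metadata back: method, compressed size, original size.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    sizes = {
        info.filename: (info.compress_type, info.compress_size, info.file_size)
        for info in archive.infolist()
    }

# gzip on a single item, with no archive format involved.
single = gzip.compress(payload)
```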

Many archive formats and compression algorithms can be parameterized with various options. Hence, those options need to be passed in an implementation-dependent way. We are not yet sure what those parameters would look like.

With the current interface it's not possible to extract all information out of an archive in a streaming fashion. Specifically, there is one function which returns the metadata of all the entries (e.g. their names) and another set of functions that extract a particular set of entries from the archive, given the names of the entries. To use both on the same archive, one needs to be able to seek back and forth in it. This might not be possible, for example, if the archive comes from an HTTP resource which is too big to be materialized. There are several ways to return metadata and data at the same time, but none of them seems really appealing. For example, there could be one function that returns a heterogeneous sequence alternating metadata and data, but the result might be hard to process in XQuery.
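
To show what such an alternating sequence could look like, here is a minimal single-pass sketch in Python. It is an illustration of the idea only, not a proposal for the module's interface, and build_tar_gz and stream_entries are hypothetical helpers. It uses a gzip-compressed tar because tar stores each entry's header directly before its data, so one forward read is enough; Python's tarfile stream mode ("r|gz") never seeks backwards.

```python
import io
import tarfile


def build_tar_gz(entries):
    """Build a gzip-compressed tar from a {name: bytes} mapping (test data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def stream_entries(stream):
    """Yield metadata, data, metadata, data, ... in one forward pass."""
    # "r|gz" treats the input as a pure stream of blocks: no seeking.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            yield {"name": member.name, "size": member.size}
            # The data must be read before advancing to the next member.
            yield archive.extractfile(member).read()


raw = build_tar_gz({"a.txt": b"alpha", "b.txt": b"beta"})
items = list(stream_entries(io.BytesIO(raw)))
```

The consumer has to pair up each metadata item with the data item that follows it, which is exactly the kind of positional processing that tends to be awkward to write in XQuery.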

Christian and I would really appreciate your comments on those modules.

Best regards

Matthias

Received on Thursday, 28 June 2012 17:27:12 UTC