RE: Zip/Unzip - the Minimalist Version for EPUB from Toman, Vojtech on 2014-06-04 (public-xml-processing-model-wg@w3.org from June 2014)

From: Toman, Vojtech <vojtech.toman@emc.com>
Date: Wed, 4 Jun 2014 04:57:17 -0400
To: XProc WG <public-xml-processing-model-wg@w3.org>
Message-ID: <F3C7EBECE80AC346BE4D1C5A9BB4A41F3006DA07C9@MX11A.corp.emc.com>

I like the simplicity of this proposal. The only reservation I have is that, unless I missed something, it sacrifices the ability to create zip archives directly from documents flowing through the pipeline: you have to export everything to the file system (or to a location that can be represented by a URI) before applying p:zip. But on second thought, maybe that is the correct way to go: it surely makes p:zip much simpler, as it does not have to deal with XML serialization of the input documents, base64 encoding, base URIs, etc. These things are the main source of complexity (from both the specification and the usage point of view) in the previous versions of p:zip, IMHO.
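
To make the trade-off concrete, here is a rough Python sketch (not XProc, and not part of the proposal) of the two-stage approach: the documents are first serialized to a directory, and the archiving function then only has to deal with files. The names `store_documents`, `zip_directory` and `demo` are made up for illustration.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET


def store_documents(docs, base_dir):
    """Serialize in-memory XML documents to files under base_dir (the 'export' stage)."""
    for rel_path, root in docs.items():
        full_path = os.path.join(base_dir, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        ET.ElementTree(root).write(full_path, encoding="utf-8", xml_declaration=True)


def zip_directory(base_dir, target):
    """Archive every file under base_dir; no serialization logic is needed here."""
    names = []
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirs, files in os.walk(base_dir):
            for name in sorted(files):
                full_path = os.path.join(dirpath, name)
                # Zip entry names always use forward slashes.
                arcname = os.path.relpath(full_path, base_dir).replace(os.sep, "/")
                zf.write(full_path, arcname)
                names.append(arcname)
    return sorted(names)


def demo():
    """Store two tiny documents, zip them, and return the sorted entry names."""
    docs = {
        "content/book.xhtml": ET.Element("html"),
        "META-INF/container.xml": ET.Element("container"),
    }
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "book")
        store_documents(docs, base)
        return zip_directory(base, os.path.join(tmp, "book.epub"))
```

All serialization decisions (encoding, XML declaration) live in the first function, which is roughly the simplification I mean above.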

Some other random thoughts:
- How can you control the compression level etc. in p:zip?
- What if p:unzip had a non-primary output port that would always contain the manifest, instead of having the manifest-only option?
- Can p:zip-extract handle non-XML files? If so, how would that work? (I assume the user would have to tell p:zip-extract the media type of the entry or something like that.)
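
As an illustration of the last point, here is a minimal Python sketch of what "the user tells the step the media type" could mean. It is only an assumption about possible behaviour, not something the proposal defines: XML media types are parsed, and everything else is base64-encoded inside a hypothetical `data` wrapper element (loosely modelled on how p:data wraps non-XML content in c:data). The function names are invented for the example.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_entry(archive, path, media_type):
    """Return an XML element for the zip entry at `path`.

    XML media types are parsed; anything else is wrapped, base64-encoded,
    in a hypothetical <data> element.
    """
    with zipfile.ZipFile(archive) as zf:
        raw = zf.read(path)
    if media_type.endswith("/xml") or media_type.endswith("+xml"):
        return ET.fromstring(raw)
    wrapper = ET.Element("data", {"content-type": media_type, "encoding": "base64"})
    wrapper.text = base64.b64encode(raw).decode("ascii")
    return wrapper


def demo():
    """Build a small in-memory archive and extract one XML and one binary entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content/book.xhtml", "<html xmlns='http://www.w3.org/1999/xhtml'/>")
        zf.writestr("content/cover.png", b"\x89PNG")
    buf.seek(0)
    xml_doc = extract_entry(buf, "content/book.xhtml", "application/xhtml+xml")
    buf.seek(0)
    binary = extract_entry(buf, "content/cover.png", "image/png")
    return xml_doc.tag, binary.get("encoding"), base64.b64decode(binary.text)
```

Without the `media_type` argument the function would have to guess from the entry name or sniff the content, which is why I assume the user would need to supply it.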

Regards,
Vojtech

> -----Original Message-----
> From: Alex Miłowski [mailto:alex@milowski.com]
> Sent: Tuesday, June 03, 2014 11:14 PM
> To: XProc WG
> Subject: Zip/Unzip - the Minimalist Version for EPUB
> 
> Here's a minimalist version that can address the needs of EPUB:
> 
> 1. We have a zip step that just zips files and directories with control over
> compression.
> 
> <p:declare-step type="p:zip">
>      <p:input port="source" primary="true"/>
>      <p:output port="result" primary="true"/>
>      <p:option name="target"/>
>      <p:option name="brief" select="'true'"/>
> </p:declare-step>
> 
> The input is a c:archive element and the output is a c:archive.  If the 'brief'
> option is true, only the c:archive element is output.
> Otherwise, the full list of every entry is provided on the output.
> 
> If the 'target' option is not specified, the c:archive element must have an
> 'href' attribute.
> 
> 2. We have an unzip step that:
> 
>      * can list a manifest of what is in the zip file
>      * can extract the zip locally (e.g. on disk) to a location specified via an
> option.
> 
> <p:declare-step type="p:unzip">
>      <p:output port="result" primary="true"/>
>      <p:option name="href" required="true"/>
>      <p:option name="target"/>
>      <p:option name="brief" select="'true'"/>
>      <p:option name="manifest-only" select="'true'"/>
> </p:declare-step>
> 
> The archive is specified via the 'href' option.  The result is extracted to the
> target location specified by the 'target' option.  If that option is not specified,
> the target is generated from the source.
> 
> The output of the step is a c:directory element.  If 'brief' is true, only the
> directory is listed.  Otherwise, every file and subdirectory is listed in the
> output.
> 
> Alternatively, if 'manifest-only' is true, the output is a c:archive element
> listing all the entries in the zip file.  The 'target' and 'brief' options are ignored
> when 'manifest-only' is true.
> 
> 3. The manifest uses c:entry elements instead of files:
> 
> element c:archive {
>   attribute href { text }?,
>   attribute base { text }?,
>   c:entry*
> }
> 
> element c:entry {
>   attribute path { text },
>   attribute modified { text },
>   attribute size { text },
>   attribute comment { text }?,
>   attribute compressed { "true" | "false" }?,
>   attribute directory { "true" | "false" }?
> }
> 
> 4. A new step p:zip-extract extracts a single entry from a zip file as the output
> of the step:
> 
> <p:declare-step type="p:zip-extract">
>      <p:input port="source" primary="true"/>
>      <p:output port="result" primary="true"/>
>      <p:option name="href" required="true"/>
> </p:declare-step>
> 
> The input is expected to be a single c:entry element.
> 
> We could consider allowing a c:archive element to extract multiple files.  We
> would need to provide a way to designate whether the results are outputs or
> written to local storage.
> 
> We could consider allowing a 'target' option so that the entries are extracted
> to local storage.
> 
> 5. In the future, a p:zip-modify step can handle updating or deleting entries
> as well as merging zip files.
> 
> 6. In the future, we could consider allowing directory entries to have
> inclusion/exclusion patterns for handling file inclusion.  This would allow one
> to zip only files of certain extensions within a directory.
> 
> 
> Use cases:
> 
> 1. Creating an EPUB file:
> 
> <p:zip brief="false">
>    <p:input port="source">
>       <p:inline>
>            <c:archive href="book.epub" base="book">
>                 <c:entry path="mimetype" compressed="false"/>
>                 <c:entry path="META-INF" directory="true"/>
>                 <c:entry path="content" directory="true"/>
>            </c:archive>
>       </p:inline>
>    </p:input>
> </p:zip>
> 
>  produces (for example):
> 
>     <c:archive href="book.epub">
>          <c:entry path="mimetype" compressed="false"/>
>          <c:entry path="META-INF/" directory="true"/>
>          <c:entry path="META-INF/container.xml" compressed="true"/>
>          <c:entry path="content/" directory="true"/>
>          <c:entry path="content/book.opf" compressed="true"/>
>          <c:entry path="content/book.ncx" compressed="true"/>
>          <c:entry path="content/book.xhtml" compressed="true"/>
>    </c:archive>
> 
> 2. Unpack an EPUB:
> 
>    <p:unzip href="book.epub" target="book" brief="false"/>
> 
>    produces (for example):
> 
>    <c:directory href="book/">
>          <c:file href="mimetype"/>
>          <c:directory href="book/META-INF/">
>             <c:file href="book/META-INF/container.xml"/>
>          </c:directory>
>          <c:directory href="book/content/">
>              <c:file href="book/content/book.opf"/>
>              <c:file href="book/content/book.ncx"/>
>              <c:file href="book/content/book.xhtml"/>
>          </c:directory>
>    </c:directory>
> 
> 3. Getting content from an EPUB file:
> 
>    <p:zip-extract href="book.epub">
>       <p:input port="source">
>           <p:inline>
>               <c:entry path="content/book.xhtml"/>
>           </p:inline>
>        </p:input>
>    </p:zip-extract>
> 
>    produces (for example):
> 
>      <html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
> 
> --
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
> 
> Bertrand Russell in a footnote of Principles of Mathematics
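
For comparison with use case 1 in the quoted message, the following Python sketch approximates the same behaviour outside XProc: it writes the `mimetype` entry first and uncompressed (as the EPUB container rules require, and as the `compressed="false"` attribute in the example suggests), deflates everything else, and then lists the entries in a form close to the proposed c:entry attributes. The function names and the dictionary layout are assumptions made for illustration only.

```python
import io
import zipfile


def build_epub(target, files):
    """Write `files` (path -> bytes) to `target`, with 'mimetype' first and stored."""
    with zipfile.ZipFile(target, "w") as zf:
        zf.writestr("mimetype", files["mimetype"], compress_type=zipfile.ZIP_STORED)
        for path in sorted(p for p in files if p != "mimetype"):
            zf.writestr(path, files[path], compress_type=zipfile.ZIP_DEFLATED)


def manifest(archive):
    """List the archive entries as dicts that mirror the proposed c:entry attributes."""
    with zipfile.ZipFile(archive) as zf:
        return [
            {
                "path": info.filename,
                "size": info.file_size,
                "compressed": info.compress_type != zipfile.ZIP_STORED,
                "directory": info.is_dir(),
            }
            for info in zf.infolist()
        ]


def demo():
    """Build a tiny EPUB-like archive in memory and return its manifest."""
    files = {
        "mimetype": b"application/epub+zip",
        "META-INF/container.xml": b"<container/>",
        "content/book.xhtml": b"<html/>",
    }
    buf = io.BytesIO()
    build_epub(buf, files)
    buf.seek(0)
    return manifest(buf)
```

The per-entry `compress_type` argument plays the role of the `compressed` attribute on c:entry; a finer-grained compression level would need an additional setting that the proposal does not define.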
Received on Wednesday, 4 June 2014 08:58:03 UTC