- From: Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de>
- Date: Fri, 06 Jan 2012 03:46:48 +0100
- To: xproc-dev@w3.org
These are suggestions for facilitating zip and text processing.

I'd appreciate it if zip input became a first-class citizen, i.e. if the pxp:unzip step were promoted to the standard step library. In addition to what pxp:unzip provides, it would be nice to have an option that, as a side effect, makes the step expand the whole zip file to some filesystem location, so that subsequent steps can work on the expanded content (see the first sketch below). The two shortcomings of the current pxp:unzip are that it can only extract selected files and that text (like all other non-XML content) is always extracted as base64-encoded data.

A field where XProc is really useful is checking and transforming data in zipped container formats (such as IDML, EPUB, OpenDocument, or docx). Consider the case of an EPUB that contains XHTML files that link to CSS stylesheets. We have implemented a CSS parser in XSLT 2. To make it work without too many annoying workarounds, we have to unpack the zip file first. Otherwise we'd have to extract the OPF file, analyze it to find the spine content files, extract each of them, analyze which CSS files they link to, extract those CSS files, base64-decode them, wrap the CSS text in some XML element, and feed the XHTML and the wrapped CSS as multiple input documents into the XSLT via the default collection mechanism. It would be much more straightforward if everything were simply unpacked: the spine XHTML documents could be read straight from the filesystem, and the XSLT could pull linked stylesheets via unparsed-text() while processing each XHTML document (second sketch below).

When dealing with text (such as CSS) that is read from the filesystem, determining the charset may be an issue. Some text formats, such as HTML5 or CSS, let the author specify the charset in a format-specific way. But most of the time, the encoding has to be known in advance or determined heuristically. I'd appreciate it if p:data and pxp:unzip (or rather, p:unzip) had an option for the pipeline author to specify an expected charset or, alternatively/additionally, an option for the processor to infer the text encoding. In addition, p:unzip should have an option that tells it to normalize text to UTF-8 during extraction. Or c:file (in the c:zipfile manifest) should carry an optional charset attribute that indicates the inferred charset (third sketch below). Otherwise, you'd have to implement these heuristics in later processing steps, for example by trying several charsets in turn with unparsed-text() and catching the errors (fourth sketch below).

If I may raise a third issue, also related to non-XML data: it would be nice if there were a generic step for reading (raster) image information, such as pixel dimensions, density, color space, depth, etc. (fifth sketch below). But this is arguably more in the extension realm than zip and text processing, so I won't propose it as core functionality. Where does it stop? The next guy will ask for steps to resize or despeckle images. Well, we need an ImageMagick library not only for XProc, but for XPath in general. The main reason why you can do more things in, say, Python than in XQuery or XSLT is that more libraries are available.
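First sketch: a hypothetical 'unzip-to' option on pxp:unzip. Neither the option name nor the behavior exists today; the idea is that the step expands the whole archive as a side effect and still returns the usual c:zipfile manifest:

  <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                  xmlns:pxp="http://exproc.org/proposed/steps"
                  version="1.0">
    <p:output port="result"/>
    <!-- Assumes a processor that ships pxp:unzip (an import of its
         extension step library may be required). The 'unzip-to' option
         is hypothetical: expand the whole archive to that directory so
         that subsequent steps can work on the expanded content. -->
    <pxp:unzip href="file:///tmp/book.epub"
               unzip-to="file:///tmp/book-expanded/"/>
  </p:declare-step>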
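Second sketch: once the EPUB is unpacked, an XSLT 2.0 stylesheet run on a spine XHTML document can pull each linked stylesheet directly. The 'css' wrapper element is made up for illustration; everything else is standard XSLT 2.0:

  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:xhtml="http://www.w3.org/1999/xhtml"
                  version="2.0">
    <!-- Identity transform: copy everything unchanged... -->
    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>
    <!-- ...but materialize the text of each linked CSS file so that a
         downstream CSS parser (ours is written in XSLT 2) can see it. -->
    <xsl:template match="xhtml:link[@rel = 'stylesheet']">
      <css href="{@href}">
        <xsl:value-of select="unparsed-text(resolve-uri(@href, base-uri(.)))"/>
      </css>
    </xsl:template>
  </xsl:stylesheet>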
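Third sketch: the charset proposals. Both the 'charset' attribute on p:data and the 'charset' attribute on c:file are hypothetical; no current spec or implementation has them:

  <p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
    <p:output port="result"/>
    <p:identity>
      <p:input port="source">
        <!-- Hypothetical: the pipeline author declares the expected
             encoding instead of getting base64-encoded data back. -->
        <p:data href="styles/main.css" wrapper="css" charset="utf-8"/>
      </p:input>
    </p:identity>
  </p:declare-step>

And, on the output side, the unzip manifest could report what the processor inferred:

  <c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step"
             href="file:///tmp/book.epub">
    <!-- Hypothetical charset attribute carrying the inferred encoding. -->
    <c:file name="OEBPS/styles/main.css" charset="windows-1252"/>
  </c:zipfile>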
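Fourth sketch: what the downstream heuristic looks like in plain XSLT 2.0 today. The two-argument forms of unparsed-text-available()/unparsed-text() let you probe encodings in turn; the function name, namespace, and encoding list here are mine:

  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:xs="http://www.w3.org/2001/XMLSchema"
                  xmlns:my="http://example.org/ns"
                  version="2.0">
    <!-- Probe likely encodings in order. iso-8859-1 maps every byte to
         a character, so it serves as a last resort. -->
    <xsl:function name="my:read-text" as="xs:string?">
      <xsl:param name="href" as="xs:string"/>
      <xsl:variable name="encodings" as="xs:string+"
                    select="('utf-8', 'windows-1252', 'iso-8859-1')"/>
      <xsl:variable name="usable" as="xs:string?"
                    select="$encodings[unparsed-text-available($href, .)][1]"/>
      <xsl:sequence select="if (exists($usable))
                            then unparsed-text($href, $usable)
                            else ()"/>
    </xsl:function>
  </xsl:stylesheet>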
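Fifth sketch, purely to illustrate the image-information idea; the step type, its namespace, and the result vocabulary are entirely invented:

  <!-- Hypothetical step... -->
  <ex:image-info xmlns:ex="http://example.org/ns/steps"
                 href="OEBPS/images/cover.jpg"/>

  <!-- ...which might return something along these lines: -->
  <c:result xmlns:c="http://www.w3.org/ns/xproc-step">
    <c:image width="600" height="800" depth="8"
             color-space="sRGB" density="72"/>
  </c:result>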
Gerrit

On 2012-01-05 14:21, James Fuller wrote:
> As we review where we go from here with xproc.vnext can I ask people
> on this list to comment on:
>
> * highlight their top 4-5 pain points using XProc from a usability
>   perspective. We have captured some of these here:
>
>   http://www.w3.org/wiki/Usability
>
> * expand on what you think may be useful for xproc.vnext; once again
>   we have captured some of this here:
>
>   http://www.w3.org/wiki/XprocVnext
>
> * comment on expectations for timelines for an xproc.vnext, as well as
>   highlighting key priorities, e.g. is this a short 'fix what's broke'
>   or something more 'revolutionary'?
>
> appreciate everyone taking time and effort on this.
>
> Jim Fuller