Re: calling for xproc pain points, requested features, etc

These are suggestions for facilitating zip and text processing:

I'd appreciate it if zip input became a first-class citizen, i.e. if the 
pxp:unzip step were promoted to the standard step library.

In addition to what pxp:unzip provides, it would be nice to have an 
option that makes the unzip step, as a side effect, expand the whole zip 
file to some filesystem location, so that subsequent steps can work on 
the expanded content.
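A sketch of what such an invocation could look like (the dest-dir option 
is made up for illustration; pxp:unzip currently has no such option):

```xml
<!-- hypothetical "dest-dir" option: expand the whole archive to a
     directory as a side effect; without a "file" option the step would
     still emit the c:zipfile directory listing on its result port -->
<pxp:unzip href="book.epub" dest-dir="file:///tmp/book/"/>
```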

The two shortcomings of the current pxp:unzip are that it can only 
extract individually selected files and that text (like all other 
non-XML content) is always extracted as base64-encoded data.
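To illustrate the current behavior (option names as in the Calabash 
extension step, to the best of my knowledge):

```xml
<!-- current behavior: you name one file from the archive; since CSS is
     not XML, the result arrives base64-encoded in a c:data wrapper and
     has to be decoded downstream -->
<pxp:unzip href="book.epub" file="OEBPS/styles/main.css"
           content-type="text/css"/>
```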

An area where XProc really shines is checking and transforming data in 
zipped container formats (such as IDML, EPUB, OpenDocument, or docx).

Consider the case of an EPUB that contains XHTML files that link to CSS 
stylesheets. We have implemented a CSS parser in XSLT 2.0. In order to 
make it work without too many annoying workarounds, we have to unpack 
the zip file first. Otherwise we'd have to extract the OPF file, analyze 
it to find the spine content files, extract each of them, analyze which 
CSS files they link to, extract these CSS files, base64-decode them, 
wrap the CSS text in some XML element, and feed the XHTML and the 
wrapped CSS as multiple input documents into the XSLT via the default 
collection mechanism. It would be much more straightforward if 
everything were just unpacked: the spine XHTML documents could simply be 
read from the filesystem, and the XSLT could pull in linked stylesheets 
via unparsed-text() while processing each XHTML document.
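With the content unpacked, the XSLT side reduces to something like this 
(a sketch; assumes the html prefix is bound to the XHTML namespace, and 
css:parse stands in for our parser's entry point):

```xml
<!-- while processing an unpacked spine XHTML document, read each
     linked stylesheet directly from the filesystem -->
<xsl:template match="html:link[@rel = 'stylesheet']">
  <xsl:variable name="css-uri" select="resolve-uri(@href, base-uri(.))"/>
  <xsl:if test="unparsed-text-available($css-uri)">
    <!-- hand the raw CSS text to the parser -->
    <xsl:sequence select="css:parse(unparsed-text($css-uri))"/>
  </xsl:if>
</xsl:template>
```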

When dealing with text (such as CSS) that is being read from the 
filesystem, determining the charset may be an issue. Some text formats 
such as HTML5 or CSS provide mechanisms where the author can specify the 
charset in some format-specific way (e.g. a CSS @charset rule or an 
HTML5 meta charset declaration). But most of the time, the encoding 
has to be known in advance or determined heuristically.

I'd appreciate it if p:data and pxp:unzip (or rather, p:unzip) had an 
option that lets the pipeline author specify an expected charset or, 
alternatively/additionally, an option that tells the processor to infer 
the text encoding.
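A sketch of what that could look like on p:data (the charset option is 
the proposed addition, not part of XProc 1.0; "auto" would mean "infer 
it"):

```xml
<p:input port="source">
  <!-- proposed: tell the processor how to decode the bytes before
       wrapping them, instead of falling back to base64 -->
  <p:data href="styles/main.css" content-type="text/css" charset="auto"/>
</p:input>
```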

In addition, p:unzip should have an option that tells it to normalize 
text to UTF-8 during extraction. Or c:file (in the zip file's directory 
listing) should carry an optional charset attribute that indicates the 
inferred charset. Otherwise, you'd have to implement these heuristics in 
later processing steps, for example by trying several charsets in turn 
with unparsed-text() and catching the errors.
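For completeness, that workaround looks roughly like this in XSLT 2.0, 
using the two-argument forms of unparsed-text-available() and 
unparsed-text() ($css-uri is assumed to hold the stylesheet URI):

```xml
<!-- try a list of candidate encodings and take the first one that
     yields decodable text; note that single-byte encodings such as
     iso-8859-1 rarely fail to decode, so this heuristic is weak -->
<xsl:variable name="encoding" as="xs:string?"
  select="(('utf-8', 'iso-8859-1', 'windows-1252')
           [unparsed-text-available($css-uri, .)])[1]"/>
<xsl:variable name="css" select="unparsed-text($css-uri, $encoding)"/>
```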

If I may raise a third issue, also related to non-XML data: it would be 
nice if there were a generic step for reading (raster) image 
information, such as pixel dimensions, density, color space, bit depth, 
etc. But this is arguably more in the extension realm than zip and text 
processing, so I won't propose it as core functionality. Where does it 
stop? The next guy will ask for steps to resize or despeckle images. 
Well, we need an ImageMagick library not only for XProc, but for XPath 
in general. The main reason why you can do more things in, say, Python 
than in XQuery or XSLT is that there are more libraries available.

Gerrit

On 2012-01-05 14:21, James Fuller wrote:
> As we review where we go from here with xproc.vnext can I ask people
> on this list to comment on;
>
> * highlight their top 4-5 pain points using XProc from a usability
> perspective. We have captured some of these here;
>
>       http://www.w3.org/wiki/Usability
>
> * expand on what you think maybe useful for xproc.vnext, once again we
> have captured some of this here
>
>      http://www.w3.org/wiki/XprocVnext
>
> * comment on expectations for timelines on an xproc.vnext as well as
> highlighting key priorities e.g. is this is a short 'fix whats broke'
> or something more 'revolutionary' ?
>
> appreciate everyone taking time and effort on this.
>
> Jim Fuller
>

Received on Friday, 6 January 2012 02:47:27 UTC