Re: Missing something basic . . ? from Alex Muir on 2010-04-21 (xproc-dev@w3.org from April 2010)

From: Alex Muir <alex.g.muir@gmail.com>
Date: Wed, 21 Apr 2010 15:12:02 +0000
To: Toman_Vojtech@emc.com
Cc: xproc-dev@w3.org
Message-ID: <i2p88b533b91004210812l591d72dbt3c45c2ae3c93509@mail.gmail.com>
Hi,

I was having trouble with the unzip function as well.

I have an XProc process, not using zip, that loads HTML files via the
unparsed-text function in XSLT and converts them into XML for further
processing. I don't want to use TagSoup or Tidy to clean the HTML into XML;
I would rather analyze the HTML content and create my own XML
representation of the data.

I then wanted to use zipped HTML files to save space, but I wasn't able
to get it working.

I was thinking that I would be able to unzip the HTML and do something
similar to the unparsed-text($input_uri, 'UTF-8') function to get the data
into xml without using the tag soup/tidy.
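
To make the goal concrete: outside of XProc, the missing step is small. Here
is a minimal Python sketch (not XProc, and not a claim about how Calabash
works internally) that takes a c:data wrapper like the one in Christopher's
output quoted below and decodes its base64 payload into a string. The
function name decode_c_data and the UTF-8 default are my own assumptions for
illustration; the sample payload is the first line of Christopher's snippet.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace of the c:data wrapper, as it appears in the quoted output.
C_NS = "http://www.w3.org/ns/xproc-step"


def decode_c_data(xml_text, charset="utf-8"):
    """Decode the base64 payload of a c:data element into a string.

    Hypothetical helper for illustration; `charset` is an assumption,
    since the wrapper itself does not say how the text was encoded.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}data" % C_NS:
        raise ValueError("expected a c:data element, got %r" % root.tag)
    # Drop line breaks and indentation inside the payload before decoding.
    payload = "".join((root.text or "").split())
    return base64.b64decode(payload).decode(charset)


# First line of the payload from the quoted output, wrapped in c:data.
SAMPLE = """<c:data xmlns:c="http://www.w3.org/ns/xproc-step"
        name="InputFile1.txt" content-type="">
LS0tLS1CRUdJTiBQUklWQUNZLUVOSEFOQ0VEIE1FU1NBR0UtLS0tLQ0KUHJvYy1UeXBlOiAyMDAx
</c:data>"""

if __name__ == "__main__":
    print(decode_c_data(SAMPLE))
```

The decoded string could then be handed to whatever does the custom HTML
analysis, much like the result of unparsed-text.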

Is there a means to do that in xproc?

Regards
Alex




On Wed, Apr 21, 2010 at 1:30 PM, <Toman_Vojtech@emc.com> wrote:

>  Well, if you look closer at the specification of pxp:unzip (
> http://exproc.org/proposed/steps/other.html), this is actually the
> 'correct' behavior. Only if the content type is an XML content type, the
> data is returned without base64 encoding. All other content types (including
> text types) always result in base64 encoded data. I actually think this is a
> bug in the EXProc specification and that the result of pxp:unzip should be
> made consistent with what p:data does (i.e. not base64 encoding text content
> types).
>
>
>
> Regards,
>
> Vojtech
>
>
>
> *From:* Christopher Ball [mailto:christopher.r.ball@gmail.com]
> *Sent:* Wednesday, April 21, 2010 3:22 PM
> *To:* Toman, Vojtech; xproc-dev@w3.org
>
> *Subject:* RE: Missing something basic . . ?
>
>
>
> Tom,
>
>
>
> Thanks for the suggestion.
>
>
>
> Unfortunately, I forgot to mention in my original email that I had tried
> that permutation as well . . . without getting the desired effect =(
>
>
>
> With the single quotes, the content-type gets passed through but still
> seems to be getting ignored and I end up with an output file of the
> following nature:
>
>
>
> <!-- Output Snippet -->
>
> <c:data xmlns:c="http://www.w3.org/ns/xproc-step" name=
> "1stFranklinFinancialCorp_CIK0000038723.txt" content-type="text/plain">
> LS0tLS1CRUdJTiBQUklWQUNZLUVOSEFOQ0VEIE1FU1NBR0UtLS0tLQ0KUHJvYy1UeXBlOiAyMDAx
> LE1JQy1DTEVBUg0KT3JpZ2luYXRvci1OYW1lOiB3ZWJtYXN0ZXJAd3d3LnNlYy5nb3YNCk9yaWdp
> . . . </c:data>
>
>
>
> Dare I say this is a bug? If so, I suppose a workaround would be to cast
> back from base64 to string using an XPath function . . ?
>
>
>
> Thoughts?
>
>
>
> Christopher
>
>
>  ------------------------------
>
> *From:* xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] *On
> Behalf Of *Toman_Vojtech@emc.com
> *Sent:* Wednesday, April 21, 2010 3:27 AM
> *To:* xproc-dev@w3.org
> *Subject:* RE: Missing something basic . . ?
>
>
>
> Christopher,
>
>
>
> Try the following:
>
>
>
>   <cx:unzip>
>     …
>     <p:with-option name="content-type" select="'text/plain'"/>
>     …
>   </cx:unzip>
>
>
>
> (Single quotes around the text/plain value so that it is treated as a
> string and not as an XPath expression)
>
>
>
> That might help.
>
>
>
> Vojtech
>
>
>
>
>
> *From:* xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] *On
> Behalf Of *Christopher Ball
> *Sent:* Wednesday, April 21, 2010 3:20 AM
> *To:* xproc-dev@w3.org
> *Subject:* Missing something basic . . ?
>
>
>
> Hello,
>
>
>
> I am trying to process some zipped text files in xproc (leveraging a
> Calabash extension), but I am getting tripped up by base64 encoding.
>
>
>
> My first draft of the xproc is below. Unfortunately, the content-type
> option on cx:unzip seems to be getting ignored and I end up with an output
> file of the following nature:
>
>
>
> <!-- Output Snippet -->
>
> <c:data xmlns:c="http://www.w3.org/ns/xproc-step" name="InputFile1.txt"
> content-type="">LS0tLS1CRUdJTiBQUklWQUNZLUVOSEFOQ0VEIE1FU1NBR0UtLS0tLQ0KUHJvYy1UeXBlOiAyMDAx
> . . . </c:data>
>
>
>
> Am I missing the obvious . . . or trying to do the impossible?
>
>
>
> Most grateful for any feedback,
>
>
>
> Christopher
>
>
>
>
>
> <!-- xProc File -->
>
> <?xml version="1.0" encoding="UTF-8"?>
> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>                 xmlns:cx="http://xmlcalabash.com/ns/extensions"
>                 xmlns:c="http://www.w3.org/ns/xproc-step"
>                 xmlns:html="http://www.w3.org/1999/xhtml"
>                 name="aMeaninglessName"
>                 version="1.0" >
>
>     <p:input port="source">
>         <p:empty/>
>     </p:input>
>
>     <p:declare-step type="cx:unzip" version="1.0">
>         <p:output port="result"/>
>         <p:option name="href" required="true"/>
>         <p:option name="file"/>
>         <p:option name="content-type"/>
>     </p:declare-step>
>
>     <p:variable name="startingFileNumber" select="'1'"/>
>     <p:variable name="endingFileNumber" select="'1'"/>
>
>     <p:variable name="source-folder" select="'../zippedFiles/'"/>
>
>     <p:directory-list>
>         <p:with-option name="path" select="$source-folder">
>             <p:empty/>
>         </p:with-option>
>     </p:directory-list>
>
>     <p:for-each name="ZipedHTMLFile">
>         <p:iteration-source
>             select="//c:file[position() ge number($startingFileNumber) and position() le number($endingFileNumber)]"/>
>
>         <p:variable name="filename" select="c:file/@name"/>
>
>         <!-- Load from Zip file -->
>         <cx:unzip name="get-XML">
>             <p:with-option name="href" select="concat($source-folder,$filename)"/>
>             <p:with-option name="file" select="replace($filename,'.zip','.txt')"/>
>             <p:with-option name="content-type" select="text/plain"/>
>         </cx:unzip>
>
>         <p:store href="../output/processed.xml" name="store"/>
>
>     </p:for-each>
>
> </p:declare-step>
>



-- 
Alex
https://sites.google.com/a/utg.edu.gm/alex

Some Good Music
http://sites.google.com/site/greigconteh/
Received on Wednesday, 21 April 2010 15:20:51 UTC