Re: Dealing with encoding

From: Conal Tuohy <ctuohy@unimelb.edu.au>
Date: Thu, 18 Feb 2010 18:54:54 +1100
To: xproc-dev@w3.org
Message-id: <4B7CF24E.2070502@unimelb.edu.au>

Hi Stefanie

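I think you've misinterpreted p:http-request/@encoding, actually. As far as
I know, it is one of the step's serialization options, so it affects how a
request body is serialized, not how the response is decoded.

If you know your HTML files use windows-1252, I suggest you fetch them with
p:http-request as binary files (which you will receive as base64-encoded
byte streams), and then pass the result to p:unescape-markup, specifying the
charset at that point (along with encoding="base64").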
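
Something along these lines is what I have in mind. Treat it as a rough
sketch rather than a tested pipeline: the file URI is only a placeholder,
and I'm assuming that overriding the content type to
application/octet-stream is enough to get the body back base64-encoded,
that your processor accepts file: URIs in p:http-request, and that it
supports text/html in p:unescape-markup (support for that content type is
implementation-defined).

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">
  <p:output port="result"/>

  <!-- Fetch the file as opaque bytes. The href is a placeholder path.
       Assumption: forcing a non-text, non-XML content type makes the
       step return a c:body element holding base64-encoded content. -->
  <p:http-request>
    <p:input port="source">
      <p:inline>
        <c:request method="GET"
                   href="file:///path/to/page.html"
                   override-content-type="application/octet-stream"/>
      </p:inline>
    </p:input>
  </p:http-request>

  <!-- Decode the base64 content, read the bytes as windows-1252 and
       parse them as HTML. -->
  <p:unescape-markup content-type="text/html"
                     encoding="base64"
                     charset="windows-1252"/>
</p:declare-step>
```

The parsed markup should come out as the child of the wrapper element
returned by p:http-request (c:body here), so a later step can unwrap it if
you only want the HTML document itself.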

Incidentally, Calabash uses TagSoup to parse HTML, so you may well not
need HTML Tidy at all.

Cheers!

Conal

PS I posted this message this morning but it doesn't seem to have come through - I apologise if you get two copies.

Stefanie Haupt wrote:
> Hi all,
>
> I have some messily encoded HTML data which I want to process, as a
> first step, with HTML Tidy, and then run some more operations controlled
> by an XProc pipeline. Since there is more than one file, I understand I
> should use p:http-request in combination with the file protocol (since
> it's local data).
> So I thought of using try/catch, but the try group part is either ignored
> or never succeeds, as the catch part is invoked for all files. Can you
> please have a look and tell me what I'm doing wrong here?
>
> I'm using Calabash from within <oXygen/> XML Editor 11.1, build
> 2009121712 on Linux (Ubuntu).
>
> <p:try>
>   <p:group>
>     <p:http-request encoding="windows-1252"/>
>     <p:exec command="/usr/bin/tidy" source-is-xml="false"
>             result-is-xml="true" wrap-result-lines="false"
>             encoding="windows-1252">
>       <p:with-option name="args" select="'--quiet yes --show-warnings no --output-xml yes --bare yes --doctype omit --numeric-entities yes --char-encoding win1252'"/>
>     </p:exec>
>     <p:exec name="iconv" command="/usr/bin/iconv" result-is-xml="true"
>             source-is-xml="true" wrap-result-lines="false"
>             encoding="windows-1252">
>       <p:with-option name="args" select="'-f WINDOWS-1252 -t UTF-8'"/>
>     </p:exec>
>   </p:group>
>
>   <p:catch>
>     <p:http-request/>
>     <p:exec command="/usr/bin/tidy" source-is-xml="false"
>             result-is-xml="true" wrap-result-lines="false">
>       <p:with-option name="args" select="'--quiet yes --show-warnings no --output-xml yes --bare yes --doctype omit --numeric-entities yes --char-encoding utf8'"/>
>     </p:exec>
>   </p:catch>
> </p:try>
>
> Many thanks for your help!
> Stefanie
>
Received on Thursday, 18 February 2010 13:06:14 GMT
