Re: Dealing with encoding

From: Conal Tuohy <ctuohy@unimelb.edu.au>
Date: Thu, 18 Feb 2010 08:31:26 +1100 (EST)
To: Stefanie Haupt <st.haupt@gmail.com>
Cc: xproc-dev@w3.org
Message-id: <40396.>
Hi Stefanie

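I think you've misinterpreted the encoding option on p:http-request,
actually. As far as I can tell from the spec, it is one of the
serialization options, which control how a request body is written out,
not how the response is decoded.

If you know your HTML files use windows-1252, I suggest you fetch them
with p:http-request as binary files (which you will receive as
base64-encoded byte streams), and then pass the result to
p:unescape-markup, specifying the charset at that point.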
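
A minimal, untested sketch of that approach, with the assumptions
flagged: the file: URI is a placeholder, and
override-content-type="application/octet-stream" is my guess at how to
make the processor treat the response as binary. Whether p:http-request
accepts file: URIs at all depends on the implementation, and the sketch
assumes the processor's p:unescape-markup supports
content-type="text/html".

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">
  <p:output port="result"/>

  <!-- Fetch the file as opaque bytes. The href is a placeholder path.
       Overriding the content type (an assumption, see above) should make
       the step return <c:body encoding="base64">...</c:body>. -->
  <p:http-request>
    <p:input port="source">
      <p:inline>
        <c:request method="GET"
                   href="file:///path/to/page.html"
                   override-content-type="application/octet-stream"/>
      </p:inline>
    </p:input>
  </p:http-request>

  <!-- Decode the base64 payload as windows-1252 text and parse it as
       HTML. The charset is stated here, not on p:http-request. -->
  <p:unescape-markup content-type="text/html"
                     encoding="base64"
                     charset="windows-1252"/>
</p:declare-step>
```

If that works, the parsed HTML should come back as the child of the
c:body wrapper, ready for the rest of the pipeline.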

Incidentally, Calabash uses TagSoup to parse HTML, so you may well not
need HTML Tidy at all.



> Hi all,
> I have some messily encoded HTML data which I want to process first
> with HTML Tidy and then with some further operations controlled by an
> XProc pipeline. Since there is more than one file, I understand I
> should use p:http-request in combination with the file protocol (since
> it's local data).
> So I thought of using try/catch, but the p:group part of the p:try is
> either ignored or always fails, as the p:catch part is invoked for all
> files. Can you please have a look and tell me what I'm doing wrong
> here?
> I'm using Calabash from within <oXygen/> XML Editor 11.1, build
> 2009121712 on Linux (Ubuntu).
> <p:try>
>   <p:group>
>     <p:http-request encoding="windows-1252"/>
>     <p:exec command="/usr/bin/tidy" source-is-xml="false"
>           result-is-xml="true" wrap-result-lines="false"
>           encoding="windows-1252">
>       <p:with-option name="args" select="'--quiet yes --show-warnings no
> --output-xml yes --bare yes --doctype omit --numeric-entities yes
> --char-encoding win1252'"/>
>     </p:exec>
>     <p:exec name="iconv" command="/usr/bin/iconv" result-is-xml="true"
> source-is-xml="true" wrap-result-lines="false"
>               encoding="windows-1252">
>       <p:with-option name="args" select="'-f WINDOWS-1252 -t UTF-8'"/>
>     </p:exec>
>   </p:group>
>   <p:catch>
>     <p:http-request/>
>     <p:exec command="/usr/bin/tidy" source-is-xml="false"
>           result-is-xml="true" wrap-result-lines="false">
>        <p:with-option name="args" select="'--quiet yes --show-warnings
> no --output-xml yes --bare yes --doctype omit --numeric-entities yes
> --char-encoding utf8'"/>
>     </p:exec>
>  </p:catch>
> </p:try>
> Many thanks for your help!
> Stefanie
Received on Thursday, 18 February 2010 13:06:14 UTC