- From: Stefanie Haupt <st.haupt@gmail.com>
- Date: Thu, 18 Feb 2010 09:16:40 +0100
- To: Conal Tuohy <ctuohy@unimelb.edu.au>
- Cc: xproc-dev@w3.org
Hi Con,

many thanks for your reply and for showing me where I went wrong.

I learned that when you give p:data information about the charset, the file won't load if its encoding doesn't match. So with try/catch it's possible to get hold of the 'baddies' (not all files are encoded as windows-1252). I had hoped for such a mechanism in p:http-request.

If I understand you correctly, when I specify a charset on the p:unescape-markup step I *should* already know the proper encoding of the file at that point. So I think my question is: is there a way, using XProc, to identify the charset when processing multiple input files? Especially when the HTML metadata carries no charset information, or that information is wrong.

Many thanks in advance,
Stefanie

On Thursday, 18.02.2010, at 08:31 +1100, Conal Tuohy wrote:
> Hi Stephanie
>
> I think you've misinterpreted p:http-request/@encoding, actually.
>
> If you know your HTML files use windows-1252, I suggest you http-request
> them as binary files (which you will receive as base64-encoded
> bytestreams), and then pass the result to p:unescape-markup, specifying a
> charset at that time.
>
> Incidentally, Calabash uses tagsoup to parse HTML, so you may well not
> need html tidy at all.
>
> Cheers
>
> Con
>
> > Hi all,
> >
> > I have some messy encoded HTML data which I want to process in a first
> > step with html tidy and then do some more operations controlled by a
> > xproc pipeline. Since it's more than one file I understand I use
> > p:http-request in combination with file protocol (since it's local
> > data).
> > So I thought of using try/catch but the try group part either is ignored
> > or never true as the catch part is invoked for all files. Can you please
> > have a look and tell me what I'm doing wrong here?
> >
> > I'm using Calabash from within <oXygen/> XML Editor 11.1, build
> > 2009121712 on Linux (Ubuntu).
> >
> > <p:try>
> >   <p:group>
> >     <p:http-request encoding="windows-1252"/>
> >     <p:exec command="/usr/bin/tidy" source-is-xml="false"
> >             result-is-xml="true" wrap-result-lines="false"
> >             encoding="windows-1252">
> >       <p:with-option name="args" select="'--quiet yes --show-warnings no --output-xml yes --bare yes --doctype omit --numeric-entities yes --char-encoding win1252'"/>
> >     </p:exec>
> >     <p:exec name="iconv" command="/usr/bin/iconv" result-is-xml="true"
> >             source-is-xml="true" wrap-result-lines="false"
> >             encoding="windows-1252">
> >       <p:with-option name="args" select="'-f WINDOWS-1252 -t UTF-8'"/>
> >     </p:exec>
> >   </p:group>
> >
> >   <p:catch>
> >     <p:http-request/>
> >     <p:exec command="/usr/bin/tidy" source-is-xml="false"
> >             result-is-xml="true" wrap-result-lines="false">
> >       <p:with-option name="args" select="'--quiet yes --show-warnings no --output-xml yes --bare yes --doctype omit --numeric-entities yes --char-encoding utf8'"/>
> >     </p:exec>
> >   </p:catch>
> > </p:try>
> >
> > Many thanks for your help!
> > Stefanie

--
Stefanie Haupt
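
For illustration only, here is a minimal sketch of the try/catch idea described in the message above: load the file through p:data with one declared charset and fall back to another if loading fails. It rests on several assumptions. The file name page.html is hypothetical. The processor is assumed to raise an error when the bytes do not match the declared charset, which is the behaviour reported above and may be implementation-dependent. The processor is also assumed to accept text/html in p:unescape-markup, as Calabash is said to do via tagsoup. This is trial decoding, not real charset detection: nearly any byte sequence decodes as windows-1252, so the fallback branch cannot confirm its own guess.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:output port="result"/>

  <p:try>
    <p:group>
      <!-- First guess: UTF-8. Assumption: the processor fails here
           when the file contains bytes that are not valid UTF-8. -->
      <p:unescape-markup content-type="text/html">
        <p:input port="source">
          <!-- page.html is a hypothetical file name -->
          <p:data href="page.html"
                  content-type="text/plain; charset=utf-8"/>
        </p:input>
      </p:unescape-markup>
    </p:group>

    <p:catch>
      <!-- Fallback guess: windows-1252. This branch only runs when
           the UTF-8 attempt raised an error. -->
      <p:unescape-markup content-type="text/html">
        <p:input port="source">
          <p:data href="page.html"
                  content-type="text/plain; charset=windows-1252"/>
        </p:input>
      </p:unescape-markup>
    </p:catch>
  </p:try>
</p:declare-step>
```

Trying the stricter charset first matters: if windows-1252 were tried first, it would most likely succeed for every file and the catch branch would never run.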
Received on Thursday, 18 February 2010 08:17:17 UTC