
Re: Dealing with encoding

From: Stefanie Haupt <st.haupt@gmail.com>
Date: Thu, 18 Feb 2010 09:16:40 +0100
To: Conal Tuohy <ctuohy@unimelb.edu.au>
Cc: xproc-dev@w3.org
Message-Id: <1266481000.7036.17.camel@stefanie-laptop>

Hi Con,

Many thanks for your reply and for showing me where I went wrong.

I learned that when you give p:data a charset, the file won't load if
its actual encoding doesn't match. So with try/catch it's possible to
catch the 'baddies' (not all files are encoded as windows-1252). I had
hoped for the same mechanism in p:http-request.

If I understand you correctly, when I specify a charset on the
p:unescape-markup step I *should* already know the proper encoding of
the file at that point. So I think my question is: is there a way, using
XProc, to identify the charset when processing multiple input files?
Especially when the HTML metadata contains no encoding information, or
the information it contains is wrong.
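
To make the try/catch idea concrete, here is a minimal sketch of the same
"try one charset, fall back to the next" logic, written in Python rather
than XProc so it could run as a pre-processing step. The function name
guess_charset and the candidate list are assumptions for illustration,
not part of XProc or Calabash.

```python
from typing import Optional, Sequence, Tuple

# Candidate charsets, tried in order. UTF-8 goes first because text saved as
# windows-1252 with non-ASCII characters is very unlikely to be valid UTF-8.
# This list is an assumption for illustration; adjust it to the data at hand.
CANDIDATES = ("utf-8", "windows-1252")


def guess_charset(
    raw: bytes, candidates: Sequence[str] = CANDIDATES
) -> Tuple[Optional[str], Optional[str]]:
    """Return (charset, text) for the first candidate that decodes strictly.

    Returns (None, None) if every candidate fails; those files are the
    'baddies' that need a closer look by hand.
    """
    for charset in candidates:
        try:
            return charset, raw.decode(charset, errors="strict")
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    return None, None
```

A successful decode only shows that the bytes are valid in that charset,
not that the guess is right, so the order of the candidates matters and
the result is a heuristic rather than a proof.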

Many thanks in advance,
Stefanie
 

On Thursday, 18.02.2010, at 08:31 +1100, Conal Tuohy wrote:
> Hi Stefanie
> 
> I think you've misinterpreted p:http-request/@encoding, actually.
> 
> If you know your HTML files use windows-1252, I suggest you http-request
> them as binary files (which you will receive as base64-encoded
> bytestreams), and then pass the result to p:unescape-markup, specifying a
> charset at that time.
> 
> Incidentally, Calabash uses tagsoup to parse HTML, so you may well not
> need html tidy at all.
> 
> Cheers
> 
> Con
> 
> 
> > Hi all,
> >
> > I have some messily encoded HTML data which I want to process first
> > with HTML Tidy, and then run some further operations controlled by an
> > XProc pipeline. Since there is more than one file, I understand I should
> > use p:http-request with the file protocol (since the data is local).
> > So I thought of using try/catch, but the group inside the try is either
> > ignored or always fails, because the catch part is invoked for all files.
> > Can you please have a look and tell me what I'm doing wrong here?
> >
> > I'm using Calabash from within <oXygen/> XML Editor 11.1, build
> > 2009121712 on Linux (Ubuntu).
> >
> > <p:try>
> >   <p:group>
> >     <p:http-request encoding="windows-1252"/>
> >     <p:exec command="/usr/bin/tidy" source-is-xml="false"
> >           result-is-xml="true" wrap-result-lines="false"
> >           encoding="windows-1252">
> >       <p:with-option name="args" select="'--quiet yes --show-warnings no
> > --output-xml yes --bare yes --doctype omit --numeric-entities yes
> > --char-encoding win1252'"/>
> >     </p:exec>
> >     <p:exec name="iconv" command="/usr/bin/iconv" result-is-xml="true"
> > source-is-xml="true" wrap-result-lines="false"
> >               encoding="windows-1252">
> >       <p:with-option name="args" select="'-f WINDOWS-1252 -t UTF-8'"/>
> >     </p:exec>
> >   </p:group>
> >
> >   <p:catch>
> >     <p:http-request/>
> >     <p:exec command="/usr/bin/tidy" source-is-xml="false"
> >           result-is-xml="true" wrap-result-lines="false">
> >        <p:with-option name="args" select="'--quiet yes --show-warnings
> > no --output-xml yes --bare yes --doctype omit --numeric-entities yes
> > --char-encoding utf8'"/>
> >     </p:exec>
> >  </p:catch>
> > </p:try>
> >
> > Many thanks for your help!
> > Stefanie
> >
-- 
Stefanie Haupt
Received on Thursday, 18 February 2010 08:17:17 GMT
