- From: <Toman_Vojtech@emc.com>
- Date: Fri, 17 Jul 2009 03:06:25 -0400
- To: <xproc-dev@w3.org>
- Message-ID: <6E216CCE0679B5489A61125D0EFEC78710367C96@CORPUSMX10A.corp.emc.com>
Hi,

You should be able to specify the charset using either the "charset" option or the "content-type" option. Both should work. However, note that the charset information is only used when the input data is base64 encoded. If it isn't, the charset information is ignored by p:unescape-markup.

Suppose you have the following HTML document (kun.html, windows-1252):

    <html>
      <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
      <body>...</body>
    </html>

Because HTML is generally not well-formed XML, you will need to use p:data to load the HTML document in p:unescape-markup. I see at least two ways to get the result you want:

1. Make sure that p:data base64 encodes the data, and that p:unescape-markup then decodes it using the appropriate charset (windows-1252):

    <p:unescape-markup content-type="text/html" encoding="base64" charset="windows-1252">
      <p:input port="source">
        <p:data href="kun.html" content-type="application/octet-stream"/>
      </p:input>
    </p:unescape-markup>

(Notice that the p:data binding specifies "application/octet-stream" as the content-type. This ensures that the HTML document is treated as binary data and is therefore base64 encoded.)

2. In p:data, load the HTML document with the "text/html; charset=windows-1252" content-type (in this case, the result will not be base64 encoded, because p:data will convert the HTML into a sequence of Unicode characters), and then process it in p:unescape-markup:

    <p:unescape-markup content-type="text/html">
      <p:input port="source">
        <p:data href="/home/vojtech/kun.html" content-type="text/html; charset=windows-1252"/>
      </p:input>
    </p:unescape-markup>

I have tested both ways in Calumet, and in both cases I got the same result:

    <?xml version="1.0" encoding="UTF-8"?>
    <c:data content-type="text/html (or application/octet-stream in case 1)" xmlns:c="http://www.w3.org/ns/xproc-step"><html>
      <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
      <body>...</body>
    </html></c:data>

Hope this helps.
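To illustrate why the charset only matters in the base64 case, here is a small Python sketch. It is not XProc, and it is not how Calumet or Calabash implement these steps; it only mimics the two data flows described above. The function names are hypothetical, and the sample title is an assumption, chosen so that every character can be encoded in windows-1252.

```python
import base64


def load_as_binary(raw: bytes) -> str:
    """Mimic case 1 loading: the raw bytes travel as base64 text."""
    return base64.b64encode(raw).decode("ascii")


def unescape_base64(payload: str, charset: str) -> str:
    """Mimic case 1 unescaping: undo base64, then decode bytes with the charset."""
    return base64.b64decode(payload).decode(charset)


def load_as_text(raw: bytes, charset: str) -> str:
    """Mimic case 2 loading: bytes are decoded to characters right away."""
    return raw.decode(charset)


# Hypothetical sample document fragment, stored as windows-1252 bytes.
sample = "<title>Šéf žádá úpravy</title>"
raw = sample.encode("windows-1252")

# Case 1: the charset is applied after the base64 layer is removed.
via_base64 = unescape_base64(load_as_binary(raw), "windows-1252")

# Case 2: the charset is applied at load time; nothing is left to decode later.
via_text = load_as_text(raw, "windows-1252")

# Decoding the same bytes with the wrong charset silently corrupts the text.
wrong = unescape_base64(load_as_binary(raw), "iso-8859-1")

print(via_base64 == via_text)
print(wrong == sample)
```

In case 2 the text is already a sequence of characters by the time it would reach the unescaping step, so a charset given there has nothing to act on. That matches the behaviour described above, where the charset is ignored unless the data is base64 encoded.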
Vojtech

________________________________
From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On Behalf Of lists@chiborg.otherinbox.com
Sent: Thursday, July 16, 2009 7:34 PM
To: xproc-dev@w3.org
Subject: Is the charset supported with unescape-markup?

Hello,

I've tried to use unescape-markup to clean up some ugly HTML code. I have TagSoup installed and in my classpath. The step itself works fine except for one detail: the non-ASCII characters. The HTML file is windows-1252 encoded. I've tried to use the "charset" attribute, and I've tried 'content-type="text/html; charset=windows-1252"', but nothing helped. Is the charset attribute working? Am I doing something wrong?

I'm using the newest Calabash release with the following XProc:
Received on Friday, 17 July 2009 07:08:02 UTC