
RE: Is the charset supported with unescape-markup?

From: <Toman_Vojtech@emc.com>
Date: Fri, 17 Jul 2009 03:06:25 -0400
Message-ID: <6E216CCE0679B5489A61125D0EFEC78710367C96@CORPUSMX10A.corp.emc.com>
To: <xproc-dev@w3.org>
You should be able to specify the charset either with the "charset" option or as part of the "content-type" option; both should work. Note, however, that the charset information is only used when the input data is base64-encoded. If it isn't, p:unescape-markup ignores the charset information.
Suppose you have the following HTML document (kun.html, windows-1252):
  <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
Because HTML is generally not well-formed XML, you will need to use p:data to load the HTML document for p:unescape-markup. I see at least two ways to get the result you want:
1. Make sure that p:data base64-encodes the data, and let p:unescape-markup decode it using the appropriate charset (windows-1252):
<p:unescape-markup content-type="text/html" encoding="base64" charset="windows-1252">
  <p:input port="source">
    <p:data href="kun.html" content-type="application/octet-stream"/>
  </p:input>
</p:unescape-markup>
(Notice that the p:data binding specifies "application/octet-stream" as the content-type. This is to ensure the HTML document will be treated as binary data and therefore base64 encoded.)
2. In p:data, load the HTML document with the "text/html; charset=windows-1252" content-type (in this case, the result will not be base64 encoded because p:data will convert the HTML into a sequence of Unicode characters), and then process it in p:unescape-markup:
<p:unescape-markup content-type="text/html">
  <p:input port="source">
    <p:data href="/home/vojtech/kun.html" content-type="text/html; charset=windows-1252"/>
  </p:input>
</p:unescape-markup>
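
For reference, the first option could be wrapped in a complete pipeline roughly as follows. This is only a sketch and is not taken from the tested setup: the p:declare-step wrapper, the version attribute, and the unbound p:output port are assumptions about a minimal XProc 1.0 pipeline, and kun.html is assumed to sit next to the pipeline file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: minimal XProc 1.0 wrapper around option 1 (assumed layout). -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- With no explicit binding, the primary output of the last step is used. -->
  <p:output port="result"/>

  <!-- p:data loads the file as binary (base64); p:unescape-markup decodes it
       as windows-1252 and parses the result as text/html. -->
  <p:unescape-markup content-type="text/html" encoding="base64" charset="windows-1252">
    <p:input port="source">
      <p:data href="kun.html" content-type="application/octet-stream"/>
    </p:input>
  </p:unescape-markup>
</p:declare-step>
```

For the second option, the same wrapper should work with the encoding and charset options removed and the charset moved into the content-type of p:data.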

I have tested both ways in Calumet, and in both cases I got the same result:
<?xml version="1.0" encoding="UTF-8"?>
<c:data content-type="text/html (or application/octet-stream in case 1)" xmlns:c="http://www.w3.org/ns/xproc-step"><html>
  <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
Hope this helps.


	From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On Behalf Of lists@chiborg.otherinbox.com
	Sent: Thursday, July 16, 2009 7:34 PM
	To: xproc-dev@w3.org
	Subject: Is the charset supported with unescape-markup?
	Hello, I've tried to use unescape-markup to clean up some ugly HTML code. I have TagSoup installed and in my classpath. The step itself works fine except for one detail: the non-ASCII characters. The HTML file is windows-1252 encoded. I've tried to use the "charset" attribute, and I've tried 'content-type="text/html; charset=windows-1252"', but nothing helped. Is the charset attribute working? Am I doing something wrong? I'm using the newest Calabash release with the following XProc: 
Received on Friday, 17 July 2009 07:08:02 UTC
