RE: Is the charset supported with unescape-markup?

From: <Toman_Vojtech@emc.com>
Date: Fri, 17 Jul 2009 03:06:25 -0400
Message-ID: <6E216CCE0679B5489A61125D0EFEC78710367C96@CORPUSMX10A.corp.emc.com>
To: <xproc-dev@w3.org>
Hi,
 
You should be able to specify the charset either with the "charset" option or as a parameter of the "content-type" option; both should work. Note, however, that p:unescape-markup uses the charset information only when the input data is base64-encoded. Otherwise, the charset information is ignored.
 
Suppose you have the following HTML document (kun.html, windows-1252):
 
<html>
  <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
  <body>...</body>
</html>
 
Because HTML is generally not well-formed XML, you need to load the HTML document with p:data and pass it to p:unescape-markup. I see at least two ways to get the result you want:
 
1. Make sure that p:data base64-encodes the data, and have p:unescape-markup decode it using the appropriate charset (windows-1252):
 
<p:unescape-markup content-type="text/html" encoding="base64" charset="windows-1252">
  <p:input port="source">
    <p:data href="kun.html" content-type="application/octet-stream"/>
  </p:input>
</p:unescape-markup>
 
(Notice that the p:data binding specifies "application/octet-stream" as the content-type. This is to ensure the HTML document will be treated as binary data and therefore base64 encoded.)
 
2. In p:data, load the HTML document with the "text/html; charset=windows-1252" content-type (in this case, the result will not be base64 encoded because p:data will convert the HTML into a sequence of Unicode characters), and then process it in p:unescape-markup:
 
<p:unescape-markup content-type="text/html">
  <p:input port="source">
    <p:data href="/home/vojtech/kun.html" content-type="text/html; charset=windows-1252"/>
  </p:input>
</p:unescape-markup>

I have tested both ways in Calumet, and in both cases I got the same result:
 
<?xml version="1.0" encoding="UTF-8"?>
<c:data content-type="text/html (or application/octet-stream in case 1)" xmlns:c="http://www.w3.org/ns/xproc-step"><html>
  <head><title>Příliš žluťoučký kůň úpěl ďábelské ódy</title></head>
  <body>...</body>
</html></c:data>
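
As an untested sketch of the other option mentioned at the top (the charset carried as a parameter of the "content-type" option instead of the separate "charset" option), case 1 could be wrapped in a complete pipeline like the one below. The p:declare-step wrapper, its version attribute, and the relative kun.html path are assumptions for illustration. Whether a given processor honors the charset parameter in this position has not been verified here; if it does not, the explicit "charset" option from case 1 is the safer choice.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- The primary output is bound to the result of the last step. -->
  <p:output port="result"/>

  <!-- Assumption: the charset parameter in content-type is used when
       decoding the base64 data, in place of the charset option. -->
  <p:unescape-markup content-type="text/html; charset=windows-1252" encoding="base64">
    <p:input port="source">
      <!-- application/octet-stream makes p:data base64-encode the file -->
      <p:data href="kun.html" content-type="application/octet-stream"/>
    </p:input>
  </p:unescape-markup>
</p:declare-step>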
 
Hope this helps.
 
Vojtech


________________________________

	From: xproc-dev-request@w3.org [mailto:xproc-dev-request@w3.org] On Behalf Of lists@chiborg.otherinbox.com
	Sent: Thursday, July 16, 2009 7:34 PM
	To: xproc-dev@w3.org
	Subject: Is the charset supported with unescape-markup?
	
	
	Hello, I've tried to use unescape-markup to clean up some ugly HTML code. I have TagSoup installed and on my classpath. The step itself works fine except for one detail: the non-ASCII characters. The HTML file is windows-1252 encoded. I've tried to use the "charset" attribute, and I've tried 'content-type="text/html; charset=windows-1252"', but nothing helped. Is the charset attribute working? Am I doing something wrong? I'm using the newest Calabash release with the following XProc: 
Received on Friday, 17 July 2009 07:08:02 GMT