
Is this expected behaviour?

From: Richard Fine <richard@gamedev.net>
Date: Thu, 02 Jul 2009 23:19:51 +0100
Message-ID: <4A4D3287.1020900@gamedev.net>
To: html-tidy@w3.org
Hello everyone,

I'm building an input sanitizer around Charles Reitzel's Tidy.NET 
bindings; I'm using it to take potentially-malformed XHTML + 
proprietary-namespaced tag soup and produce some kind of valid XML from 
it. It's working OK so far, but one of my test cases surprised me: the 
output is valid XML, but it's not what I was expecting.

Starting from the default options, I explicitly turn on:

input-xml
output-xml
force-output

and give it the input string:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
<i<i>world!</b></body></html>

Note the '<i<i>' construct before 'world!'. I was expecting the output:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
&lt;i<i>world!</i></b></body></html>

whereby the first < in the <i<i> is encoded as an entity. Instead, what 
I'm getting is:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
<i i="">world!</i></b></body></html>

The <i<i> is becoming <i i="">. How come?
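
To pin down the expected behaviour, here is a minimal Python sketch of just the escaping step: any '<' that does not open a complete tag is rewritten as &lt;. This is only an illustration of the expectation, not a description of how Tidy works internally. The helper name escape_stray_lt and the regular expression are assumptions made for the example, and the pattern deliberately ignores edge cases such as a '>' inside an attribute value, comments, and CDATA sections.

```python
import re

# A '<' counts as a tag opener only if it is followed by an optional '/',
# a letter, and then a closing '>' with no other '<' or '>' in between.
# Any other '<' is treated as stray text. (Illustrative assumption, not Tidy's rule.)
STRAY_LT = re.compile(r"<(?!/?[A-Za-z][^<>]*>)")


def escape_stray_lt(markup: str) -> str:
    """Replace every '<' that does not open a complete tag with '&lt;'."""
    return STRAY_LT.sub("&lt;", markup)


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, '
    "<i<i>world!</b></body></html>"
)
print(escape_stray_lt(sample))
```

With the test input above, the first '<' of '<i<i>' comes out as &lt; and the second one still opens an <i> element. The sketch does not add the missing </i>; balancing the tags would still be left to Tidy.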

(I have a suspicion that the TidyATL/Tidy.NET packages on Charles' page 
are a bit out of date - they're certainly missing some of the more 
recent options - so if this is a bug that's already been fixed or 
something, I apologise for wasting your time...)

Thanks in advance,

- Richard
Received on Friday, 3 July 2009 09:33:06 GMT
