Is this expected behaviour?

Hello everyone,

I'm building an input sanitizer around Charles Reitzel's Tidy.NET 
bindings; I'm using it to take potentially-malformed XHTML + 
proprietary-namespaced tag soup and produce some kind of valid XML from 
it. It's working OK so far but I'm a bit surprised by one of my 
testcases - the output is valid XML but it's not what I was expecting it 
to be.

From the default options, I explicitly turn on:

input-xml
output-xml
force-output

and give it the input string:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
<i<i>world!</b></body></html>

Note the '<i<i>' construct before 'world!'. I was expecting the output:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
&lt;i<i>world!</i></b></body></html>

whereby the first < in the <i<i> is encoded as an entity. Instead, what 
I'm getting is:

<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, 
<i i="">world!</i></b></body></html>

In other words, the <i<i> is becoming <i i="">. How come?
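
To make the difference between the two outputs concrete, here is a minimal Python sketch that parses both strings and reports where the stray "<i" ended up. It uses only the standard library and does not call Tidy at all; the helper name `describe` is made up for illustration, and I'm assuming the line-wrapped strings above are really single lines with one space after "Hello,".

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's "{uri}tag" notation.
XHTML = "{http://www.w3.org/1999/xhtml}"

# The output I was expecting: the first '<' escaped as an entity.
expected = ('<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, '
            '&lt;i<i>world!</i></b></body></html>')

# The output I actually get: the stray 'i' turned into an empty attribute.
actual = ('<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, '
          '<i i="">world!</i></b></body></html>')


def describe(xml_text):
    """Return (text inside <b> before <i>, attributes of <i>, text of <i>)."""
    root = ET.fromstring(xml_text)
    b = root.find(f"{XHTML}body/{XHTML}b")
    i = b.find(f"{XHTML}i")
    return b.text, dict(i.attrib), i.text


expected_view = describe(expected)
actual_view = describe(actual)
print(expected_view)
print(actual_view)
```

Both strings parse as well-formed XML, but in the expected one the "<i" survives as character data inside <b>, while in the actual one it has been absorbed into the <i> element as an attribute named i.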

(I have a suspicion that the TidyATL/Tidy.NET packages on Charles' page 
are a bit out of date - they're certainly missing some of the more 
recent options - so if this is a bug that's already been fixed or 
something, I apologise for wasting your time...)

Thanks in advance,

- Richard

Received on Friday, 3 July 2009 09:33:06 UTC