Re: HTML tidy service (and non-latin/utf-8 pages)

Dmitry Baranovskiy wrote:
> Hi Dan,
> 
> I was using your wonderful service to convert HTML to XML for my new 
> microformats parser, as Brian Suda does in his amazing X2V parser. 
> I have received feedback that it doesn't work properly with UTF-8 
> encoded pages, i.e. it is unsuitable for people using non-Latin scripts.

Hmm... how unfortunate... let's see... looking at the source...

http://dev.w3.org/cvsweb/2000/tidy-svc/

I see some changes in November 2005 around latin-1 and utf-8...
I'm not able to see at a glance exactly how they work, but
they seem to look at the charset of the source document...
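
For anyone following along, here is a rough sketch of what "looking at
the charset of the source document" could mean. This is not the actual
tidy-svc code; the function name and the iso-8859-1 fallback are my own
assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the lowercased charset parameter of a Content-Type value.

    Hypothetical helper, not taken from tidy.py. The fallback value is
    an assumption (the historical HTTP default for text/* types).
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the charset parameter and
    # returns the fallback when the parameter is absent.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

If a service takes the fallback branch for a page that is really UTF-8,
everything downstream works from the wrong encoding.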

> 
> Check this out:    the page: http://www.sup.com/management.html

That looks right...

Content-Type: text/html; charset=UTF-8


>                 the result: 
> http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.sup.com%2Fmanagement.html 

Ugh... what a mess. Odd... Firefox says it's UTF-8, but clearly
it got mangled somehow, even with the "force XML" option set.

I'm not sure what's going on. Dom might have some clues, but he's
probably not in the office for the next month or so.

I'm not sure if the problem is in the tidy.py script, or in tidy
itself, or something else.
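
One common way for this kind of mess to arise, and I have not confirmed
it is what's happening here, is that UTF-8 bytes get decoded as latin-1
somewhere along the way. A small sketch of the symptom; the function
names and sample strings are made up for illustration:

```python
def mangle(text):
    """Encode text as UTF-8, then wrongly decode the bytes as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def unmangle(text):
    """Undo mangle(): recover the UTF-8 bytes and decode them correctly."""
    return text.encode("latin-1").decode("utf-8")


# A Latin-script sample: only the accented letter is damaged.
print(mangle("café"))  # prints "cafÃ©"

# A hypothetical Cyrillic sample: every letter is two bytes in UTF-8,
# so the mangled string is twice as long and entirely unreadable.
sample = "Дмитрий"
garbled = mangle(sample)
print(len(sample), len(garbled))
print(unmangle(garbled) == sample)
```

If the output of the service round-trips through something like
unmangle(), that would point at a single wrong decode step rather than
at data loss inside tidy.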

Here's hoping for some time to study it further... or some help
from somebody else who can spot the problem.

> Any chance of fixing this problem? It makes my work useless, as well 
> as Brian's.
> 
> Thank you,
> Dmitry Baranovskiy
> http://dmitry.baranovskiy.com/


-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Wednesday, 19 September 2007 14:57:01 UTC