- From: Dan Connolly <connolly@w3.org>
- Date: Wed, 19 Sep 2007 15:57:18 +0100
- To: Dmitry Baranovskiy <dmitry.baranovskiy@gmail.com>
- Cc: dom@w3.org, www-archive@w3.org
Dmitry Baranovskiy wrote:
> Hi Dan,
>
> I was using your wonderful service to convert HTML to XML for my new
> microformats' parser as well as Brian Suda uses in his amazing X2V
> parser. In feedback I received that it doesn't work properly with UTF
> encoded page, i.e. it is unsuitable for not "Latin" people.

Hmm... how unfortunate...

let's see... looking at the source...
http://dev.w3.org/cvsweb/2000/tidy-svc/

I see some changes in November 2005 around latin-1 and utf-8...
I'm not able to see at a glance exactly how they work, but
they seem to look at the charset of the source document...

>
> Check this out: the page: http://www.sup.com/management.html

That looks right...
Content-Type: text/html; charset=UTF-8

> the result:
> http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.sup.com%2Fmanagement.html

Ugh... what a mess.

Odd... firefox says it's UTF-8, but clearly it got mangled
somehow, even with the "force XML" option set.

I'm not sure what's going on. Dom might have some clues, but
he's probably not in the office for the next month or so.

I'm not sure if the problem is in the tidy.py script, or in
tidy itself, or something else.

Here's hoping for some time to study it further... or some
help from somebody else who can spot the problem.

> Any chance to fix this problem? It makes my work useless as well as work
> of Brian.
>
> Thank you,
> Dmitry Baranovskiy
> http://dmitry.baranovskiy.com/

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
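
[Illustration, not part of the original message.] The cause of the mangling was not identified above. As a minimal sketch of one common way this kind of damage arises, the Python below decodes UTF-8 bytes once using the charset from the Content-Type header and once as latin-1. The helper names `charset_from_content_type` and `decode_body` are hypothetical and are not taken from tidy.py, and the latin-1 fallback is only an assumption for the demonstration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="latin-1"):
    """Return the charset parameter of a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# "Dmitry" in Cyrillic, written with escapes so the source stays ASCII.
sample = "\u0414\u043c\u0438\u0442\u0440\u0438\u0439"
raw = sample.encode("utf-8")

# Honouring the declared charset round-trips the text.
good = decode_body(raw, "text/html; charset=UTF-8")

# Treating the same bytes as latin-1 never raises, but yields two
# garbage characters for every two-byte UTF-8 sequence.
bad = raw.decode("latin-1")

print(good == sample)
print(len(sample), len(bad))
```

If output from the service shows roughly twice as many characters as the source for non-ASCII text, a decode step of this kind somewhere in the pipeline would be one place to look, though other causes are possible.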
Received on Wednesday, 19 September 2007 14:57:01 UTC