
unicode characters not showing in output

From: asllearner <old.nabble.99.kyoto@spamgourmet.com>
Date: Mon, 15 Mar 2010 05:55:18 -0700 (PDT)
Message-ID: <27874747.post@talk.nabble.com>
To: html-tidy@w3.org
This is a little complicated, but I will try to explain as clearly as I can.
The basic problem is that when I run tidy on files containing Japanese UTF-8
characters, the output looks like garbage and I get an "invalid UTF-8 char
code" warning. I am not sure what settings I should use to make the Japanese
characters come out correctly.
Here are the details:
I am using a Japanese computer with Windows XP.
I have been using EditPlus 3 and EditPad Pro as my editors, together with
tidy-gui, though I also have the same problem when I use the HTML-Kit tools.
When I type Japanese characters in my HTML file, such as ひらがな
(hiragana) or 漢字 (kanji), I can read the characters fine in my editor, and
they appear to be in UTF-8 encoding (if I open the file as UTF-8, I can read
the text).
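One way to double-check that a file really is UTF-8 on disk, rather than trusting how the editor displays it, is to try a strict decode of its raw bytes. The following is only a sketch in Python: the function name is made up for illustration and is not part of Tidy or any of the editors mentioned, and Shift_JIS is used as the contrast case purely as an assumption about what a Japanese Windows system might save by default.

```python
import codecs

def describe_encoding(data: bytes) -> str:
    """Hypothetical helper: report whether raw bytes decode as strict UTF-8."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")  # strict mode raises on any invalid byte sequence
    except UnicodeDecodeError:
        return "not valid UTF-8"
    return "valid UTF-8 with BOM" if has_bom else "valid UTF-8 without BOM"

sample = "ひらがな"
utf8_result = describe_encoding(sample.encode("utf-8"))
sjis_result = describe_encoding(sample.encode("shift_jis"))
```

For a real file, the bytes would come from reading it in binary mode, for example open(path, "rb").read().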
When I tell tidy that the charset is UTF-8, or that the input and output
encodings are both UTF-8, I get the error. The output is garbage, as far as
I can tell.
Here is a specific example:
input: ひらがな (hiragana; code points U+3072 U+3089 U+304C U+306A)
output:a?2a??a??a?a 
warning: replacing invalid utf-8 char code U+0081
Note that the character code in the warning (U+0081) is not one of the code
points in the string that was actually passed in!
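A possible way to see where a value like 0x81 could come from is to look at the UTF-8 bytes of the input string. The sketch below (Python, with variable names chosen only for illustration) encodes the four hiragana and then deliberately misreads the bytes as Latin-1, one character per byte. That misreading is an assumption used to show the effect, not a confirmed description of what Tidy does internally.

```python
# The four code points quoted in the example above.
text = "\u3072\u3089\u304c\u306a"  # ひらがな

# Each of these characters takes three bytes in UTF-8.
utf8_bytes = text.encode("utf-8")
hex_bytes = [f"{b:02X}" for b in utf8_bytes]

# Assumption for illustration: treat every byte as its own character
# (Latin-1 maps byte 0xNN directly to code point U+00NN).
misread = utf8_bytes.decode("latin-1")
has_u0081 = "\u0081" in misread

print(" ".join(hex_bytes))
print(len(misread), has_u0081)
```

The byte 0x81 appears as the second byte of three of the four characters, and the misread string is 12 characters long, the same length as the garbled output shown above. That is at least consistent with the UTF-8 bytes being handled somewhere as single-byte characters.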

I have tried various combinations of input and output encoding, including
raw, with the other parameters left at their defaults. If I need to be more
explicit here, I will.

Any help troubleshooting would be greatly appreciated.

thanks



Received on Tuesday, 16 March 2010 15:43:07 GMT
