- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 13 Dec 2001 15:34:27 +0900
- To: Michael Everson <everson@evertype.com>, Nick Kew <nick@webthing.com>
- Cc: <www-validator@w3.org>
Hello Michael,

In general, I agree with Nick. A few more comments below.

At 13:19 01/12/11 +0000, Michael Everson wrote:
>At 20:49 +0000 2001-12-10, Nick Kew wrote:
>>
>>> I have a lot of pages with a few Latin 1 (non ASCII) characters in
>>> them. I want to convert them all to UTF-8. This isn't always
>>> straightforward.
>>
>>Won't iconv do it?
>
>What is that?

A code conversion program. Doing code conversion by hand is a bad idea.
It's standard on Unix, and so you probably get it when you upgrade
to OS X.

>>> "Sorry, I am unable to validate this document because on line 63 it
>>> contained some byte(s) that I cannot interpret as utf-8. Please check
>>> both the content of the file and the character encoding indication. "
>>
>>That'll be when the parser refuses your document outright because
>>it's incompatible with your declared charset. It also means that
>>the source is (technically at least) too broken even to try and
>>display.
>
>Oh, come on! I declare UTF-8, grand. In plain text, UTF-8 looks like ASCII
>with some Latin-1 characters in it in pairs on triplets.

No, it doesn't. You confuse Latin-1 (a character encoding) and
plain text. UTF-8 is as much plain text as anything else.

>What's wrong with the Validator is that if even ONE of the UTF-8
>characters wasn't turned from a single Latin-1 character into a pair of
>Latin-1 characters, then it chokes.

No, it doesn't choke on characters, it chokes if there are byte
sequences that do not conform to UTF-8.

>Now my browsers display it easily enough.

If you say in your Web page that it's UTF-8, but you have bytes
corresponding to single Latin-1 characters, and your browser displays
that, you should change your browser. (I suggest Netscape 6, it
displays a nice black diamond with a white question mark inside
(the glyph given in Unicode 3.0 at U+FFFD, for invalid byte sequences.))

>>> But the Validator is broken.

The validator is a validator, not a source displaying tool.
It may not have all the bells and whistles that you might want
it to have, but it's not broken.

>>> It doesn't display the source, and so I
>>> have NO IDEA how to find line 63.
>>
>>Erm - open your document in a text editor?
>
>My editors wrap lines and things. They don't number them. One can't always
>see the single broken character easily.

Then change your editor. There are many different text editors
available for the Mac.

>The point is that it is extremely useful for all the other validation
>processes, where the numbered lines are listed and the little ^^ carets
>show you where the error is. But on the UTF-8 check this useful source
>display doesn't happen,

Yes, this is because the UTF-8 check is done completely differently
and before the rest of the validation.

>and that's what I would like you good folks to fix.

I'll accept this as a feature request and will see how easy it is
to fix. But I don't think it has high priority.

>>BTW: do you have a need to convert, or is this an exercise?
>
>Yes, I am working on converting my whole site. My showpiece is
>http://www.evertype.com/standards/iso15924/document/scriptbib.html. There
>is a lot more than Latin 1 in that.

iconv and other conversion tools should be able to deal with that.
If you want to convert numeric character references (&#X...;) to
pure UTF-8, you should have a look at charlint
http://www.evertype.com/standards/iso15924/document/scriptbib.html.
I'm sure there is perl for the Mac.

>By the way, by way of introduction, I'm one of the authors of the Unicode
>Standard,

Of course I know. Michael, I suggest you have a look at UTR #17,
http://www.unicode.org/unicode/reports/tr17/, to help you
distinguish bytes and characters :-).

Regards, Martin.
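
P.S. For anyone stuck with an editor that does not number lines, here
is a minimal sketch of the two operations discussed in this message:
finding the first byte sequence that does not conform to UTF-8, and
converting a Latin-1 file to UTF-8. It is an illustration in Python,
not what the validator or iconv actually do internally. The function
names find_invalid_utf8 and latin1_to_utf8 are made up for this
example, and the column reported is a byte offset within the line,
not a character position.

```python
from typing import Optional, Tuple


def find_invalid_utf8(data: bytes) -> Optional[Tuple[int, int, bytes]]:
    """Return (line, byte_column, offending_bytes) for the first
    sequence that is not valid UTF-8, or None if the data is valid.

    Line and column are 1-based; the column counts bytes, not characters.
    (Hypothetical helper, named only for this sketch.)
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the newlines before the bad byte to get the line number.
        line = data.count(b"\n", 0, err.start) + 1
        # rfind returns -1 when there is no earlier newline, so the
        # start of the line is then offset 0.
        line_start = data.rfind(b"\n", 0, err.start) + 1
        column = err.start - line_start + 1
        return line, column, data[err.start:err.end]
    return None


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (every byte value is valid Latin-1)."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    # First line is valid UTF-8; second line holds a lone Latin-1 byte 0xEF.
    sample = b"caf\xc3\xa9\nna\xefve\n"
    print(find_invalid_utf8(sample))
    print(latin1_to_utf8(b"na\xefve"))
```

Note that latin1_to_utf8 should only be run on a file that really is
Latin-1 throughout. Run on a file that is already partly UTF-8, it
will double-encode the characters that were correct.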
Received on Thursday, 13 December 2001 01:34:36 UTC