Re: UTF-8

Hello Michael,

In general, I agree with Nick. A few more comments below.

At 13:19 01/12/11 +0000, Michael Everson wrote:
>At 20:49 +0000 2001-12-10, Nick Kew wrote:
>>
>>>  I have a lot of pages with a few Latin 1 (non ASCII) characters in
>>>  them. I want to convert them all to UTF-8. This isn't always
>>>  straightforward.
>>
>>Won't iconv do it?
>
>What is that?

iconv is a code conversion program. It's standard on Unix, and
so you probably get it when you upgrade to OS X. Doing code
conversion by hand is a bad idea.
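
In case iconv is not at hand, here is a minimal sketch of the same
conversion in Python. It only illustrates the idea and is not what iconv
does internally; the function name latin1_to_utf8 is made up for this
example, and the code assumes the input really is all Latin-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 (ISO-8859-1) bytes as UTF-8."""
    # Every byte value is a valid Latin-1 character, so this decode never
    # fails. That is also why the source encoding has to be known: bytes
    # in some other encoding would be converted silently and wrongly.
    text = data.decode("iso-8859-1")
    return text.encode("utf-8")


if __name__ == "__main__":
    original = b"caf\xe9"          # "café" in Latin-1: a single byte for é
    converted = latin1_to_utf8(original)
    print(converted)               # é is now the two bytes C3 A9
```

With iconv itself, the equivalent would be something like
iconv -f ISO-8859-1 -t UTF-8 old.html > new.html (the file names are
placeholders).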


>>  > "Sorry, I am unable to validate this document because on line 63 it
>>>  contained some byte(s) that I cannot interpret as utf-8. Please check
>>>  both the content of the file and the character encoding indication. "
>>
>>That'll be when the parser refuses your document outright because
>>it's incompatible with your declared charset.  It also means that
>>the source is (technically at least) too broken even to try and
>>display.
>
>Oh, come on! I declare UTF-8, grand. In plain text, UTF-8 looks like ASCII 
>with some Latin-1 characters in it in pairs or triplets.

No, it doesn't. You confuse Latin-1 (a character encoding) and
plain text. UTF-8 is as much plain text as anything else.


>What's wrong with the Validator is that if even ONE of the UTF-8 
>characters wasn't turned from a single Latin-1 character into a pair of 
>Latin-1 characters, then it chokes.

No, it doesn't choke on characters, it chokes if there are
byte sequences that do not conform to UTF-8.
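
To make the distinction concrete, here is a small Python sketch that
reports where the first non-conforming byte sits in a file. It is not the
validator's code; the function name first_invalid_utf8 is hypothetical,
and lines are assumed to end in LF or CRLF (a file with bare CR line
endings would be counted as a single line).

```python
def first_invalid_utf8(data: bytes):
    """Return (line, column, offset) of the first byte that is not part of
    a well-formed UTF-8 sequence, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start                        # 0-based byte offset
        line = data.count(b"\n", 0, offset) + 1   # 1-based line number
        last_newline = data.rfind(b"\n", 0, offset)
        column = offset - last_newline            # 1-based, in bytes
        return line, column, offset
    return None


if __name__ == "__main__":
    # Line 2 holds a lone Latin-1 e-acute (byte E9), which is not valid UTF-8.
    sample = b"ok\ncaf\xe9\n"
    print(first_invalid_utf8(sample))
```

The column is counted in bytes, not characters, which is the point of the
whole discussion.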


>Now my browsers display it easily enough.

If you say in your Web page that it's UTF-8, but you have
bytes corresponding to single Latin-1 characters, and your
browser displays that, you should change your browser.
(I suggest Netscape 6. For invalid byte sequences it displays
a nice black diamond with a white question mark inside, the
glyph given in Unicode 3.0 at U+FFFD.)
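
The same behaviour can be sketched in Python, whose "replace" error
handler substitutes U+FFFD for bytes it cannot decode. This only
illustrates the principle and says nothing about how any browser is
implemented; decode_for_display is a made-up name.

```python
def decode_for_display(data: bytes) -> str:
    """Decode as UTF-8, showing U+FFFD where the bytes are not valid."""
    # errors="replace" inserts U+FFFD REPLACEMENT CHARACTER instead of
    # raising an exception on a malformed sequence.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # First word: a lone Latin-1 byte E9. Second word: proper UTF-8 (C3 A9).
    mixed = b"caf\xe9 and caf\xc3\xa9"
    shown = decode_for_display(mixed)
    print([hex(ord(ch)) for ch in shown if ord(ch) > 0x7F])
```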


>>  > But the Validator is broken.

The validator is a validator, not a source displaying tool.
It may not have all the bells and whistles that you might
want it to have, but it's not broken.


>>>  It doesn't display the source, and so I
>>>  have NO IDEA how to find line 63.
>>
>>Erm - open your document in a text editor?
>
>My editors wrap lines and things. They don't number them. One can't always 
>see the single broken character easily.

Then change your editor. There are many different
text editors available for the Mac.


>The point is that it is extremely useful for all the other validation 
>processes, where the numbered lines are listed and the little ^^ carets 
>show you where the error is. But on the UTF-8 check this useful source 
>display doesn't happen,

Yes, this is because the UTF-8 check is done completely differently
and before the rest of the validation.


>and that's what I would like you good folks to fix.

I'll accept this as a feature request and will see
how easy it is to fix. But I don't think it has high
priority.


>>BTW: do you have a need to convert, or is this an exercise?
>
>Yes, I am working on converting my whole site. My showpiece is 
>http://www.evertype.com/standards/iso15924/document/scriptbib.html. There 
>is a lot more than Latin 1 in that.

iconv and other conversion tools should be able to deal with that.
If you want to convert numeric character references (&#X...;) to
pure UTF-8, you should have a look at charlint
http://www.w3.org/International/charlint/.
I'm sure there is perl for the Mac.
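
To show what replacing numeric character references involves, here is a
rough Python sketch. It is not how charlint works internally and covers
far less; the name expand_ncrs is hypothetical, and I have assumed that
references to markup-significant characters (&, <, >, ") should stay
escaped.

```python
import re

# Matches decimal (&#946;) and hexadecimal (&#x3B1; or &#X3B1;) references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Characters that must stay escaped so the markup is not changed.
_KEEP_ESCAPED = {0x22, 0x26, 0x3C, 0x3E}


def expand_ncrs(text: str) -> str:
    """Replace numeric character references with the characters themselves."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code <= 0xDFFF
        if code == 0 or code > 0x10FFFF or is_surrogate or code in _KEEP_ESCAPED:
            return match.group(0)   # leave the reference untouched
        return chr(code)
    return _NCR.sub(repl, text)


if __name__ == "__main__":
    source = "&#x3B1;&#946; &#38; more"
    utf8_bytes = expand_ncrs(source).encode("utf-8")
    print(utf8_bytes)
```

The text still has to be written out as UTF-8 afterwards, as the last
lines of the sketch do.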


>By the way, by way of introduction, I'm one of the authors of the Unicode 
>Standard,

Of course I know. Michael, I suggest you have a look at UTR #17,
http://www.unicode.org/unicode/reports/tr17/, to help you
distinguish bytes and characters :-).


Regards,    Martin.

Received on Thursday, 13 December 2001 01:34:36 UTC