
Re: greek char in UTF-8 (part 2)

From: Chris Lilley <chris@w3.org>
Date: Mon, 08 May 2000 22:35:00 +0200
Message-ID: <391724F4.9D226CC7@w3.org>
To: Guy Teasdale <Guy.Teasdale@bibl.ulaval.ca>
CC: www-international@w3.org


Guy Teasdale wrote:
> 
> Thank you to all the persons who took their time to share their thoughts
> and advice on my problem concerning the transmission of Greek characters
> in UTF-8.
> 
> I still have a couple of questions and comments.
> 
> First, what is the recommended Formal Public Identifier for the charset ?
> 
> Is it:
> <meta http-equiv="Content-Type" content="text/html;
> charset=UNICODE-1-1-UTF-8">
> or
> <meta http-equiv="Content-Type" content="text/html;  charset=UTF-8">  ??

The latter.

> Second, I had to abandon the idea of embedding a font using WEFT font
> embedder, because, as I understand it, it works only with Internet Explorer
> and I must insure that I have the most interoperable document, which will
> display well on any platform, anywhere.

Well, it doesn't *break* any browser that doesn't support that font format.
It just adds more font choices for implementations that do support the W3C
WebFonts part of CSS2 and also support that particular font format.
> 
> >There are a couple of odd things, firstly the doctype is wrong (this isn't
> >a frameset) and secondly, there is absolutely no need to use UTF-8 for this
> >content, because in fact it only uses ASCII. All other characters, such as
> >accented letters and Greek, are done using entities or NCRs.
> 
> Chris, you were right, my doctype was wrong; only the first file
> ( http://www.bibl.ulaval.ca/doelec/theses/memoires/1999/ChRiviere/riv.htm )
> is a frameset, I corrected the others. Secondly, *SOME* of my greek
> characters are not ASCII, otherwise they would all display well in any
> font, which is not the case. 

But you don't have any Greek characters in there, that I could see. You
have some SGML code, written in ASCII, which will cause Greek characters to
appear in the parsed document.
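
To make that concrete, here is a minimal Python sketch. The sample string is
hypothetical (it is not taken from the thesis file), and html.unescape from
the standard library stands in for the reference-resolving step of a real
parser.

```python
import html

# Hypothetical snippet: the source text is plain ASCII markup.
source = "&alpha;&#946;&#233;"

# Every character in the source is ASCII, so encoding it as ASCII or as
# UTF-8 gives exactly the same bytes.
ascii_only = source.isascii()
same_bytes = source.encode("ascii") == source.encode("utf-8")

# Resolving the references is what produces the non-ASCII characters.
parsed = html.unescape(source)

print(ascii_only, same_bytes, parsed)
```

The Greek and accented letters exist only in the parsed result, never in the
bytes that were transmitted.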

> Concerning NCR (Numeric character reference, I
> presume), if you look at the source of my file, you will see some NCR that
> are below 256 (by ex. &#233; ) thus, ISO-Latin; but many are over 256, thus
> unicode. So I must send this in UTF-8, must I ?

No. This is a critical distinction. First, there is the encoding used to
transmit the document - in your case, only ASCII is used. Second, there is
the document character set, which in the case of HTML and XML is Unicode.

All NCRs (yes, you have the correct expansion of NCR; sorry, I should have
spelled it out in full on first usage) refer to the Document Character Set,
not the encoding used in transmission. So if I send a file in Shift-JIS or
8859-7 or whatever, and talk about character &#235; it means "ë" always,
regardless of what code point 235 happens to be in the encoding used in
transmission.
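
A small Python sketch of the same point, assuming the standard library codecs
shift_jis, iso8859_7 and iso8859_1 as stand-ins for the transmission
encodings, and html.unescape as a stand-in for the parser's NCR handling:

```python
import html

reference = "&#235;"

# The reference is written in ASCII characters, so each of these
# transmission encodings carries it unchanged.
encodings = ["shift_jis", "iso8859_7", "utf-8"]
resolved = [
    html.unescape(reference.encode(enc).decode(enc)) for enc in encodings
]

# By contrast, the raw byte 235 depends entirely on the transmission encoding.
raw = bytes([235])
as_latin1 = raw.decode("iso8859_1")
as_greek = raw.decode("iso8859_7")

print(resolved, as_latin1, as_greek)
```

The NCR resolves to U+00EB in all three cases, while the single byte 235 is
e-diaeresis in 8859-1 but lambda in 8859-7.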

> My problem was not to display the text on MY computer, my problem was to
> insure that the text will be displayed the most efficiently in other
> computers. In fact, I had trouble finding a computer which couldn't display
> my document correctly (fortunately a secretary near my desk has an old
> computer ;- )    )

Yes, I only succeeded in getting the document to display incorrectly by
using Opera, which has a well-known deficiency in that area.

> I really appreciate your explanation: if I copy my text from the web and
> paste it in an application that can display it, I will see the correct text
> because the text will have been correctly transmitted.
> Like Andrew suggests, installing the Pan-European language support gives us
> more up to date versions of Fonts which  will then be available to other
> applications
> I think, following these discussions that I will continue to use UTF-8 to
> encode the texts which are not only ASCII because, if these texts are not
> always displayable now they will be in the near future, as softwares and OS
> are being upgraded. (hopefully)

Yes. But then, you don't need to use entities or NCRs to transmit the Greek
characters - you can just type them directly.
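
As a rough illustration in Python, with a hypothetical sample word written as
escapes so the code points are unambiguous, here is the same text sent as
directly typed UTF-8 and as NCRs in an ASCII-only source:

```python
# The Greek word "logos", given as escapes so the code points are explicit.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

# Typed directly and sent as UTF-8: two bytes per Greek letter here.
direct = word.encode("utf-8")

# The same word as decimal NCRs in an ASCII-only source.
as_ncrs = "".join(f"&#{ord(ch)};" for ch in word).encode("ascii")

print(len(direct), len(as_ncrs))
print(direct.decode("utf-8") == word)
```

Both forms give the same parsed text; the directly typed form is simply
shorter and readable in the source.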

> One last question,
> As most of my text is ISO-LATIN and -- if I remember well the stats
> produced by Weft -- only 228 characters are UTF-8. 

Actually, none of the characters are UTF-8 in your sample. However, I am
prepared to believe that there are 228 characters with Unicode code points
above 255.
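
A count of that kind could be sketched in Python as follows. The sample
string is hypothetical and the regular expression covers only decimal and
hexadecimal NCRs, not named entities; "above 255" here means outside ISO
Latin-1.

```python
import re

# Hypothetical sample, not taken from the actual thesis file.
sample = "r&#233;sum&#233; &#945;&#946;&#947; &#x3b4;"

# Collect decimal and hexadecimal numeric character references.
ncr_pattern = r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));"
values = [
    int(hex_digits, 16) if hex_digits else int(dec_digits)
    for hex_digits, dec_digits in re.findall(ncr_pattern, sample)
]

# References that point outside ISO Latin-1.
beyond_latin1 = [v for v in values if v > 255]

# Characters in the source itself that are not ASCII (none in this sample).
non_ascii_in_source = [ch for ch in sample if ord(ch) > 127]

print(len(values), len(beyond_latin1), len(non_ascii_in_source))
```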

> Is it preferable to
> send anyway all the files using UTF-8 or is it possible to embed in the
> text a coding commanding a switch to UTF-8 in a specific portion of the
> text ? (something like:  <DIV charset="UTF8> is it possible ???)

No. Firstly, because the parser has to know what the charset is before it
starts, and interpretation of the various element names happens after
parsing. And also because it isn't really necessary, confers no benefit,
and would be really messy to specify.
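
To see why the charset has to be settled before parsing, here is a short
Python sketch with a hypothetical fragment containing one directly typed
Greek letter. The bytes are decoded before any element is looked at, and the
choice of charset decides what text the parser receives.

```python
# A hypothetical fragment with one directly typed Greek letter (lambda).
fragment = "<p>\u03bb</p>".encode("utf-8")

# Decoding happens before any element is interpreted.
right = fragment.decode("utf-8")
wrong = fragment.decode("iso8859_1")

print(right)
print(wrong)
```

The ASCII markup survives either way, but with the wrong charset the lambda
arrives as two unrelated Latin-1 characters.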

--
Chris
Received on Monday, 8 May 2000 16:35:15 GMT
