Re: character problems

Chris Lilley (chris@w3.org)
Wed, 17 Jul 1996 12:52:06 +0200


Message-ID: <31ECC5D6.794B@w3.org>
Date: Wed, 17 Jul 1996 12:52:06 +0200
From: Chris Lilley <chris@w3.org>
To: Stephanos Piperoglou <stephanos@hol.gr>
CC: www-html@w3.org
Subject: Re: character problems

Stephanos Piperoglou wrote:

> Now NORMALLY, and sticking to standards, I can't even write HTML in Greek.

I agree that this has been a problem in the past, because HTML 1
standardised on the Latin-1 (ISO 8859-1) character set. (But hey, this
was an improvement on ASCII.) Even HTML 2.0, though, showed a clear
direction towards Internationalisation by making the document character
set Unicode (restricted in that version to the first 256 code positions,
i.e. the same as 8859-1 in practice).

The IETF HTML WG Internationalisation draft, now in last call, removes
this restriction; the full basic multilingual plane of Unicode is
available. Since HTML makes a distinction between the document
character set (i.e. the logical computational space for character
manipulation, such as resolving numeric character references) and the
character encoding used to transmit the document (as indicated by the
charset parameter), you can create Greek HTML pages which are correctly
labelled and conform to specifications. Just send them out with the MIME
type:

Content-Type: text/html; charset=iso-8859-7
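
To make the distinction concrete, here is a minimal Python sketch. Python
and the helper name greek_response are my assumptions for illustration;
nothing in the draft prescribes them. It reaches the same character two
ways: as a numeric character reference resolved against the document
character set, and as a single byte in the iso-8859-7 transfer encoding,
together with the header that labels it.

```python
import html

# Hypothetical helper: encode a page as ISO 8859-7 and build the matching label.
def greek_response(page: str):
    body = page.encode("iso-8859-7")          # character encoding on the wire
    headers = {
        "Content-Type": "text/html; charset=iso-8859-7",
        "Content-Length": str(len(body)),
    }
    return headers, body

# Numeric character references index the document character set (Unicode),
# whatever encoding the bytes travel in: &#913; is GREEK CAPITAL LETTER ALPHA.
alpha = html.unescape("&#913;")

headers, body = greek_response("<P>" + alpha + "</P>")
print(headers["Content-Type"])
print(body)   # the alpha travels as the single byte 0xC1
```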


For an example of this, see:

  http://www.alis.com:8085/demo/grec/ntua.html

Which is correctly labelled, look:

  bash$ telnet www.alis.com 8085
  Trying 207.81.28.7...
  Connected to www.alis.com.
  Escape character is '^]'.
  HEAD /demo/grec/ntua.html HTTP/1.0

  HTTP/1.0 200 Le document suit
  Date: Wed, 17 Jul 1996 10:36:07 GMT
  Server: NCSA/1.4
  Content-type: text/html; charset=iso-8859-7
  Content-Language: el
  Last-modified: Mon, 30 Nov 1987 01:19:08 GMT
  Content-length: 6316

  Connection closed by foreign host.
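
The same check can be scripted rather than typed into telnet. A rough
Python sketch, assuming only the standard library; charset_of is a name
I made up, and the live request is left commented out since it is only
illustrative and that server may not answer.

```python
from email.message import Message

def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None if unlabelled."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()   # lower-cased charset, or None

print(charset_of("text/html; charset=iso-8859-7"))  # labelled
print(charset_of("text/html"))                      # unlabelled

# Against a live server (not run here):
# import http.client
# conn = http.client.HTTPConnection("www.alis.com", 8085)
# conn.request("HEAD", "/demo/grec/ntua.html")
# print(charset_of(conn.getresponse().getheader("Content-Type", "")))
```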

> Official Athens College site:    http://www.gsc.net/hosted/athens_college/

Compare this with your server:

  bash$ telnet www.gsc.net 80
  Trying 204.57.142.57...
  Connected to www.gsc.net.
  Escape character is '^]'.
  GET /hosted/athens_college/ HTTP/1.0

  HTTP/1.0 200 OK
  Server: Netscape-Communications/1.1
  Date: Wednesday, 17-Jul-96 10:42:20 GMT
  Last-modified: Tuesday, 25-Jun-96 05:42:00 GMT
  Content-length: 1787
  Content-type: text/html        <== oops!

  <HEAD>
  [ ... stuff omitted ...]
  <P>Welcome to Athens College! These pages have been set up by the
  Athens College Computer Society in order to bring our fine institution
  to the Internet, but also to bring the Internet to the College. These
  pages are unfortunately not yet fully bilingual, so please click on the
  gate to proceed. If you can't display Greek characters with your
  browser, you might want to have a look <A
  HREF="http://users.hol.gr/~stephanos/greek.html">here</A>.</P>
  <TD WIDTH=50% VALIGN=top>
  <P>Σας καλωσορίζουμε στο Κολλέγιο Αθηνών! Οι σελίδες αυτές έχουν
  εγκατασταθεί από τον Όμιλο Υπολογιστών του Κολλεγίου Αθηνών για να
  φέρουν το ίδρυμα αυτό στο Internet, αλλά και για να φέρουν το Internet
  στο Κολλέγιο. Οι σελίδες δεν έχουν μεταφρασθεί ακόμα σε δύο γλώσσες,
  οπότε παρακαλούμε πατήστε στην πύλη για να συνεχίσετε:</P>
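
To see why the missing label matters, here is a small Python sketch (my
own illustration). It takes the first word of the Greek paragraph as the
three bytes the server sent (D3 E1 F2) and decodes them once as Latin-1,
the default a browser is entitled to assume for an unlabelled page, and
once as iso-8859-7.

```python
# The first word of the Greek paragraph, as raw bytes on the wire.
raw = b"\xd3\xe1\xf2"

as_latin1 = raw.decode("iso-8859-1")   # unlabelled: treated as Latin-1
as_greek = raw.decode("iso-8859-7")    # labelled: treated as Greek

print(as_latin1)   # accented Latin letters, i.e. garbage
print(as_greek)    # the intended Greek word
```

The bytes are identical in both cases; only the label tells the browser
which characters they stand for.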


For information on the different 8859 character sets, see:

  http://www.cs.tu-berlin.de/~czyborra/charsets/

Further details of I18N work:

 http://www.w3.org/pub/WWW/International/
 http://www.alis.com:8085/ietf/html/

> However if you have the correct font installed on your browser

Urgh. A font is an ordered collection of glyphs, the order being given
by the font's encoding vector. A character set is an ordered collection
of characters. Please do not create HTML pages containing garbage
characters by assuming a one-to-one ordered mapping between the two. On
the other hand, if you label your HTML then the appropriate character
set and font can be automatically selected by compliant browsers, even
if the font encoding vector does not match the character encoding used
to transmit the document.
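
A toy Python sketch of that separation, with everything in it invented
for illustration: the two "fonts" are just lists of glyph names in
different orders, standing in for real encoding vectors. The lookup goes
from byte to character to glyph by name, so a glyph's position in the
font never has to match the byte value.

```python
import unicodedata

# Two hypothetical fonts holding the same glyphs in different orders.
font_a = ["GREEK SMALL LETTER ALPHA", "GREEK SMALL LETTER BETA",
          "GREEK SMALL LETTER GAMMA"]
font_b = ["GREEK SMALL LETTER GAMMA", "GREEK SMALL LETTER ALPHA",
          "GREEK SMALL LETTER BETA"]

def glyph_indices(data, charset, font):
    """Map transmitted bytes to glyph positions in `font` via character identity."""
    text = data.decode(charset)                                 # bytes -> characters
    return [font.index(unicodedata.name(ch)) for ch in text]    # characters -> glyphs

data = b"\xe1\xe2\xe3"   # alpha, beta, gamma in iso-8859-7
print(glyph_indices(data, "iso-8859-7", font_a))
print(glyph_indices(data, "iso-8859-7", font_b))
```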

Fonts are one component of an I18N solution, not the whole answer.

> Netscape 3.0b4 and later
> supports iso-8859-7 (greek) character sets (so hurrah, though I have no
> idea how to make my pages recognizable as Greek by Netscape...  unless
> every user does Options > Document Encoding > Greek once he meets my page).

Send your documents out labelled as I said above, then Netscape (and
other browsers) will recognise them and switch character sets and fonts
for you automatically.

> Even the newest versions of Netscape under Windows 95 won't let you enter
> non-english characters in forms!

Yes, adding an accept-charset attribute to form input fields was another
thing that the Internationalisation draft did. Then you can create a
form that accepts Greek, you type in Greek, it gets sent to the server
CGI script correctly labelled.
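
What this looks like at the CGI end can be sketched in Python. This is
only an illustration under assumptions: the field name q is made up, and
I assume the script already knows from the labelling that the submission
is in iso-8859-7. The standard library's urllib.parse functions take an
encoding argument that fits this kind of decoding.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would submit for a Greek value in a form encoded as iso-8859-7.
query = urlencode({"q": "Αθήνα"}, encoding="iso-8859-7")
print(query)

# Server side: decode the percent-escaped bytes with the labelled charset.
fields = parse_qs(query, encoding="iso-8859-7")
print(fields["q"][0])

# Decoding with the wrong charset silently yields different characters.
wrong = parse_qs(query, encoding="iso-8859-1")
print(wrong["q"][0])
```

Without the label the script has no way to tell the last two cases
apart, which is the whole point of sending it.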

Have a look at the Tango browser, which implements the I18N
specification. See:

http://www.alis.com/

I have no connection with the Alis company, just pleased to see another
step towards a World Wide Web.

-- 
Chris Lilley, W3C                          [ http://www.w3.org/ ]
http://www.w3.org/people/chris/                       INRIA/W3C
chris@w3.org                       2004 Rt des Lucioles / BP 93
+33 93 65 79 87            06902 Sophia Antipolis Cedex, France