Re: International chars in HTML files-(LONG)

Dave Salovesh (darsal@tezcat.com)
Wed, 17 Jan 1996 11:35:58 GMT


From: darsal@tezcat.com (Dave Salovesh)
To: www-html@w3.org, darsal@tezcat.com
Subject: Re: International chars in HTML files-(LONG)
Date: Wed, 17 Jan 1996 11:35:58 GMT
Message-Id: <30fcc030.12181291@mail.tezcat.com>
In-Reply-To: <v02120d03ad21dbc4b9f8@[205.149.180.135]>

I've done a bit of research since this came up, and I felt like summarizing
the current (17/Jan/1996) state of things.  A long summary?  Yes, due to
attached excerpts.  Oh, well.

Needless to say, I live in the wrong part of the world to find this type of
information on any nearby sites.  Why would all the US-ASCII folks want to
know any of this? :-)  This thread could also be called 'What a difference a
bit makes!'

1) HTML uses ISO-8859-1, an 8-bit character set (codes 0-255), by default.
8859-1 is also the current default for HTTP, so HTML documents served over
HTTP may use the full 8859-1 set.  There is no need to use codes or entity
names (7-bit expressions) for 8859-1 characters, within the limits of your
text editor and keyboard.

2) Codes or names -must- be used to replace characters which would otherwise
be interpreted as mark-up.  There are four [<>&"], and they conform to ISO
standards for their codes and names.  Other codes or names from 8859-1 may
be used to avoid similar confusion, e.g., [/\-_].

3) Either the server or the browser may be responsible for converting codes
or names into 8859-1 (or other) characters.  In practice, 7-bit expressions
seem to be passed to the browser for conversion, but I couldn't verify this.

4) Other parts of the Internet may not accept 8-bit data.  Mail and FTP
(unless specified as binary) use 7 bits, codes 0-127.  This is almost always
the US-ASCII char set, but not universally.  You can spare yourself some
mystery by always using a 7-bit expression for an 8-bit 8859-1 character.
This is clearly the safer choice for English in the US.  It is less of an
issue for documents in other languages like French or Spanish, where support
for 8859-1 is more likely to be found throughout the system, one would hope.
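
Points 2 and 4 can be sketched in a few lines of Python.  This is only an
illustration, not anything taken from the specs: the names escape_markup and
to_seven_bit are my own, and I'm assuming the text really is 8859-1, so every
character code fits in 0-255.

```python
# Hypothetical helpers, named here for illustration only.

# The four characters from point 2 that would otherwise read as mark-up.
MARKUP_ENTITIES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}


def escape_markup(text: str) -> str:
    """Replace the four mark-up characters with their entity names."""
    return "".join(MARKUP_ENTITIES.get(ch, ch) for ch in text)


def to_seven_bit(text: str) -> str:
    """Replace every character above code 127 with a &#number; reference.

    Assumes the input is ISO-8859-1 text, so ord() gives the 8859-1 code.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    print(escape_markup('if a < b & c > d, say "ok"'))
    safe = to_seven_bit("caf\u00e9 cr\u00e8me")
    print(safe)
    # Raises UnicodeEncodeError if any 8-bit character survived.
    safe.encode("ascii")
```

Order matters if you use both: run escape_markup first and to_seven_bit
second, or the & in each &#number; gets escaped a second time.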

(Are there any fully 8859-1 spell checkers, I wonder innocently, to myself.)

See below for excerpts from RFC 1866, and also excerpts from these online
resources:

http://www.uni-passau.de/~ramsch/iso8859-1.html
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html

One more resource has GIFs of ISO 8859-1 through 8859-10 (codes 160-255),
which illustrate the sets without the chance of code page conflicts.  I had
a tough time connecting, so I couldn't view the images myself:

http://www.cs.tu-berlin.de/~czyborra/charsets/

Dave
darsal@tezcat.com
--

from RFC 1866

>4.1. text/html media type
>   This specification defines the Internet Media Type [IMEDIA] (formerly
>   referred to as the Content Type [MIME]) called `text/html'. The
>   following is to be registered with [IANA].
(...)
>    Charset
>
>            The charset parameter (as defined in section 7.1.1 of
>            RFC 1521[MIME]) may be given to specify the character
>            encoding scheme used to represent the HTML document as a
*>           sequence of octets. The default value is outside the
*>           scope of this specification; but for example, the
*>           default is `US-ASCII' in the context of MIME mail, and
*>           `ISO-8859-1' in the context of HTTP [HTTP].
>
>4.2. HTML Document Representation
>   A message entity with a content type of `text/html' represents an
>   HTML document, consisting of a single text entity. The `charset'
>   parameter (whether implicit or explicit) identifies a character
>   encoding scheme. The text entity consists of the characters
>   determined by this character encoding scheme and the octets of the
>   body of the message entity.
>   
>4.2.1. Undeclared Markup Error Handling

I point to sec. 4.2.1. often.  Please look it up at your leisure.  (See, I
just did it again!)
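
The charset defaults in the sec. 4.1 excerpt above amount to a small lookup.
Here is a rough Python sketch of it, under my own assumptions: the function
name effective_charset and the transport labels "http" and "mail" are made up
for the example, and only the two defaults the RFC mentions are covered.

```python
from email.message import Message

# Defaults quoted in RFC 1866 section 4.1.  The keys "http" and "mail" are
# my own shorthand, not anything defined by the RFC.
DEFAULT_CHARSET = {"http": "iso-8859-1", "mail": "us-ascii"}


def effective_charset(content_type: str, transport: str) -> str:
    """Return the explicit charset parameter, else the transport's default."""
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lower-cased, or None if absent
    return explicit if explicit is not None else DEFAULT_CHARSET[transport]


if __name__ == "__main__":
    print(effective_charset("text/html", "http"))
    print(effective_charset("text/html", "mail"))
    print(effective_charset("text/html; charset=ISO-8859-2", "mail"))
```

The same text/html body is read differently depending on how it arrived,
unless the sender states the charset explicitly.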

This next resource has a table, with other good links at the bottom:
http://www.uni-passau.de/~ramsch/iso8859-1.html

>Please note that there is nothing wrong with using characters of ISO 
>Latin-1 above 127: HTTP/1.0 uses the 8bit ISO latin-1 as default encoding. 
>(Thanks to Roman Czyborra for pointing this out!) 

http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99

>Character Entity Sets
>
>The HTML DTD defines the following entities. They represent particular 
>graphic characters which have special meanings in places in the markup, or 
>may not be part of the character set available to the writer. 
>
>Numeric and Special Graphic Entity Set
>
>The following table lists each of the characters included from the Numeric 
>and Special Graphic entity set, along with its name, syntax for use, and 
>description. This list is derived from `ISO Standard 8879:1986//ENTITIES 
>Numeric and Special Graphic//EN'. However, HTML does not include for the 
>entire entity set -- only the entities listed below are included. 
>
>GLYPH    NAME    SYNTAX  DESCRIPTION
> <       lt      &lt;    Less than sign
> >       gt      &gt;    Greater than sign
> &       amp     &amp;   Ampersand
> "       quot    &quot;  Double quote sign
>
>ISO Latin 1 Character Entity Set
>
>The following public text lists each of the characters specified in the 
>Added Latin 1 entity set, along with its name, syntax for use, and 
>description. This list is derived from ISO Standard 8879:1986//ENTITIES 
>Added Latin 1//EN. HTML includes the entire entity set. 
>
><!-- (C) International Organization for Standardization 1986
>     Permission to copy in any form is granted for use with
>     conforming SGML systems and applications as defined in
>     ISO 8879, provided this notice is included in all copies.
>-->
><!-- Character entity set. Typical invocation:
>     <!ENTITY % ISOlat1 PUBLIC
>       "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
>     %ISOlat1;
>-->
><!--    Modified for use in HTML        
>$Id: ISOlat1.sgml,v 1.2 1994/11/30 23:45:12 connolly Exp $ -->
><!ENTITY AElig  CDATA "&#198;" -- capital AE diphthong (ligature) -->
><!ENTITY Aacute CDATA "&#193;" -- capital A, acute accent -->
(...)
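
The two entity declarations above simply pair a name with an 8859-1 code.
As a quick cross-check, here is a Python sketch using the standard library's
html.entities table.  Note the assumption: that table follows a later HTML
entity list, not this exact ISOlat1 file, and numeric_reference is a name I
made up for the example.

```python
from html.entities import name2codepoint


def numeric_reference(name: str) -> str:
    """Return the &#number; form for a named entity, e.g. AElig."""
    return f"&#{name2codepoint[name]};"


if __name__ == "__main__":
    for name in ("AElig", "Aacute"):
        code = name2codepoint[name]
        # In ISO-8859-1 the character code is also the single octet value.
        octet = chr(code).encode("latin-1")
        print(name, numeric_reference(name), octet)
```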

ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

>These are the official names for character sets that may be used in
>the Internet and may be referred to in Internet documentation.  These
>names are expressed in ANSI_X3.4-1968 which is commonly called
>US-ASCII or simply ASCII.  The character set most commonly use in the
>Internet and used especially in protocol standards is US-ASCII, this
>is strongly encouraged.  The use of the name US-ASCII is also
>encouraged.
(...)
>Name: ISO-8859-1
>MIBenum: 1004
>Source: IBM Latin-1 SAA Core Coded Character Set.
>        Extended ISO 8859-1 Presentation Set, GCSGID: 2039
>Alias: csUnicodeIBM2039

NOTE: The following excerpt -really- needs to be placed in the correct
context.  There are many other points covered by this document, and I take
no responsibility for any misunderstanding.  The full document is at
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html

(...)
>Several different character codes feature in the discussion below. (Except 
>for EBCDIC) they are all extensions of the 7-bit US-ASCII code, and 
>therefore they coincide with US-ASCII and with each other in the lower 
>half, code points 0-127 (decimal). In the upper half they differ, both in 
>the repertoire of glyphs which they represent, and in the assignment of 
>glyphs to code points. In this note we do not need to consider national 
>variants of ASCII (as laid down in the old standard, ISO646), in which one 
>or more code points differ from the 7-bit US-ASCII code, e.g the UK 
>variant that has a pound sterling where US-ASCII has the dollar. Nor do we 
>consider the use of the 8th bit as a parity bit, this is irrelevant to and 
>incompatible with our discussion. 
(...)
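
The point about the lower and upper halves is easy to check in Python.  This
is just a sketch: cp437 (the IBM PC code page) and mac_roman are two 8-bit
codes I picked as examples, not ones named in the excerpt, and decode_all is
a made-up helper.

```python
# Codec names are Python's own; the choice of codes is only illustrative.
CODECS = ["latin-1", "cp437", "mac_roman"]

LOWER_HALF = bytes(range(128))  # code points 0-127


def decode_all(octets: bytes) -> dict:
    """Decode the same octets under each example 8-bit code."""
    return {codec: octets.decode(codec) for codec in CODECS}


if __name__ == "__main__":
    # Lower half: every codec yields the same text.
    print(len(set(decode_all(LOWER_HALF).values())))
    # Upper half: one octet, a different character under each codec.
    print(decode_all(b"\xe9"))
```
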
>The HTTP specification mandates the use of the code ISO8859-1 as the 
>default character code that is passed over the network. The HTML 
>specification is also formulated in terms of the ISO8859-1 code, and an 
>HTML document that is transmitted using the HTTP protocol is by default in 
>the ISO8859-1 code (note - if an HTTP document is transmitted by MIME mail 
>then the default encoding is US-ASCII, see HTML2.0 spec for details).
(...)
>As far as authors of HTML is concerned, character coding is an issue for 
>them in two contexts: (1) where authors create files that actually contain 
>characters from the upper half of the 8-bit code table, and (2) where they 
>refer to such characters by their &#number; representation. If authors 
>confine their use of characters to the low half of the 8-bit table (i.e 
>the area defined by the US-ASCII 7-bit code), and represent any characters 
>from the upper half by their &entity; (which is to be preferred, where an 
>entity name is available) or by their &#number; representation, then point 
>(1) is not an issue, and furthermore, when transferring files between 
>platforms by various means - ARPA FTP, email, diskette etc. - there is no 
>need to worry which particular 8-bit code is native to the sending and 
>receiving platforms. For these reasons, this is an approach that is much 
>to be recommended. Where a file has been composed in another form (for 
>example, by typing in accented characters using a non-English-language 
>keyboard), it might be wise to use one of the utility programs that 
>convert to an & representation of the characters in question. 
(...)
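
A conversion utility of the kind that excerpt recommends could look roughly
like this Python sketch.  It prefers an entity name where one is available
and falls back to &#number; otherwise.  Assumptions: to_entities is a
hypothetical name, the input is taken to be ISO-8859-1, and the name table
comes from Python's html.entities, which knows more names than the HTML 2.0
set does, so an older browser may not recognize every name it emits.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Rewrite characters above code 127 as &name; or &#number;."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # 7-bit characters pass through untouched
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity preferred
        else:
            out.append(f"&#{code};")  # no name available, use the number
    return "".join(out)


if __name__ == "__main__":
    raw = b"caf\xe9 \xc6r\xf8"  # octets as typed on an 8859-1 keyboard
    print(to_entities(raw.decode("latin-1")))
```

Mark-up characters are deliberately left alone here, so the function can be
run over a whole HTML file without damaging its tags.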

Dave
darsal@tezcat.com

  Note: Unsolicited email of a commercial nature to this address
        (darsal@tezcat.com) carries a $150 fee for
        advertisement. See http://www.tezcat.com/tezcat-aup.html
        for more details.