- From: Dave Salovesh <darsal@tezcat.com>
- Date: Wed, 17 Jan 1996 11:35:58 GMT
- To: www-html@w3.org, darsal@tezcat.com
I've done a bit of research since this came up, and I felt like summarizing
the current (17/Jan/1996) state of things. A long summary? Yes, due to
attached excerpts. Oh, well.
Needless to say, I live in the wrong part of the world to find this type of
information on any nearby sites. Why would all the US-ASCII folks want to
know any of this? :-) This thread could also be called 'What a difference a
bit makes!'
1) HTML uses ISO-8859-1, an 8-bit character set, codes 0-255, by default.
8859-1 is the current default for HTTP - HTML documents may fully use the
8859-1 set in the context of HTTP. There is no need to use codes or entity
names (7-bit expressions) for 8859-1 characters, within the limits of your
text editor and keyboard.
2) Codes or names -must- be used to replace characters which would otherwise
be interpreted as mark-up. There are four [<>&"], and they conform to ISO
standards for their codes and names. Other codes or names from 8859-1 may
be used to avoid similar confusion, e.g., [/\-_].
3) Either the server or the browser may be responsible for converting codes
or names into 8859-1 (or other) characters. In practice, 7-bit expressions
seem to be passed to the browser for conversion, but I couldn't verify this.
4) Other parts of the internet may not accept 8-bitness. Mail and FTP
(unless specified as binary) use 7 bits, codes 0-127. This is almost always
the US-ASCII char set, but not universally. You can spare yourself some
mystery by always using a 7-bit expression for an 8-bit 8859-1 character.
This would clearly be preferred for English in the US, but less of an issue
for documents in other languages like French or Spanish where support for
8859-1 is more likely to be found throughout the system, one would hope.
(Are there any fully 8859-1 spell checkers, I wonder innocently, to myself.)
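To make point 4 concrete, here is a minimal Python sketch of what can happen to an 8859-1 character on a 7-bit path. The helper name strip_high_bit is made up for illustration, and it assumes the channel simply clears the top bit of every octet. Real mail or FTP gateways may instead reject, escape, or otherwise mangle the data.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the top bit of every octet.

    Assumption: the channel clears bit 8; real gateways may behave differently.
    """
    return bytes(b & 0x7F for b in data)


text = "caf\xe9"                        # "cafe" with e-acute, code 233 in 8859-1
raw = text.encode("iso-8859-1")         # one octet per character: b'caf\xe9'
mangled = strip_high_bit(raw).decode("ascii")   # 0xE9 & 0x7F == 0x69, which is 'i'

safe = "caf&eacute;"                    # the same text as a 7-bit expression
survived = strip_high_bit(safe.encode("ascii")).decode("ascii")

print(mangled)    # cafi
print(survived)   # caf&eacute;
```

The entity form uses only codes 0-127, so it comes through unchanged, while the raw 8-bit octet silently turns into a different letter.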
See below for excerpts from RFC 1866, and also excerpts from these online
resources:
http://www.uni-passau.de/~ramsch/iso8859-1.html
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
One more resource has GIFs of ISO 8859-1 through 8859-10 (codes 160-255),
to illustrate the sets without the chance of code page conflicts. I had a
tough time connecting, so I couldn't view the images:
http://www.cs.tu-berlin.de/~czyborra/charsets/
Dave
darsal@tezcat.com
--
from RFC 1866
>4.1. text/html media type
> This specification defines the Internet Media Type [IMEDIA] (formerly
> referred to as the Content Type [MIME]) called `text/html'. The
> following is to be registered with [IANA].
(...)
> Charset
>
> The charset parameter (as defined in section 7.1.1 of
> RFC 1521[MIME]) may be given to specify the character
> encoding scheme used to represent the HTML document as a
*> sequence of octets. The default value is outside the
*> scope of this specification; but for example, the
*> default is `US-ASCII' in the context of MIME mail, and
*> `ISO-8859-1' in the context of HTTP [HTTP].
>
>4.2. HTML Document Representation
> A message entity with a content type of `text/html' represents an
> HTML document, consisting of a single text entity. The `charset'
> parameter (whether implicit or explicit) identifies a character
> encoding scheme. The text entity consists of the characters
> determined by this character encoding scheme and the octets of the
> body of the message entity.
>
>4.2.1. Undeclared Markup Error Handling
I point to sec. 4.2.1. often. Please look it up at your leisure. (See, I
just did it again!)
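The charset rule quoted from section 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name effective_charset and the context labels "http" and "mail" are invented here, and the two defaults are the examples the RFC text gives, not a complete list.

```python
from email.message import Message

# Defaults named as examples in RFC 1866, section 4.1 (lower-cased here).
DEFAULT_CHARSET = {"http": "iso-8859-1", "mail": "us-ascii"}


def effective_charset(content_type: str, context: str) -> str:
    """Return the explicit charset parameter if present, else the context default."""
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()   # lower-cased value, or None if absent
    return explicit if explicit else DEFAULT_CHARSET[context]


print(effective_charset("text/html", "http"))                      # iso-8859-1
print(effective_charset("text/html", "mail"))                      # us-ascii
print(effective_charset("text/html; charset=ISO-8859-2", "mail"))  # iso-8859-2
```

An explicit charset parameter always wins; the transport only matters when the parameter is missing.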
This next resource has a table, with other good links at the bottom:
http://www.uni-passau.de/~ramsch/iso8859-1.html
>Please note that there is nothing wrong with using characters of ISO
>Latin-1 above 127: HTTP/1.0 uses the 8bit ISO latin-1 as default encoding.
>(Thanks to Roman Czyborra for pointing this out!)
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99
>Character Entity Sets
>
>The HTML DTD defines the following entities. They represent particular
>graphic characters which have special meanings in places in the markup, or
>may not be part of the character set available to the writer.
>
>Numeric and Special Graphic Entity Set
>
>The following table lists each of the characters included from the Numeric
>and Special Graphic entity set, along with its name, syntax for use, and
>description. This list is derived from `ISO Standard 8879:1986//ENTITIES
>Numeric and Special Graphic//EN'. However, HTML does not include for the
>entire entity set -- only the entities listed below are included.
>
>GLYPH  NAME  SYNTAX  DESCRIPTION
><      lt    &lt;    Less than sign
>>      gt    &gt;    Greater than sign
>&      amp   &amp;   Ampersand
>"      quot  &quot;  Double quote sign
>
>ISO Latin 1 Character Entity Set
>
>The following public text lists each of the characters specified in the
>Added Latin 1 entity set, along with its name, syntax for use, and
>description. This list is derived from ISO Standard 8879:1986//ENTITIES
>Added Latin 1//EN. HTML includes the entire entity set.
>
><!-- (C) International Organization for Standardization 1986
> Permission to copy in any form is granted for use with
> conforming SGML systems and applications as defined in
> ISO 8879, provided this notice is included in all copies.
>-->
><!-- Character entity set. Typical invocation:
> <!ENTITY % ISOlat1 PUBLIC
> "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
> %ISOlat1;
>-->
><!-- Modified for use in HTML
>$Id: ISOlat1.sgml,v 1.2 1994/11/30 23:45:12 connolly Exp $ -->
><!ENTITY AElig CDATA "&#198;" -- capital AE diphthong (ligature) -->
><!ENTITY Aacute CDATA "&#193;" -- capital A, acute accent -->
(...)
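As a rough illustration of the two entity sets quoted above, the Python standard library can escape the four markup characters and look up the Latin 1 entity names. This is a sketch, not how any 1996 server or browser does it. Note that html.escape with quote=True also escapes the single quote, which is not one of the four characters in the table, and that html.entities follows the later HTML 4 list, a superset of the Added Latin 1 set.

```python
import html
from html.entities import name2codepoint

# Escape the characters that would otherwise be read as mark-up.
escaped = html.escape('if a < b && c > "d"', quote=True)
print(escaped)   # if a &lt; b &amp;&amp; c &gt; &quot;d&quot;

# Entity names map to the same code numbers as in the ISOlat1 excerpt.
print(name2codepoint["AElig"])    # 198
print(name2codepoint["Aacute"])   # 193

# A name and its numeric code expand to the same character.
same = html.unescape("&AElig;") == html.unescape("&#198;") == chr(198)
print(same)      # True
```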
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
>These are the official names for character sets that may be used in
>the Internet and may be referred to in Internet documentation. These
>names are expressed in ANSI_X3.4-1968 which is commonly called
>US-ASCII or simply ASCII. The character set most commonly use in the
>Internet and used especially in protocol standards is US-ASCII, this
>is strongly encouraged. The use of the name US-ASCII is also
>encouraged.
(...)
>Name: ISO-8859-1
>MIBenum: 1004
>Source: IBM Latin-1 SAA Core Coded Character Set.
> Extended ISO 8859-1 Presentation Set, GCSGID: 2039
>Alias: csUnicodeIBM2039
NOTE: The following excerpt -really- needs to be placed in the correct
context. There are many other points covered by this document, and I take
no responsibility for any misunderstanding. The full document is at
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
(...)
>Several different character codes feature in the discussion below. (Except
>for EBCDIC) they are all extensions of the 7-bit US-ASCII code, and
>therefore they coincide with US-ASCII and with each other in the lower
>half, code points 0-127 (decimal). In the upper half they differ, both in
>the repertoire of glyphs which they represent, and in the assignment of
>glyphs to code points. In this note we do not need to consider national
>variants of ASCII (as laid down in the old standard, ISO646), in which one
>or more code points differ from the 7-bit US-ASCII code, e.g the UK
>variant that has a pound sterling where US-ASCII has the dollar. Nor do we
>consider the use of the 8th bit as a parity bit, this is irrelevant to and
>incompatible with our discussion.
(...)
>The HTTP specification mandates the use of the code ISO8859-1 as the
>default character code that is passed over the network. The HTML
>specification is also formulated in terms of the ISO8859-1 code, and an
>HTML document that is transmitted using the HTTP protocol is by default in
>the ISO8859-1 code (note - if an HTTP document is transmitted by MIME mail
>then the default encoding is US-ASCII, see HTML2.0 spec for details).
(...)
>As far as authors of HTML is concerned, character coding is an issue for
>them in two contexts: (1) where authors create files that actually contain
>characters from the upper half of the 8-bit code table, and (2) where they
>refer to such characters by their &#number; representation. If authors
>confine their use of characters to the low half of the 8-bit table (i.e
>the area defined by the US-ASCII 7-bit code), and represent any characters
>from the upper half by their &entity; (which is to be preferred, where an
>entity name is available) or by their &#number; representation, then point
>(1) is not an issue, and furthermore, when transferring files between
>platforms by various means - ARPA FTP, email, diskette etc. - there is no
>need to worry which particular 8-bit code is native to the sending and
>receiving platforms. For these reasons, this is an approach that is much
>to be recommended. Where a file has been composed in another form (for
>example, by typing in accented characters using a non-English-language
>keyboard), it might be wise to use one of the utility programs that
>convert to an & representation of the characters in question.
(...)
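The kind of utility mentioned at the end of that excerpt can be sketched in Python as follows. The function name to_seven_bit is hypothetical, and the name table comes from Python's html.entities, which follows the later HTML 4 list and not only the HTML 2.0 Latin 1 set. The sketch leaves codes 0-127 alone, so existing mark-up is not touched and the four special characters are not escaped.

```python
from html.entities import codepoint2name


def to_seven_bit(text: str) -> str:
    """Replace every character above code 127 with a 7-bit expression.

    Prefers &name; where a name is known, otherwise falls back to &#number;.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                           # already 7-bit safe
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")    # named entity
        else:
            out.append(f"&#{cp};")                   # numeric fallback
    return "".join(out)


print(to_seven_bit("\xc6ther na\xefve"))   # &AElig;ther na&iuml;ve
print(to_seven_bit("<b>caf\xe9</b>"))      # <b>caf&eacute;</b>
```

The result contains only codes 0-127, so it should survive mail, ASCII-mode FTP, and moves between platforms with different native 8-bit codes.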
Received on Wednesday, 17 January 1996 06:37:01 UTC