From: email@example.com (Dave Salovesh)
To: firstname.lastname@example.org, email@example.com
Subject: Re: International chars in HTML files-(LONG)
Date: Wed, 17 Jan 1996 11:35:58 GMT
Message-Id: <firstname.lastname@example.org>
In-Reply-To: <email@example.com>

I've done a bit of research since this came up, and I felt like
summarizing the current (17/Jan/1996) state of things. A long summary?
Yes, due to attached excerpts. Oh, well.

Needless to say, I live in the wrong part of the world to find this type
of information on any nearby sites. Why would all the US-ASCII folks want
to know any of this? :-)

This thread could also be called 'What a difference a bit makes!'

1) HTML uses ISO-8859-1, an 8-bit character set, codes 0-255, by default.
8859-1 is the current default for HTTP - HTML documents may fully use the
8859-1 set in the context of HTTP. There is no need to use codes or
entity names (7-bit expressions) for 8859-1 characters, within the limits
of your text editor and keyboard.

2) Codes or names -must- be used to replace characters which would
otherwise be interpreted as mark-up. There are four [<>&"], and they
conform to ISO standards for their codes and names. Other codes or names
from 8859-1 may be used to avoid similar confusion, e.g., [/\-_].

3) Either the server or the browser may be responsible for converting
codes or names into 8859-1 (or other) characters. In practice, 7-bit
expressions seem to be passed to the browser for conversion, but I
couldn't verify this.

4) Other parts of the internet may not accept 8-bitness. Mail and FTP
(unless specified as binary) use 7 bits, codes 0-127. This is almost
always the US-ASCII char set, but not universally. You can spare yourself
some mystery by always using a 7-bit expression for an 8-bit 8859-1
character.
This would clearly be preferred for English in the US, but less of an
issue for documents in other languages like French or Spanish where
support for 8859-1 is more likely to be found throughout the system, one
would hope. (Are there any fully 8859-1 spell checkers, I wonder
innocently, to myself.)

See below for excerpts from RFC 1866, and also excerpts from these online
resources:

http://www.uni-passau.de/~ramsch/iso8859-1.html
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html

This last resource has gifs of ISO 8859-1 through 8859-10 (codes
160-255), to illustrate the sets without the chance of code page
conflicts. I had a tough time connecting, so I couldn't view the images:

http://www.cs.tu-berlin.de/~czyborra/charsets/

Dave
firstname.lastname@example.org

-- from RFC 1866

>4.1. text/html media type
> This specification defines the Internet Media Type [IMEDIA] (formerly
> referred to as the Content Type [MIME]) called `text/html'. The
> following is to be registered with [IANA].
(...)
> Charset
>
> The charset parameter (as defined in section 7.1.1 of
> RFC 1521[MIME]) may be given to specify the character
> encoding scheme used to represent the HTML document as a
*> sequence of octets. The default value is outside the
*> scope of this specification; but for example, the
*> default is `US-ASCII' in the context of MIME mail, and
*> `ISO-8859-1' in the context of HTTP [HTTP].
>
>4.2. HTML Document Representation
> A message entity with a content type of `text/html' represents an
> HTML document, consisting of a single text entity. The `charset'
> parameter (whether implicit or explicit) identifies a character
> encoding scheme. The text entity consists of the characters
> determined by this character encoding scheme and the octets of the
> body of the message entity.
>
>4.2.1.
>Undeclared Markup Error Handling

I point to sec. 4.2.1 often. Please look it up at your leisure. (See, I
just did it again!)

This next resource has a table, with other good links at the bottom:

http://www.uni-passau.de/~ramsch/iso8859-1.html
>Please note that there is nothing wrong with using characters of ISO
>Latin-1 above 127: HTTP/1.0 uses the 8bit ISO latin-1 as default encoding.
>(Thanks to Roman Czyborra for pointing this out!)

http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_9.html#SEC99
>Character Entity Sets
>
>The HTML DTD defines the following entities. They represent particular
>graphic characters which have special meanings in places in the markup, or
>may not be part of the character set available to the writer.
>
>Numeric and Special Graphic Entity Set
>
>The following table lists each of the characters included from the Numeric
>and Special Graphic entity set, along with its name, syntax for use, and
>description. This list is derived from `ISO Standard 8879:1986//ENTITIES
>Numeric and Special Graphic//EN'. However, HTML does not include for the
>entire entity set -- only the entities listed below are included.
>
> GLYPH   NAME    SYNTAX    DESCRIPTION
> <       lt      &lt;      Less than sign
> >       gt      &gt;      Greater than sign
> &       amp     &amp;     Ampersand
> "       quot    &quot;    Double quote sign
>
>ISO Latin 1 Character Entity Set
>
>The following public text lists each of the characters specified in the
>Added Latin 1 entity set, along with its name, syntax for use, and
>description. This list is derived from ISO Standard 8879:1986//ENTITIES
>Added Latin 1//EN. HTML includes the entire entity set.
>
><!-- (C) International Organization for Standardization 1986
> Permission to copy in any form is granted for use with
> conforming SGML systems and applications as defined in
> ISO 8879, provided this notice is included in all copies.
>-->
><!-- Character entity set.
> Typical invocation:
> <!ENTITY % ISOlat1 PUBLIC
> "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
> %ISOlat1;
>-->
><!-- Modified for use in HTML
>$Id: ISOlat1.sgml,v 1.2 1994/11/30 23:45:12 connolly Exp $ -->
><!ENTITY AElig CDATA "&#198;" -- capital AE diphthong (ligature) -->
><!ENTITY Aacute CDATA "&#193;" -- capital A, acute accent -->
(...)

ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
>These are the official names for character sets that may be used in
>the Internet and may be referred to in Internet documentation. These
>names are expressed in ANSI_X3.4-1968 which is commonly called
>US-ASCII or simply ASCII. The character set most commonly use in the
>Internet and used especially in protocol standards is US-ASCII, this
>is strongly encouraged. The use of the name US-ASCII is also
>encouraged.
(...)
>Name: ISO-8859-1
>MIBenum: 1004
>Source: IBM Latin-1 SAA Core Coded Character Set.
> Extended ISO 8859-1 Presentation Set, GCSGID: 2039
>Alias: csUnicodeIBM2039

NOTE: The following excerpt -really- needs to be placed in the correct
context. There are many other points covered by this document, and I take
no responsibility for any misunderstanding. The full document is at

http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
(...)
>Several different character codes feature in the discussion below. (Except
>for EBCDIC) they are all extensions of the 7-bit US-ASCII code, and
>therefore they coincide with US-ASCII and with each other in the lower
>half, code points 0-127 (decimal). In the upper half they differ, both in
>the repertoire of glyphs which they represent, and in the assignment of
>glyphs to code points. In this note we do not need to consider national
>variants of ASCII (as laid down in the old standard, ISO646), in which one
>or more code points differ from the 7-bit US-ASCII code, e.g the UK
>variant that has a pound sterling where US-ASCII has the dollar.
>Nor do we consider the use of the 8th bit as a parity bit, this is
>irrelevant to and incompatible with our discussion.
(...)
>The HTTP specification mandates the use of the code ISO8859-1 as the
>default character code that is passed over the network. The HTML
>specification is also formulated in terms of the ISO8859-1 code, and an
>HTML document that is transmitted using the HTTP protocol is by default in
>the ISO8859-1 code (note - if an HTTP document is transmitted by MIME mail
>then the default encoding is US-ASCII, see HTML2.0 spec for details).
(...)
>As far as authors of HTML is concerned, character coding is an issue for
>them in two contexts: (1) where authors create files that actually contain
>characters from the upper half of the 8-bit code table, and (2) where they
>refer to such characters by their &#number; representation. If authors
>confine their use of characters to the low half of the 8-bit table (i.e
>the area defined by the US-ASCII 7-bit code), and represent any characters
>from the upper half by their &entity; (which is to be preferred, where an
>entity name is available) or by their &#number; representation, then point
>(1) is not an issue, and furthermore, when transferring files between
>platforms by various means - ARPA FTP, email, diskette etc. - there is no
>need to worry which particular 8-bit code is native to the sending and
>receiving platforms. For these reasons, this is an approach that is much
>to be recommended. Where a file has been composed in another form (for
>example, by typing in accented characters using a non-English-language
>keyboard), it might be wise to use one of the utility programs that
>convert to an & representation of the characters in question.
(...)

Dave
email@example.com

Note: Unsolicited email of a commercial nature to this address
(firstname.lastname@example.org) carries a $150 fee for advertisement.
See http://www.tezcat.com/tezcat-aup.html for more details.
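
P.S. The last excerpt suggests running a utility that converts upper-half
characters to their &entity; or &#number; form. What follows is only a
rough sketch of such a converter in Python, not any particular tool named
above. The function name to_seven_bit is made up for illustration, and the
entity names come from Python's own html.entities table, which is newer
than the HTML 2.0 entity set quoted above, so treat the exact names a 1996
browser would accept as an assumption. Markup characters [<>&"] are left
alone here, on the assumption that the input is already an HTML file.

```python
from html.entities import codepoint2name


def to_seven_bit(text: str) -> str:
    """Rewrite characters above code 127 as 7-bit entity references.

    Hypothetical helper: prefers a named entity for codes 160-255 when
    Python's table has one, and falls back to a numeric reference.
    """
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Lower half: already plain US-ASCII, pass through untouched.
            pieces.append(ch)
        elif 160 <= code <= 255 and code in codepoint2name:
            # Upper half of 8859-1 with a known name, e.g. 233 -> &eacute;
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            # No name available (or outside 8859-1): use the code number.
            pieces.append("&#" + str(code) + ";")
    return "".join(pieces)


if __name__ == "__main__":
    print(to_seven_bit("café"))  # prints caf&eacute;
```

The result contains only codes 0-127, so it should survive 7-bit mail or
a non-binary FTP transfer unchanged.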