- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Thu, 13 Mar 2008 02:04:58 +0100
On 5th June 2007, Øistein E. Andersen wrote:

> (To do this properly, what we really ought to do is look for
> C1 and undefined characters in all IANA charsets and semi-official
> mappings to Unicode and check 1) whether the gaps can be filled
> by borrowing from other encodings, and 2) whether browsers
> actually do so. [...])

I have finally got round to looking at superset encodings.

To do this, I started with Unicode mappings from [UNI] for 8-bit
1-byte alphabet encodings and added mappings for other such encodings
implemented in Opera, Safari or Firefox, mostly from [CSETS], though
I made one for Windows-Sami-2 from a PDF. (I then discovered that IE
had something called Arabic-ASMO, for which no matching specification
could be found, and subsequently reverse-engineered all IE's
encodings. Most of these turned out to be identical to other mappings
or only add characters from the PUA, but some real differences were
found, and those are reported in the text below.)

[UNI] <http://unicode.org/Public/MAPPINGS/>
[CSETS] <http://crl.nmsu.edu/~mleisher/csets.html>

All the character repertoires and encoding vectors defined by the
mappings were then compared pairwise. (Codepoints mapped to C0, space,
BS or C1 were treated as unassigned, and directionality indicators for
Arabic and Hebrew were ignored.)

The result is quite a big and unreadable table [FULL], so the
repertoires and encodings were clustered, which gave rise to the
tables in [ENC], which compare charsets with fewer than 27
incompatible codepoints, as well as those in [REP], which compare
charsets with at most 60 characters not found in both repertoires.
(The thresholds are arbitrary, but more than sufficiently large to
ensure that all related charsets will be clustered together and at the
same time sufficiently small to keep the tables at a reasonable size.)
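The pairwise comparison and clustering described above can be sketched roughly as follows. This is only an illustration under several assumptions, not the procedure actually used for the tables: Python's bundled codecs stand in for the [UNI] and [CSETS] mapping files, "incompatible" is taken to mean a byte assigned in both encodings but to different characters, the clustering is plain single-linkage, directionality indicators are not handled, and the function names (load_table, incompatible, repertoire_difference, cluster) are hypothetical.

```python
import unicodedata
from itertools import combinations


def load_table(codec):
    """Return {byte: character} for a single-byte codec.

    Bytes that do not decode, or that decode to a control character
    (C0/C1, Unicode category Cc) or to the space character, are
    treated as unassigned and left out of the table.
    """
    table = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # undefined in this codec
        if unicodedata.category(ch) == "Cc" or ch == " ":
            continue  # treated as unassigned
        table[byte] = ch
    return table


def incompatible(a, b):
    """Bytes assigned in both tables, but to different characters."""
    return sorted(byte for byte in a.keys() & b.keys() if a[byte] != b[byte])


def repertoire_difference(a, b):
    """Characters found in exactly one of the two repertoires."""
    return set(a.values()) ^ set(b.values())


def cluster(tables, threshold):
    """Single-linkage clustering (an assumption): two charsets are linked
    when they have fewer than `threshold` incompatible codepoints."""
    parent = {name: name for name in tables}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in combinations(tables, 2):
        if len(incompatible(tables[x], tables[y])) < threshold:
            parent[find(x)] = find(y)

    groups = {}
    for name in tables:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    names = ["iso8859_1", "cp1252", "iso8859_13", "cp1257", "koi8_r"]
    tables = {name: load_table(name) for name in names}
    for x, y in combinations(names, 2):
        print(x, y, len(incompatible(tables[x], tables[y])),
              len(repertoire_difference(tables[x], tables[y])))
    print(cluster(tables, 27))
```

With the subset notation used further down, x < y then corresponds to no incompatible bytes together with set(x.values()) < set(y.values()). Python's codec tables may differ in detail from the mappings compared here, so the counts printed by this sketch need not match [ENC] or [REP].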
[FULL] <http://coq.no/X/charset-table.html>
[ENC] <http://coq.no/X/charset-enc.html>
[REP] <http://coq.no/X/charset-rep.html>

A short summary of the most interesting/relevant results (supported
by [ENC]) can be found below.

--
Øistein E. Andersen

PS: How should colour be added to tables like these in HTML5 with
neither of the attributes bgcolor and style?

PPS: Some right-to-left characters contaminate surrounding characters
as I have not yet found a simple solution to make everything strictly
left-to-right (probably because I have not looked for it properly).

--------
Notation
--------

x < y: x is a proper subset of y

=====
ASCII
=====

Most of the charsets are ASCII-compatible; some are EBCDIC-based (none
of which are implemented in browsers, as far as I know).

The following are /almost/ ASCII-compatible:

CP864 uses Arabic per cent in place of the Latin sign.

JIS-201 replaces `reverse solidus' and `tilde' with `yen' and
`macron'.

See below for PostScript / NextStep.

======================================
Arabic, including MacArabic / MacFarsi
======================================

Both MacArabic and MacFarsi are close to being supersets of 8859-6.
The Macintosh encodings encode explicitly right-to-left characters
`dollar', `space' and `hyphen' in place of ISO's `generic currency
sign', `non-breaking space' and `soft hyphen'.

MS IE's so-called ASMO-708 (not treated as an 8859-6 alias as per
IANA) appears to be another rough superset of 8859-6, adding accented
lowercase letters for French and box-drawing characters, but
apparently no soft hyphen or non-breaking space.

MS IE also includes Arabic-DOS, which appears to be different from all
other encodings.

Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
ISO-2022-JP. This is something to keep in mind when looking at
multi-byte encodings.

==========
Baltic Rim
==========

Despite what Wikipedia says, 8859-13 and CP1257 are not actually
compatible; the latter puts `acute accent' and `high dot' where the
former has `left double quotation mark' and `right single quotation
mark'.

============
Cyrillic KOI
============

There are several KOI8-based encodings, all of which include the basic
modern Russian alphabet (except yo) in an ASCII-compatible sequence.

KOI8-unified is almost a superset of ISO-IR-111, but uppercase and
lowercase Ukrainian `Cyrillic g with upturn' replace `generic currency
sign' and `soft hyphen'.

IE's KOI-8-U is different as it includes short uppercase and lowercase
y instead of two box-drawing characters.

Comments:

KOI8-RU (as opposed to KOI8-R and KOI8-U) is apparently obsolete and
best forgotten.

KOI8-unified shows all letters from any KOI8-based encoding correctly.
It therefore seems like the best choice if distributional analysis
indicates KOI-8 of some description.

========
Georgian
========

GEO-STD-8 and GEO-PS are mostly compatible, except that the former has
`No' where the latter has `y acute'.

(GEO-STD-8 is supposedly supported by Firefox, but does not seem to
work for me, so I cannot test it.)

=====
Greek
=====

8859-7-1987a contains `modifier letter reversed comma' and `modifier
letter apostrophe' as opposed to `left single quotation mark' and
`right single quotation mark' in 8859-7-1987b. The original mappings
likely have something to do with the fact that Greek apostrophe is
supposed to have the same visual appearance as a soft breathing mark.

8859-7-1987b < 8859-7

CP1253 is close to the 8859-7 encodings (4--6 incompatible
assignments), but `capital alpha with acute' is placed at different
positions, which makes unification difficult.

In IE, Greek-ISO is based on 8859-7-1987a (+PUA).

======
Hebrew
======

Contrary to what Wikipedia says, the Unicode mappings suggest that
CP1255 and 8859-8 are not actually compatible; however, the only
incompatible assignment is `sheqel sign' in CP1255 v. `generic
currency sign' in 8859-8, which is not really more serious than what
Apple did when incorporating `euro'. Furthermore, the `double
underline' symbol, present in 8859-8 only, would need to be included
in a unified encoding.

IE's Hebrew-ISO uses a different macron or high horizontal line and
also uses PUA characters instead of ordinary right-to-left and
left-to-right marks.

==========================
MacCyrillic / MacUkrainian
==========================

Apple originally implemented MacCyrillic with `cent' and `partial
derivative', and MacUkrainian with uppercase and lowercase `Cyrillic g
with upturn'. The modern MacCyrillic is like the old MacUkrainian, but
with `euro' instead of `generic currency sign'.

Firefox lists MacUkrainian, but this appears to be a mere synonym for
modern MacCyrillic and not a separate encoding.

Suggestion: Implement modern MacCyrillic only.

========
MacGreek
========

Microsoft's table reflects an older version of the encoding in which
`soft hyphen' occupied the position now taken up by `euro'.
Additionally, Greek semicolon is mapped to U+0387 rather than U+00B7,
which is the preferred character according to Unicode 5.0. As for
MacIcelandic, only Apple's mapping includes `Apple' in PUA.

Firefox implements Apple's modern version.

============
MacIcelandic
============

Microsoft provides an older version with `generic currency sign'
instead of `euro'. Microsoft also uses `ohm symbol' where Apple has
chosen `uppercase omega', but these typically have identical
appearance. Furthermore, only Apple's mapping includes `Apple' in PUA.

Firefox implements Apple's modern version.

========
MacRoman
========

See: MacIcelandic

This encoding is obviously implemented in its modern version in Safari
as well.

==========
MacTurkish
==========

Ohm/omega and `Apple' as for MacIcelandic.

Apple maps a genuinely undefined codepoint to a special PUA character
which was also used for `euro' (but not in this case). As a result,
Firefox maps the undefined codepoint to `euro'.

=====================
PostScript / NextStep
=====================

I have seen documentation from Next clearly stating that the NextStep
encoding was designed as a superset of Adobe's PostScript Standard
Encoding, which in turn is based on a previous version of US-ASCII
which allowed what is now `straight apostrophe' and `grave accent' to
be interpreted as `curly apostrophe' and `left single quotation mark'.
Discrepancies between the two are likely to result from diverging
interpretations rather than intentional differences.

In any case, neither of these encodings seems to be particularly
relevant for HTML (although quite a few plain-text documents,
including this very e-mail, assume the old-style US-ASCII
interpretation).

====
Thai
====

TIS-620 < 8859-11 < CP874

MacThai is close to being a superset of TIS-620 and 8859-11 as well,
but unfortunately replaces three Thai letters with `TM', `(C)' and
`(R)'.

=======
Turkish
=======

8859-9 < CP1254

Unless an experimental error has occurred, Turkish-ISO and
Turkish-Windows both refer to the superset in IE.

==========
Vietnamese
==========

(TC)VN5712-2 < (TC)VN5712-1

Opera and Firefox seem to have implemented the superset only.

================
Western-European
================

8859-1 < CP1252

-------
THE END
-------
Received on Wednesday, 12 March 2008 18:04:58 UTC