[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range] from Øistein E. Andersen on 2008-03-13 (public-whatwg-archive@w3.org from March 2008)

From: Øistein E. Andersen <html5@xn--istein-9xa.com>
Date: Thu, 13 Mar 2008 02:04:58 +0100
Message-ID: <E1JZbsQ-000CtH-VF@node1-2.ouvaton.local>
On 5th June 2007, ?istein E. Andersen wrote:

> (To do this properly, what we really ought to do is look for
> C1 and undefined characters in all IANA charsets and semi-official
> mappings to Unicode and check 1) whether the gaps can be filled
> by borrowing from other encodings, and 2) whether browsers
> actually do so. [...])

I have finally got round to looking at superset encodings.

To do this, I started with Unicode mappings from [UNI] for 8-bit 1-byte
alphabet encodings and added mappings for other such encodings
implemented in Opera, Safari or Firefox, mostly from [CSETS], though
I made one for Windows-Sami-2 from a PDF.  (I then discovered that IE
had something called Arabic-ASMO, for which no matching specification 
could be found, and subsequently reverse-engineered all IE's encodings.
Most of these turned out to be identical to other mappings or only
add characters from the PUA, but some real differences were found,
and those are reported in the text below.)

    [UNI] <http://unicode.org/Public/MAPPINGS/>
    [CSETS] <http://crl.nmsu.edu/~mleisher/csets.html>

All the character repertoires and encoding vectors defined by the mappings
were then compared pairwise. (Codepoints mapped to C0, space, BS or C1
were treated as unassigned, and directionality indicators for Arabic and
Hebrew were ignored.) The result is quite a big and unreadable table
[FULL], so the repertoires and encodings were clustered, which gave rise to
the tables in [ENC], which compare charsets with less than 27 incompatible
codepoints, as well as those in [REP], which compare charsets with at most
60 characters not found in both repertoires. (The thresholds are arbitrary, but 
more than sufficiently large to assure that all related charsets will be
clustered together and at the sime time sufficiently small to keep the
tables at a reasonable size.)

    [FULL] <http://coq.no/X/charset-table.html>
    [ENC] <http://coq.no/X/charset-enc.html>
    [REP] <http://coq.no/X/charset-rep.html>

A short summary of the most interesting/relevant results (supported by [ENC])
can be found below.

-- 
?istein E. Andersen

PS: How should colour be added to tables like these in HTML5 with
    neither of the attributes bgcolor and style?

PPS: Some right-to-left characters contaminate surrounding characters as I
     have not yet found a simple solution to make everything strictly
     left-to-right (probably because I have not looked for it properly).



--------
Notation
--------

x < y:  x is a proper subset of y


=====
ASCII
=====

Most of the charsets are ASCII-compatible; some are EBCDIC-based
(none of which are implemented in browsers, as far as I know).

The following are /almost/ ASCII-compatible:

    CP864 uses Arabic per cent in place of of the Latin sign.

    JIS-201 replaces `reverse solidus' and `tilde' with `yen' and `macron'.

    See below for PostScript / NextStep.



======================================
Arabic, including MacArabic / MacFarsi
======================================

Both MacArabic and MacFarsi are close to being supersets of 8859-6.

The Macintosh encodings encode explicitly right-to-left characters `dollar'
`space' and `hyphen' in place of ISO's `generic currency sign', `non-
breaking space' and `soft hyphen'.

MS IE's so-called ASMO-708 (not treated as an 8859-6 alias as per IANA)
appears to be another rough superset of 8859-6, adding accented lowercase
letters for French and box-drawing characters, but apparently soft hyphen
or non-breaking space.

MS IE also includes Arabic-DOS, which appears to be different from all
other encodings.

Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
      ISO-2022-JP. This is something to keep in mind when looking at
      multi-byte encodings.



==========
Baltic Rim
==========

Despite what Wikipedia says, 8859-13 and CP1257 are not actually compatible;
the latter puts `acute accent' and `high dot' where the former has
`left double quotation mark' and `right single quotation mark'.



============
Cyrillic KOI
============

There are several KOI8-based encodings, all of which include the basic
Russian modern alphabet (except yo) in an ASCII-compatible sequence.

KOI8-unified is almost a superset of ISO-IR-111, but uppercase and
lowercase Ukrainian `Cyrillic g with upturn' replace `generic currency
sign' and `soft hyphen'.

IE's KOI-8-U is different as it includes short uppercase and lowercase
y instead of two box-drawing characters.

Comments:  KOI8-RU (as opposed to KOI8-R and KOI8-U) is apparently obsolete
           and best forgotten.

           KOI8-unified shows all letters from any KOI8-based encoding
           correctly.  This one therefore seems like the best choice
           if distributional analysis indicates KOI-8 of some description.


========
Georgian
========

GEO-STD-8 and GEO-PS are mostly compatible, except that the former has
`No' where the latter has `y acute'.

(GEO-STD-8 is supposedly supported by Firefox, but does not seem to work
for me, so I cannot test it.)



=====
Greek
=====

8859-7-1987a contains `modifier letter reversed comma' and `modifier letter
apostrophe' as opposed to `left single quotation mark' and `right single
quotation mark' in 8859-7-1987b.  The original mappings likely have something
to do with the fact that Greek apostrophe is supposed to have the same visual
appearance as a soft breathing mark.

8859-7-1987b < 8859-7

CP1253 is close to the 8859-7 encodings (4--6 incompatible assignments), but
`capital alpha with acute' is placed at different positions, which makes
unification difficult.

In IE, Greek-ISO is based on 8859-7-1987a (+PUA).



======
Hebrew
======

Unlike what Wikipedia says, the Unicode mappings suggest that CP1255 and 8859-8
are not actually compatible; however, the only incompatible assignment is `sheqel
sign' in CP1255 v. `generic currency sign' in 8859-8, which is not really more
serious than what Apple did when incorporating `euro'.

Furthermore, the `double underline' symbol present in 8859-8 only would need to
be included in a unified encoding.

IE's Hebrew-ISO uses a different macron or high horizontal line and also uses
PUA characters instead of ordinary right-to-left and left-to-right marks.



==========================
MacCyrillic / MacUkrainian
==========================

Apple originally implemented MacCyrillic with `cent' and `partial derivative',
and MacUkrainian with uppercase and lowercase `Cyrillic g with upturn'.

The modern MacCyrillic is like the old MacUkrainian, but with `euro' instead
of `generic currency sign'.

Firefox lists MacUkrainian, but this appears to be a mere synonym for modern
MacCyrillic and not a separate encoding.

Suggestion:  Implement modern MacCyrillic only.



========
MacGreek
========

Microsoft's table reflect an older version of the encoding in which `soft
hyphen' occupied the position now taken up by `euro'.  Additionally, Greek
semicolon is mapped to U+0387 rather than U+00B7, which is the preferred
character according to Unicode 5.0.

As for MacIcelandic, only Apple's mapping includes `Apple' in PUA.

Firefox implements Apple's modern version.



============
MacIcelandic
============

Microsoft provides an older version with `generic currency sign' instead of
`euro'.  Microsoft also uses `ohm symbol' where Apple has chosen
`uppercase omega', but these typically have identical appearance.

Furthermore, only Apple's mapping includes `Apple' in PUA.

Firefox implements Apple's modern version.



========
MacRoman
========

See: MacIcelandic

This encoding is obviously implemented in its modern version in Safari as well.



==========
MacTurkish
==========

Ohm/omega and `Apple' as for MacIcelandic.

Apple maps a genuinely undefined codepoint to a special PUA character
which was also used for `euro' (but not in this case). As a result,
Firefox maps the undefined codepoint to `euro'.



=====================
PostScript / NextStep
=====================

I have seen documentation from Next clearly stating that the NextStep
encoding was designed as a superset of Adobe's PostScript Standard
Encoding, which again is based on a previous version of US-ASCII which
allowed what is now `straight apostrophe' and `grave accent' to be
interpreted as `curly apostrophe' and `left single quotation mark'.
Discrepancies between the two are likely to result from diverging
interpretations rather than intentional differences.

In any case, neither of these encodings seem to be particularly relevant
for HTML (although quite a few plain-text documents, including this very
e-mail, assume the old-style US-ASCII interpretation.)



====
Thai
====

TIS-620 < 8859-11 < CP874

MacThai is close to being a superset of TIS-620 and 8859-11 as well, but
unfortunately replaces three Thai letters with `TM', `(C)' and `(R)'.



=======
Turkish
=======

8859-9 < CP1254

Unless an experimental error has occurred, Turkish-ISO and Turkish-Windows
both refer to the superset in IE.


==========
Vietnamese
==========

(TC)VN5712-2 < (TC)VN5712-1

Opera and Firefox seem to have implemented the superset only.



================
Western-European
================

8859-1 < CP1252



-------
THE END
-------
Received on Wednesday, 12 March 2008 18:04:58 UTC