Re: Ascii as a subset from Frank Ellermann on 2007-05-21 (www-validator@w3.org from May 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Mon, 21 May 2007 22:47:56 +0200
To: www-validator@w3.org
Message-ID: <4652057B.1AC@xyzzy.claranet.de>

Dana C. Chandler III wrote:

> Is there a definitive list of Character sets that have ASCII
> as a subset?

If you find one, please post its URL.  You could construct it by
installing ICU and then checking for which charsets every ASCII
octet is mapped to the same ASCII character, and nothing else
is mapped to ASCII characters.
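
A rough sketch of that check, swapping in Python's built-in codecs
for ICU (that swap is my assumption, ICU's mapping tables can
differ, and Python's IBM codepage tables should not show the
rotation mentioned below).  The function name and the candidate
list are made up for illustration.  It only tests single octets,
so it is a necessary condition, not a sufficient one: a multi-byte
charset that reuses ASCII octets as trail bytes can still pass.

```python
def maps_ascii_to_itself(codec):
    """Single-octet test: 0x00-0x7F round-trip to the same ASCII
    characters, and no octet 0x80-0xFF decodes to an ASCII character."""
    try:
        for b in range(0x80):
            octet = bytes([b])
            # The lone octet must decode to the same code point...
            if octet.decode(codec) != chr(b):
                return False
            # ...and the character must encode back to that one octet.
            if chr(b).encode(codec) != octet:
                return False
    except (UnicodeError, LookupError):
        # Undecodable octet, unknown codec, or not a text encoding.
        return False
    for b in range(0x80, 0x100):
        try:
            decoded = bytes([b]).decode(codec)
        except UnicodeError:
            continue  # undefined or incomplete alone: nothing maps to ASCII
        if any(ord(ch) < 0x80 for ch in decoded):
            return False
    return True


# Hypothetical candidate list, using Python's codec names.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "cp437", "cp850", "cp858",
              "utf-16", "utf-32", "utf-7", "cp037"]

if __name__ == "__main__":
    for name in CANDIDATES:
        print(name, maps_ascii_to_itself(name))
```

With Python's tables the UTF-8, Latin-1, windows-1252 and codepage
entries should pass and the rest should fail, but treat that as a
first filter only.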

It also depends on your definition: UTF-16 and UTF-32 don't
have ASCII as a subset if you talk about octets (8 bits).  UTF-7
and UTF-1 also don't qualify.  You'd also have to watch out for
all the charsets with code-switching (SCSU etc.), where some
states can make an ASCII octet stand for something other than
the ASCII character.
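
To make the octet argument concrete, here is a small sketch, again
with Python's codecs as a stand-in.  The helper name is invented,
and ISO-2022-JP is my choice of a stateful example (Python has no
SCSU codec as far as I know).  It encodes a string and then
misreads the resulting octets as if they were ASCII.

```python
def octets_read_as_ascii(text, codec):
    """Encode text with codec, then (mis)read the octets as ASCII.
    Returns the raw octets and what an ASCII-only reader would see."""
    raw = text.encode(codec)
    return raw, raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    # UTF-16: "A" takes two octets, 0x00 0x41, not the single octet 0x41.
    print(octets_read_as_ascii("A", "utf-16-be"))
    # UTF-7: "+" is the shift character, so it cannot stand for itself.
    print(octets_read_as_ascii("+", "utf-7"))
    # ISO-2022-JP: after an escape sequence, 7-bit octets that look
    # like ASCII actually encode a Japanese character.
    print(octets_read_as_ascii("\u3042", "iso2022_jp"))
```

In the last case every octet is below 0x80, yet none of the
printable ones stands for the ASCII character it looks like.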

Some IBM codepages rotate SUB - DEL - FS; arguably that's no
longer ASCII.  IIRC an Adobe charset also had an oddity in the
range 0x00 up to 0x7F.

The simple cases are UTF-8, Latin-1 (plus some other Latin-*),
windows-1252 (plus some other windows-*), codepages 437, 850,
858 (plus a few others, ignoring the IBM rotation), and likely
some Mac charsets (not registered; better to ignore unregistered
charsets, they're hopeless moving targets).

After that it starts to get interesting...

Frank

Received on Monday, 21 May 2007 20:50:50 UTC