[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range] from Øistein E. Andersen on 2009-10-22 (public-whatwg-archive@w3.org from October 2009)

From: Øistein E. Andersen <liszt@coq.no>
Date: Thu, 22 Oct 2009 21:23:43 +0100
Message-ID: <CE58097B-5429-4264-AE84-4BAD459943E4@coq.no>

On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:

> First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets,

I am not sure what you mean; they are both listed at
<http://www.iana.org/assignments/character-sets>:

Name: JIS_C6226-1983                                     [RFC1345,KXS2]
MIBenum: 63
Source: ECMA registry
Alias: iso-ir-87
Alias: x0208
Alias: JIS_X0208-1983
Alias: csISO87JISX0208

Name: JIS_X0212-1990                                     [RFC1345,KXS2]
MIBenum: 98
Source: ECMA registry
Alias: x0212
Alias: iso-ir-159
Alias: csISO159JISX02121990

> moreover those correct names as spec are JIS X 0208 and JIS X 0212.

(The IANA registry is internally inconsistent and often disagrees with
official standards when it comes to capitalisation, dashes/hyphens,
underscores and spaces, so it is difficult to get this right. Please
excuse me for not always paying due attention to such details in
e-mails. Of course, the specifications should follow either IANA or
the official standard as appropriate, depending on what they are
referring to.)

> Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
> ASCII compatible. So they are out of discouraged; mustn't use.

EBCDIC is clearly not ASCII-compatible and may be unique amongst the  
character sets in the IANA registry in providing the full ASCII  
repertoire in a different arrangement.
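
To make the "different arrangement" concrete, here is a minimal sketch.
It assumes Python's built-in "cp037" codec as a stand-in for EBCDIC (no
particular EBCDIC variant is named above), and the helper name
ebcdic_summary is purely illustrative.

```python
# Assumption: Python's built-in "cp037" codec stands in for "EBCDIC" here;
# the message itself does not single out one EBCDIC variant.
ascii_repertoire = "".join(chr(c) for c in range(0x20, 0x7F))  # printable ASCII


def ebcdic_summary(text: str = ascii_repertoire, codec: str = "cp037"):
    """Encode text and count characters whose byte value differs from ASCII."""
    # encode() raises UnicodeEncodeError if a character is missing from the codec
    encoded = text.encode(codec)
    moved = sum(1 for ch, b in zip(text, encoded) if ord(ch) != b)
    return encoded, moved


encoded, moved = ebcdic_summary()
print(len(encoded), "characters encoded;", moved, "not at their ASCII byte value")
print("<p>".encode("cp037").hex(), "vs", "<p>".encode("ascii").hex())
```

Every printable ASCII character is representable, but a parser that
looks for the ASCII byte values of "<" and ">" will not find them.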

JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on
their own) do not contain basic ASCII characters at all, so it makes
little sense to use them for HTML documents without adding ASCII or
the ASCII-based JIS C 6220-1969, which would give something like
EUC-JP or ISO-2022-JP.  JIS_C6226-1983 contains wide versions of ASCII
characters, but those are not interpreted as HTML mark-up (unless I am
mistaken). JIS_X0212-1990 does not contain ASCII, kana or basic kanji,
so it is of extremely limited usefulness on its own even in a
plain-text setting.  Warning against completely useless encodings
seems pointless.
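
As a rough illustration of the difference between the combined
encodings and the wide characters, the following sketch assumes
Python's built-in "euc_jp" and "iso2022_jp" codecs; browsers may well
differ in detail, and the variable names are only illustrative.

```python
# Assumption: Python's built-in "euc_jp" and "iso2022_jp" codecs are used as
# stand-ins for the encodings named above; browser behaviour may differ.
MARKUP = "<p>"
FULLWIDTH_LT = "\uff1c"  # FULLWIDTH LESS-THAN SIGN, the "wide" counterpart of "<"

# In EUC-JP the ASCII half is untouched, so mark-up bytes are plain ASCII.
euc_markup = MARKUP.encode("euc_jp")

# The wide "<" comes from the JIS X 0208 part and is a separate character
# encoded as two bytes, so it is not the byte a parser treats as "<".
euc_wide_lt = FULLWIDTH_LT.encode("euc_jp")

# ISO-2022-JP starts in ASCII and switches character sets with escape sequences.
iso_bytes = "<p>\u6f22\u5b57</p>".encode("iso2022_jp")  # "<p>" + two kanji + "</p>"

print(euc_markup, euc_wide_lt.hex(), iso_bytes)
```

In both combined encodings the mark-up itself stays in ASCII, which is
what the bare JIS sets cannot offer.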

Many other encodings in the IANA registry are ASCII-incompatible in  
different ways; what I do not understand is what makes the ones  
currently mentioned in the HTML5 draft particularly harmful.

> Finally, Why ISO 2022 series is discouraged is not clear.

We agree on this point.

> Anyway, most of charsets defined RFC 1345 are not clear.
> Conversion table between [those charsets and] Unicode is needed.

Quite.  Anne van Kesteren, I and several others are currently trying  
to document how browsers handle different encodings at
<http://wiki.whatwg.org/wiki/Web_Encodings>, and defining mappings to  
Unicode is one of the goals.  Your contribution would be much  
appreciated.

-- 
Øistein E. Andersen

Received on Thursday, 22 October 2009 13:23:43 UTC