[Bug 11423] Character sets not registered with IANA

From: <bugzilla@jessica.w3.org>
Date: Sun, 28 Nov 2010 19:52:15 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1PMnIF-0007TX-0H@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11423

--- Comment #2 from brian m. carlson <sandals@crustytoothpaste.net> 2010-11-28 19:52:14 UTC ---
(In reply to comment #1)
> (In reply to comment #0)
> > HTML5 should not be encouraging
> > people to use a character set that the creator has not even bothered to
> > register with IANA.
> 
> It doesn't.

The spec says (emphasis mine):

"When a user agent would otherwise use an encoding given in the first column of
the following table to either convert content to Unicode characters or convert
Unicode characters to bytes, it *must* instead use the encoding given in the
cell in the second column of the same row. When a byte or sequence of bytes is
treated differently due to this encoding aliasing, it is said to have been
misinterpreted for compatibility."

EUC-KR and KS_C_5601-1987 are mapped onto windows-949.  I think a "must"
directive is definitely an encouragement, even if you don't.
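
To make the aliasing rule concrete, here is a minimal sketch of how a user
agent might apply it. The table holds only the two rows discussed in this
comment; the names COMPAT_OVERRIDES and effective_encoding are hypothetical
and do not come from the spec.

```python
# Hypothetical override table: only the two rows named in this comment.
# Keys are lowercased labels; values are the encoding the spec says to use instead.
COMPAT_OVERRIDES = {
    "euc-kr": "windows-949",
    "ks_c_5601-1987": "windows-949",
}


def effective_encoding(label: str) -> str:
    """Return the encoding a user agent would actually use for `label`.

    Labels listed in the override table are replaced; everything else is
    returned unchanged (lowercased for comparison purposes only).
    """
    key = label.strip().lower()
    return COMPAT_OVERRIDES.get(key, key)


if __name__ == "__main__":
    for label in ("EUC-KR", "KS_C_5601-1987", "UTF-8"):
        print(label, "->", effective_encoding(label))
```

Under this sketch a page labeled EUC-KR is never decoded as EUC-KR, which is
why I read the "must" as encouraging windows-949.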

> > It's not like registering a character set with IANA is a particularly difficult or drawn-out process ...
> 
> And yet Microsoft's attempt to do so (back in 2005) seems to have failed:
> 
> http://mail.apps.ietf.org/ietf/charsets/msg01510.html

Probably because, as the responses indicate, the specifications for those
character sets were insufficient and contradictory.  It doesn't matter what
exactly the reason is; it's not registered.  HP, IBM, and Adobe have managed to
do it, so I'm sure that it's not impossible or unreasonably difficult.

> It's trivial to comply with this, since "preferred MIME name" is defined by the
> spec as "the name or alias labeled as 'preferred MIME name' in the IANA
> Character Sets registry, if there is one, or the encoding's name, if none of
> the aliases are so labeled". The name of windows-949 is "windows-949".

I believe "if there is one" means "if there is a name or alias labeled as
'preferred MIME name'", not "if there is an entry in the IANA Character Sets
registry".  Even if we were to use your suggested interpretation, there are
other names for this character set, such as "CP949".  How are we to know what
the preferred name is if it's not IANA-registered?
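
The reading I am arguing for can be sketched as a lookup. This is only an
illustration: REGISTRY is a toy stand-in for the IANA Character Sets registry
with two illustrative rows, and preferred_mime_name is a hypothetical helper,
not anything defined by the spec.

```python
from typing import Optional

# Toy stand-in for the IANA Character Sets registry (illustrative rows only).
# "preferred" is the alias labeled "preferred MIME name", or None if no alias
# in the entry carries that label.
REGISTRY = {
    "ANSI_X3.4-1968": {"aliases": ["US-ASCII", "ASCII"], "preferred": "US-ASCII"},
    "KS_C_5601-1987": {"aliases": ["KSC_5601", "korean"], "preferred": None},
}


def preferred_mime_name(name: str) -> Optional[str]:
    """Resolve the preferred MIME name for a registered encoding.

    Returns the labeled alias if there is one, else the registered name.
    Returns None when the encoding has no registry entry at all, because the
    registry then gives no basis for choosing among competing names.
    """
    entry = REGISTRY.get(name)
    if entry is None:
        return None
    return entry["preferred"] or name


if __name__ == "__main__":
    for name in ("ANSI_X3.4-1968", "KS_C_5601-1987", "windows-949"):
        print(name, "->", preferred_mime_name(name))
```

For an unregistered name the lookup has nothing to return, so choosing
between "windows-949" and "CP949" would have to rest on something other than
the registry.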

> "User agents must at a minimum support the UTF-8 and Windows-1252 encodings,
> but may support more."

Right, but if they support EUC-KR or KS_C_5601-1987, they are effectively
required to support windows-949 as well.  (Actually, the spec seems to prohibit
a useful implementation of EUC-KR, since it mandates that user agents use
something else instead.)  If it's acceptable to support EUC-KR and not
windows-949, then the spec should say so.

> > I must therefore object to suggesting or encouraging the use of windows-949
> > until it has been registered appropriately with IANA.
> 
> Maybe try registering it? Perhaps you'll have better luck than Microsoft.

I'm really not interested in registering what amount to platform-specific
character sets.  Plus, since I don't use that platform, I have no knowledge
about what the mapping should look like or whether it is correct.  Finally,
there are numerous character sets in existence that handle Korean just fine,
including UTF-8, and I don't see the need to add more.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Sunday, 28 November 2010 19:52:18 GMT
