Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

From: NARUSE, Yui <naruse@airemix.jp>
Date: Sun, 4 Aug 2013 02:19:22 +0900
Message-ID: <CAK6HhspQBunmwin=0vz+OdFGCZKeXyL_TZYgb6vWqXeu1PVvzA@mail.gmail.com>
To: Ian Hickson <ian@hixie.ch>
Cc: whatwg <whatwg@lists.whatwg.org>, Martin Janecke <whatwg.org@prlbr.com>
2013/8/1 Ian Hickson <ian@hixie.ch>:
> On Thu, 1 Aug 2013, Martin Janecke wrote:
>>
>> I don't see any sense in making a document that is declared as
>> ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the
>> ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII.
>> Should an US-ASCII declaration also be non-conforming then -- even if
>> the document only contains bytes from the US-ASCII range? What's the
>> benefit?
>>
>> I assume this is supposed to be helpful in some way, but to me it just
>> seems wrong and confusing.
>
> If you avoid the bytes that are different in ISO-8859-1 and Win1252, the
> spec now allows you to use either label. (As well as "cp1252", "cp819",
> "ibm819", "l1", "latin1", "x-cp1252", etc.)
>
> The part that I find problematic is that if you use byte 0x85 from
> Windows 1252 (U+2026 "…" HORIZONTAL ELLIPSIS), and then label the document
> as "ansi_x3.4-1968", "ascii", "iso-8859-1", "iso-ir-100", "iso8859-1",
> "iso_8859-1:1987", "us-ascii", or a number of other options, it'll still
> be valid, and it'll work exactly as if you'd labeled it "windows-1252".
> This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not
> map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII
> (since ASCII is a 7 bit encoding).

The ISO-8859-1 vs. Windows-1252 issue sounds like a minor one, because 0x85
in ISO-8859-1 is U+0085 Next Line, a control character.
As far as I know, 0x85/U+0085 is used only on some IBM systems.
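
To make the 0x85 case concrete, here is a minimal Python sketch that decodes that single byte under the three labels being discussed. It relies on Python's built-in codecs ("cp1252", "iso-8859-1", "ascii"), which decode each encoding strictly rather than aliasing the labels to windows-1252 as browsers do; the helper name decode_or_none is made up for this illustration.

```python
def decode_or_none(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper for this illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


BYTE = b"\x85"

# Windows-1252 assigns 0x85 to U+2026 HORIZONTAL ELLIPSIS.
as_win1252 = decode_or_none(BYTE, "cp1252")

# ISO-8859-1 maps every byte to the code point of the same value,
# so 0x85 becomes U+0085 (Next Line, a C1 control character).
as_latin1 = decode_or_none(BYTE, "iso-8859-1")

# ASCII is a 7-bit encoding, so 0x85 is not decodable at all.
as_ascii = decode_or_none(BYTE, "ascii")

for label, text in [("cp1252", as_win1252),
                    ("iso-8859-1", as_latin1),
                    ("ascii", as_ascii)]:
    shown = "undefined" if text is None else f"U+{ord(text):04X}"
    print(f"{label}: {shown}")
```

A browser following the Encoding Standard would treat all three labels as windows-1252, so the strict results above are only what the encodings themselves define.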

For Japanese encodings, there's the Shift_JIS vs. Windows-31J issue, which
has annoyed people for a long time.
Windows-31J has many additional characters that aren't included in Shift_JIS,
and many Unicode mappings that differ from Shift_JIS.
But many existing Web pages specify "Shift_JIS" and use characters
found only in Windows-31J.
Therefore, if people want to label a document as true Shift_JIS,
there's no way to do so in the existing framework.
It would need a new mechanism, for example a new meta specifier like <META
i-want-to-truly-specify-charset-as="Shift_JIS">,
so that browsers recognize the document's encoding as true Shift_JIS.

But such people should use UTF-8 instead of introducing a new mechanism like that.
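
The two kinds of Shift_JIS vs. Windows-31J difference mentioned above can be sketched in Python as follows. This assumes Python's "shift_jis" codec as a stand-in for strict Shift_JIS and its "cp932" codec as a stand-in for Windows-31J; the byte sequences are two examples chosen for illustration, and decode_or_none is a hypothetical helper.

```python
def decode_or_none(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper for this illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Case 1: same bytes, different Unicode mapping.
WAVE = b"\x81\x60"
wave_sjis = decode_or_none(WAVE, "shift_jis")   # U+301C WAVE DASH
wave_cp932 = decode_or_none(WAVE, "cp932")      # U+FF5E FULLWIDTH TILDE

# Case 2: a character that exists only in Windows-31J
# (CIRCLED DIGIT ONE from the NEC extension area).
CIRCLED_ONE = b"\x87\x40"
one_sjis = decode_or_none(CIRCLED_ONE, "shift_jis")  # not decodable
one_cp932 = decode_or_none(CIRCLED_ONE, "cp932")     # U+2460

for name, text in [("wave/shift_jis", wave_sjis),
                   ("wave/cp932", wave_cp932),
                   ("one/shift_jis", one_sjis),
                   ("one/cp932", one_cp932)]:
    shown = "undefined" if text is None else f"U+{ord(text):04X}"
    print(f"{name}: {shown}")

# Encoding as UTF-8 sidesteps the label ambiguity entirely.
utf8_bytes = "\u2460".encode("utf-8")
```

A page labeled "Shift_JIS" that contains the second byte sequence only displays correctly if the browser quietly decodes it as Windows-31J, which is the situation described above.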

-- 
NARUSE, Yui  <naruse@airemix.jp>
Received on Saturday, 3 August 2013 17:20:28 UTC
