- From: John C Klensin <john+w3c@jck.com>
- Date: Sun, 31 Aug 2014 14:38:31 -0400
- To: Anne van Kesteren <annevk@annevk.nl>
- cc: Larry Masinter <masinter@adobe.com>, Richard Ishida <ishida@w3.org>, "Phillips, Addison" <addison@lab126.com>, www-international@w3.org
--On Sunday, 31 August, 2014 20:04 +0200 Anne van Kesteren <annevk@annevk.nl> wrote:

> On Thu, Aug 28, 2014 at 8:08 PM, John C Klensin
> <john+w3c@jck.com> wrote:
>> Where we seem to be today is that there are a lot of charset
>> labels in the IANA Charset Registry. Some of them are
>> irrelevant to web browsers (and, depending on how one defines
>> it, to the web generally). Others are used in web browsers
>> but with exactly the same definitions as appear in the IANA
>> Registry. And a few are used --widely so-- in web browsers
>> but with different definitions. At the same time, there are
>> other applications (and probably some legacy web ones) that
>> use the labels in the last category but strictly follow the
>> IANA Registry definitions.
>
> Actually, quite a lot have different definitions when you get
> down to the details, because the specifications for the
> encodings IANA points to are often not implemented in the
> prescribed manner (or lack essential details, such as handling
> of errors).

To the extent that is true, we get back to a variation on Larry's point about making changes where they belong: updating the IANA registry by specifying the essential details that have been omitted, encouraging updates to the relevant entries with those details, and then marking everything left over as "dangerous" or "suspect" until it is updated would seem to me entirely reasonable. I think the IETF community would require a persuasive case about those deviant implementations, with evidence that they really result from definitional gaps or misunderstandings, but that case should be easy to make if it is true. Things like using a Windows code page in place of ASCII (the "us-ascii" charset registration) don't fit those criteria -- that is simply a decision to deviate, whether for good reasons or not.

It also leads to a question I consider legitimate: why should we expect conformance to the Encoding spec to be significantly better than conformance to the IANA Charset Registry definitions? It is not clear to me that either is less ambiguous than the other.

>> The one solace here and the one I hope all involved can agree
>> on (or have already) is that, with the exception of writing
>> systems whose scripts have not yet been encoded in Unicode,
>> everyone ought to be moving away from historical encodings
>> and toward UTF-8 as soon as possible. That is the real
>> solution to the problem of different definitions and the
>> issues they can cause: just move forward to Standard UTF-8 to
>> get away from them and consider the present mess as added
>> incentive.
>
> Writing systems that cannot be done in Unicode cannot be done
> on the web. There's no infrastructure in place for such
> systems. (Apart from PUA font hacks.)

I agree. I believe that folks who need writing systems not yet supported by Unicode should sort that out with Unicode. Patience may be hard, but the future interoperability and compatibility problems of working around Unicode are likely to be much worse. I made the observation because there have been a number of comments on this list and elsewhere that PUA font hacks, squatting on unassigned code points, and use of private-use code points, all identified as "UTF-8", are common practice.

best,
    john
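
To make the "us-ascii" point concrete: under the WHATWG Encoding Standard's label table, the label "us-ascii" resolves to windows-1252, and a browser's TextDecoder reports exactly that, so bytes above 0x7F decode to Windows-1252 characters rather than being treated as errors. A minimal sketch, assuming a runtime that implements the Encoding Standard (TypeScript; any modern browser or Node.js 18+):

    // The Encoding Standard maps the label "us-ascii" to windows-1252.
    const decoder = new TextDecoder("us-ascii");
    console.log(decoder.encoding);
    // -> "windows-1252"

    // Bytes outside 7-bit ASCII decode as Windows-1252 characters,
    // not as errors: 0x93/0x94 become U+201C/U+201D (curly quotes).
    console.log(decoder.decode(new Uint8Array([0x93, 0x94])));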
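
On the last paragraph: a document that uses private-use code points can still be byte-for-byte valid UTF-8, which is why labelling it "UTF-8" is technically accurate and why the practice is hard to detect mechanically. A rough sketch of such a check, assuming only the Unicode Private Use Area ranges (the function name is illustrative, not from any spec):

    // True if the text uses any Private Use Area code point.
    function containsPUA(text: string): boolean {
      for (const ch of text) {
        const cp = ch.codePointAt(0)!;
        if (
          (cp >= 0xe000 && cp <= 0xf8ff) ||       // BMP Private Use Area
          (cp >= 0xf0000 && cp <= 0xffffd) ||     // Supplementary PUA-A
          (cp >= 0x100000 && cp <= 0x10fffd)      // Supplementary PUA-B
        ) {
          return true;
        }
      }
      return false;
    }

    console.log(containsPUA("\u{E000}"));     // true
    console.log(containsPUA("plain ASCII"));  // false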
Received on Sunday, 31 August 2014 18:38:59 UTC