- From: John C Klensin <john+w3c@jck.com>
- Date: Thu, 28 Aug 2014 18:59:04 -0400
- To: Larry Masinter <masinter@adobe.com>, Richard Ishida <ishida@w3.org>, "Phillips, Addison" <addison@lab126.com>
- cc: www-international@w3.org
--On Thursday, August 28, 2014 20:20 +0000 Larry Masinter <masinter@adobe.com> wrote:

>> I predict (as I'm sure you would) that any attempt in the
>> IETF to either deprecate the Registry or incompatibly
>> revise/update particular definitions would meet with a great
>> deal of resistance, based in part on existing use in
>> applications that are not web browsers.
>
> I'm sure there would be some resistance, but there's
> resistance to everything. Which applications don't want to be
> compatible with the web? I think it's worth a try, to do the
> right thing.

Assuming your compatibility question is not just rhetorical: any well-established application whose use, in practice, depends on the IANA Charset Registry definitions, including registry entries that the Encoding spec essentially bans entirely. Taking the application used to process and transmit these messages as an example, it is also noteworthy that the number of web browsers, or even web servers, in use is fairly small. By contrast, the number of SMTP clients and servers, including independently developed submission clients built into embedded devices, is huge, and the number of mail user agents is even larger. So an instruction from the IETF (or W3C or some other entity) to those email systems to abandon the IANA Registry's definitions in favor of some other norm would, pragmatically, be likely to make things worse rather than better, creating a number of variations on the theme I think Andrew Cunningham is concerned about, i.e., even more systems that use a given charset label but interpret it in different ways.

>> I would expect much the same response if we somehow told the
>> browser community that the IANA definitions were around long
>> before their current generation of work and products, are
>> well-established on the Internet, and that they should mend
>> their ways even if it caused some existing pages to stop
>> working.
>
> This document is part of the mending.
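[As a concrete illustration of one label meaning two different things: the WHATWG Encoding spec maps the label "iso-8859-1" to windows-1252, whereas the IANA registration keeps the ISO definition, in which bytes 0x80-0x9F are C1 controls. A minimal Python sketch, with illustrative byte values:]

```python
data = b"price: \x80100"

# Per the IANA/ISO definition of iso-8859-1, byte 0x80 is the
# C1 control character U+0080.
iana_view = data.decode("iso-8859-1")

# The Encoding spec treats the label "iso-8859-1" as
# windows-1252, in which 0x80 is the euro sign U+20AC.
whatwg_view = data.decode("windows-1252")

# Same bytes, same label, two different texts.
assert iana_view != whatwg_view
```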
It is interesting, and perhaps illustrative of the issues here, that you read my remark that way. From my perspective as someone who was involved with email definitions long before there was a web, and who first got tangled up with the problems of transmitting CCS and encoding information out of band back when Kermit was the state of the art in textual data interchange with multiple character sets, when I said "mend their ways", I meant "stop this nonsense and, if you use a label that appears in the IANA Charset Registry, use it to describe _exactly_ what is defined there and, if you don't like that, define your own label and put it in the Registry". To a considerable extent, that means I see the Encoding document as institutionalizing the problem. I dislike that, but all of the alternatives seem to be worse at the moment.

>> I don't like the solution of saying what amounts to "if you
>> are a web browser using HTML5, you should, for compatibility
>> with others, use these definitions and not the IANA ones".
>> But, given that neither community is likely to agree to
>> change its ways, it may be the least bad alternative.
>
> I'm not sure the communities are separate. There's one
> Internet and text flows readily between web and non-web.
> Sure there are people who subscribe to one list or another.

Sure. But that, and scale measured in numbers of deployed independent implementations and the difficulties associated with changing them, would seem to argue strongly for at least mostly changing the web browsers to conform to what is in the IANA registry (possibly there are Registry entries that might need tuning too -- the IETF Charset procedures don't allow that at present but, as you point out, they could, at least in principle, be changed) rather than trying to retune the Internet to match what a handful of browser vendors are doing.

>> .... Might "more historical information and discussion of
>> use by non-web applications" be useful in that regard?
>> I tend to agree with you that it would, but I gather there
>> is some resistance to making it part of the encoding
>> document.
>
> Sometimes you have to do more work than you want to, in
> order to make things right. But I'm not sure it's really all
> that much. Maybe all that's needed is a pointer from the
> IANA registry to this document and vice versa, telling
> readers to be aware of the other, and encouraging new
> applications to use utf-8.

As I said to Andrew Cunningham, note that, when you say "use utf-8", you are almost certainly talking about using UTF-8 encoding with Standard Unicode code point assignments (or following what the IANA Registry presumably says, as you prefer). Given that, and speaking personally rather than predicting IETF reactions, I would see no problem at all with annotating the IANA Registry entries for a few Charsets with comments that an alternate interpretation has been seen in the wild, that those using that Charset should consequently use caution, and, ideally, describing what the deviations are. That wouldn't do much for the pseudo-Unicode posing as UTF-8 situation that Andrew describes, but it would probably work reasonably well for, e.g., the "sometimes 'us-ascii' is really Windows-1252" problem. If you and others thought it worthwhile to see if we can figure out an appropriate IETF mechanism to create those annotations, I'd be happy to collaborate. Notes about reality, however unfortunate that reality is, should always be welcome. It would, however, probably not be worth the effort if all the current Encoding spec has to say on the subject is equivalent to "don't pay any attention to whatever the IANA Charset Registry says" (or worse).

john
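[The "us-ascii is really Windows-1252" mismatch mentioned above is easy to observe: content labeled us-ascii that contains Windows-1252 "smart quotes" is not valid under the IANA definition of the charset, but decodes cleanly as windows-1252. A minimal Python sketch; the sample bytes are illustrative:]

```python
# Bytes from a hypothetical message labeled charset="us-ascii"
# that actually contain Windows-1252 curly quotes (0x93/0x94).
data = b"he said \x93hello\x94"

try:
    # The IANA definition of us-ascii is strictly 7-bit, so
    # these bytes are simply invalid under that label.
    text = data.decode("us-ascii")
except UnicodeDecodeError:
    # What the sender really meant: 0x93/0x94 are the
    # left/right double quotation marks in windows-1252.
    text = data.decode("windows-1252")
```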
Received on Thursday, 28 August 2014 22:59:32 UTC