RE: [Encoding] false statement [I18N-ACTION-328][I18N-ISSUE-374] from John C Klensin on 2014-08-28 (www-international@w3.org from July to September 2014)

From: John C Klensin <john+w3c@jck.com>
Date: Thu, 28 Aug 2014 14:08:43 -0400
To: Larry Masinter <masinter@adobe.com>, Richard Ishida <ishida@w3.org>, "Phillips, Addison" <addison@lab126.com>
cc: www-international@w3.org
Message-ID: <A9D26AB55A1CC0CA5E17D413@JcK-HP8200.jck.com>
Hi Larry,

I detest the current situation, especially because of the exact
point you make in your initial paragraph: needing to know the
exact context and time in which a label is used to know what it
actually means serves no one well.  

The problem, as I understand it, is that some folks in the web
browser community found it in their interest to apply their own
definitions and extensions to IANA-registered labels.  IMO, that
was an absolutely terrible idea from a global interoperability
standpoint (and several others), but complaining about or
lamenting it now will accomplish very little.  I'm just
guessing, but I'd assume that once page authors started relying
on the variant interpretations, they ran to other vendors and
said "browser X is doing this, why aren't you supporting it too"
and because customer base is often more important than
Standards, the others at least mostly went along.  That turns
"standards violation, bad idea, and bad practice" into
"established existing practice".  It isn't the only sequence of
that sort to have moved through the browser community.  I am
concerned about the implications of almost all of them, but my
concern, or yours, bears the usual relationship to the price of
a cup of coffee.

Where we seem to be today is that there are a lot of charset
labels in the IANA Charset Registry.  Some of them are
irrelevant to web browsers (and, depending on how one defines
it, to the web generally).  Others are used in web browsers but
with exactly the same definitions as appear in the IANA
Registry.  And a few are used --widely so-- in web browsers but
with different definitions.  At the same time, there are other
applications (and probably some legacy web ones) that use the
labels in the last category but strictly follow the IANA
Registry definitions.
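
As a concrete illustration of that last category (an example not
named in this message): web browsers decode the label "iso-8859-1"
as windows-1252, whereas the IANA Registry defines it as ISO-8859-1
proper.  A minimal Python sketch follows; the two lookup tables and
the decode_with helper are hypothetical names invented for this
illustration, not part of either specification.

```python
# Illustrative only: these one-entry tables are assumptions written for
# this sketch, not copies of the IANA Registry or of any browser's table.
IANA_STYLE = {"iso-8859-1": "latin-1"}    # Python's latin-1 codec is ISO-8859-1
BROWSER_STYLE = {"iso-8859-1": "cp1252"}  # browsers read the label as windows-1252


def decode_with(table, label, data):
    """Decode data with whichever codec the given table assigns to label."""
    return data.decode(table[label.strip().lower()])


# One byte whose meaning differs between the two definitions.
payload = b"\x80"

strict = decode_with(IANA_STYLE, "ISO-8859-1", payload)
browser = decode_with(BROWSER_STYLE, "ISO-8859-1", payload)

print(repr(strict))   # a C1 control character, U+0080
print(repr(browser))  # the euro sign, U+20AC
```

The same label and the same byte yield two different characters,
which is the interoperability hazard described above.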

That is a problem.  I think it is a pretty offensive one.  But
objecting to it will get us nowhere.  I predict (as I'm sure
you would) that any attempt in the IETF to either deprecate the
Registry or incompatibly revise/update particular definitions
would meet with a great deal of resistance, based in part on
existing use in applications that are not web browsers.  I would
expect much the same response if we somehow told the browser
community that the IANA definitions were around long before
their current generation of work and products, are
well-established on the Internet, and that they should mend
their ways even if it caused some existing pages to stop working.

I don't like the solution of saying what amounts to "if you are
a web browser using HTML5, you should, for compatibility with
others, use these definitions and not the IANA ones".  But,
given that neither community is likely to agree to change its
ways, it may be the least bad alternative.  If it is, there is
still a question of how best to state the above to avoid
sounding like a "pox on your house; no, a pox on yours" style of
debate.   Might "more historical information and discussion of
use by non-web applications" be useful in that regard?  I tend
to agree with you that it would, but I gather there is some
resistance to making it part of the encoding document.

The one solace here and the one I hope all involved can agree on
(or have already) is that, with the exception of writing systems
whose scripts have not yet been encoded in Unicode, everyone
ought to be moving away from historical encodings and toward
UTF-8 as soon as possible.  That is the real solution to the
problem of different definitions and the issues they can cause:
just move forward to Standard UTF-8 to get away from them and
consider the present mess as added incentive.

I wish there were a better solution, but I don't have one.  If
you do, please suggest it.

There are, of course, lessons about the risks and disadvantages
in this that we should all remember for other areas and future
cases.

All just my opinion, of course.

   john


--On Thursday, August 28, 2014 15:27 +0000 Larry Masinter
<masinter@adobe.com> wrote:

> It isn't to anyone's benefit that there are two conflicting
> sources of info about character encodings.
> 
> I think if the IANA Character Sets registry is obsolete, the
> right thing is to write an Internet Draft saying it's
> obsolete, and pointing people to this document instead.
> 
> If you get objections from folks in the IETF, then address
> those objections; for example, by including more historical
> information and discussion of use by non-web applications.
> 
> So no, I don't find the resolution satisfactory. I'm willing
> to help push through such a document in the IETF but would
> like some help.
> 
> Larry
> --
> http://larry.masinter.net
>
Received on Thursday, 28 August 2014 18:09:14 UTC