- From: John C Klensin <john+w3c@jck.com>
- Date: Thu, 28 Aug 2014 18:07:28 -0400
- To: Andrew Cunningham <lang.support@gmail.com>
- cc: wwwintl <www-international@w3.org>, Larry Masinter <masinter@adobe.com>, "Phillips, Addison" <addison@lab126.com>, Richard Ishida <ishida@w3.org>
--On Friday, August 29, 2014 05:18 +1000 Andrew Cunningham <lang.support@gmail.com> wrote:

> On 29/08/2014 4:10 AM, "John C Klensin" <john+w3c@jck.com>
> wrote:
>
>> The one solace here and the one I hope all involved can agree
>> on (or have already) is that, with the exception of writing
>> systems whose scripts have not yet been encoded in Unicode,
>> everyone ought to be moving away from historical encodings
>> and toward UTF-8 as soon as possible. That is the real
>> solution to the problem of different definitions and the
>> issues they can cause: just move forward to Standard UTF-8 to
>> get away from them and consider the present mess as added
>> incentive.
>
> Unfortunately, that ship has already sailed. UTF-8 already
> suffers from the same problem. The term some of us use for it
> is pseudo-Unicode.

I have been somewhat aware of the problem; it is why I said "Standard UTF-8". Obviously (at least from my perspective), for the cases you are talking about, a migration path is needed for pseudo-Unicode too.

> For some languages, a sizeable amount of content is in this
> category.

First of all, just to clarify for those reading this (including, to some extent, me), does that "pseudo-Unicode" differ from

-- the Standard UTF-8 encoding, i.e., what a standard-conforming UTF-8 encoder would produce given a list of code points (assigned or not), or

-- the Unicode code point assignments, i.e., it uses private code space and/or "squats" on unassigned code points, perhaps in so-far completely unused or sparsely populated planes, or

-- established Unicode conventions, i.e., it combines existing and standardized Unicode points using conventions (and perhaps special font support) about how those sequences are interpreted that are not part of the Unicode Standard?

It seems to me that the implications of those three types of deviations (to use the most neutral term I can think of) are quite different. In particular, if the practices are common enough, it is possible to imagine workarounds, including transitional and backward-compatibility models such as careful application of specialized canonical decomposition and/or compatibility relationships, that would smooth over a lot of the interoperability problems (a minimal sketch of one such mapping appears below). If I were part of UTC, I might find some of those mechanisms a lot more (or less) attractive than others, but anyone willing to adopt unstandardized (and I assume inconsistent) pseudo-Unicode models is probably going to be willing to adopt equally unstandardized normalization or other mapping methods if necessary.

It may not be your intent, but my inference from your note is that you don't expect those who are using their own pseudo-Unicode mechanisms today to ever be interested in converting to something more standard when more appropriate code points are assigned. If we can extrapolate from history, that just is not correct: while I expect that we will have to deal with, e.g., ASCII and ISO 8859-1 web pages as legacy / historical issues on the web for many years to come, the reality is that most of the community is migrating off of them, especially for new materials, just as it/we migrated a generation ago from ISO 646 national character positions and then from ISO 2022 designation of various code pages. The migration is obviously not complete: I hope the web is in better shape, but I've received multiple email messages coded in what the IANA Charset Registry calls ISO-2022-JP and GB2312 within the last couple of hours.
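To make that concrete: below is a minimal, purely hypothetical sketch (in Python, with every code point and name invented for illustration) of the kind of transitional mapping mentioned above, for the second sort of deviation -- private-use or squatted code points carried in otherwise Standard UTF-8.

    # Hypothetical transitional mapping: pseudo-Unicode code points of the
    # second kind above (private-use points a community adopted before
    # standard assignments existed) rewritten to the standard code points.
    # All values here are invented for illustration.
    PSEUDO_TO_STANDARD = {
        0xE000: 0x1000,  # hypothetical pre-assignment PUA use
        0xE001: 0x1001,  # hypothetical
    }

    def migrate_pseudo_unicode(text: str) -> str:
        """Map invented pseudo-Unicode code points to standard equivalents."""
        return text.translate(PSEUDO_TO_STANDARD)

    # The bytes below are perfectly valid Standard UTF-8; the deviation is
    # in the code point usage, not in the encoding form.
    raw = b"\xee\x80\x80\xee\x80\x81"            # U+E000 U+E001
    migrated = migrate_pseudo_unicode(raw.decode("utf-8"))
    print([hex(ord(c)) for c in migrated])       # ['0x1000', '0x1001']

A mapping for the third sort of deviation (standard code points combined under non-standard conventions) would have to rewrite sequences rather than single code points, which is roughly where the specialized decomposition and compatibility machinery mentioned above would come in.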
But, if we can't set a target of Standard-conforming Unicode encoded in Standard-conforming UTF-8 and start figuring out how to label, transform, and/or develop compatibility mechanisms with interim solutions (viewed as such) as needed, then we will find ourselves back in the state we were in before either "charset" labels or ISO 2022 designations using standardized designators. I'm old enough to have lived through that period -- it was not a lot of fun, there were always ambiguities about what one was looking at, and, well, it just is not sustainable except in very homogeneous communities who don't care about communicating with others. The _huge_ advantage of Standard Unicode and Standard UTF-8 in that regard is that it provides a single interoperability target, not the chaos of needing N-squared solutions.

> To add to the problem some handset (mobile/cell phone) and
> tablet manufacturers have baked in pseudo-Unicode for specific
> languages.

And so? When, in the next generation or the one after that, they discover that the economics of international marketing and manufacturing create a strong bias toward common platforms, those same economics will drive them toward global standards. The conversions and backward-compatibility issues, whether sorted out on the newer phones or, more likely, on servers that deal with the older ones, will probably be painful, especially so if they have identified what they are doing simply as "utf-8", but the conversions will occur, or those who are sticking to the pseudo-Unicode conventions will find themselves at a competitive disadvantage, eventually one severe enough to focus the attention.

From the perspective of the reasons the IANA charset registry was created in the first place (and, yes, I'm among the guilty parties), one of its advantages is that, if one finds oneself using pseudo-Unicode in Lower Slobbovia, one could fairly easily register and use "utf-8-LowerSlobbovian" and at least tell others (and oneself in the future) what character coding and encoding form was being used. Borrowing a note or two from Larry, if I were involved in the pseudo-Unicode community, I'd be pushing for IETF action to define a naming convention and subcategory of charsets so that, e.g., someone receiving "utf-8-LowerSlobbovian" could know that what they were getting was basically UTF-8 and Unicode except in specific areas or code point ranges (a sketch of how a receiver might act on such a label follows at the end of this message). But maybe this generation of systems needs to experience the chaos of unidentified character repertoires and encodings before that will be considered practical.

> As the expression goes, that ship has already sailed.

If the ship that has sailed is taking us into the sea in which moderately precise identification of repertoires, coding, and encoding forms is impossible, I suggest that it will take on water rapidly enough that we'd better start thinking about either salvage operations or alternate ships. As the other expression goes, been there, done that, didn't enjoy it much.

   john
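To make the labeling idea concrete, here is a minimal sketch of how a receiver might act on a hypothetical "utf-8-<variant>" charset name. The variant name, the code point ranges, and the lookup table are invented for illustration; no such charsets are registered today.

    # Hypothetical handling of a "utf-8-<variant>" label: the suffix tells
    # the receiver that the bytes decode as Standard UTF-8 while naming the
    # code point ranges whose interpretation follows a local convention.
    HYPOTHETICAL_VARIANTS = {
        # variant suffix -> ranges carrying locally defined semantics
        "lowerslobbovian": [(0xE000, 0xE0FF)],
    }

    def decode_labeled(payload: bytes, charset: str):
        """Decode bytes labeled "utf-8" or a hypothetical "utf-8-<variant>".

        Returns the decoded text plus the code point ranges the receiver
        should treat as locally defined rather than as standard Unicode.
        """
        name = charset.lower()
        if name == "utf-8":
            return payload.decode("utf-8"), []
        if name.startswith("utf-8-"):
            variant = name[len("utf-8-"):]
            ranges = HYPOTHETICAL_VARIANTS.get(variant, [])
            # Still plain UTF-8 at the byte level; only the semantics differ.
            return payload.decode("utf-8"), ranges
        raise LookupError("unsupported charset: " + charset)

    text, special = decode_labeled(b"\xee\x80\x81", "utf-8-LowerSlobbovian")
    print(hex(ord(text)), special)   # 0xe001 [(57344, 57599)]

The point of such a label is only that software can fall back to ordinary UTF-8 handling while knowing exactly which ranges deserve extra care.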
Received on Thursday, 28 August 2014 22:07:59 UTC