Re: [Encoding] false statement [I18N-ACTION-328][I18N-ISSUE-374]

--On Tuesday, 02 September, 2014 13:01 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> On 2014/09/01 08:31, John C Klensin wrote:
>> Andrew (and, by the way, John Cowan),
> 
>> I think we have some historically-established ways of doing
>> that which we know how to handle.  I'd hate to see us go back
>> to 2022 and expand that registry but I can also imagine its
>> being an interesting (and non-conflicting) solution while
>> waiting for Unicode and, if ISO/IEC JTC1/SC2 isn't willing to
>> maintain and update that registry, I can imagine several
>> entities who could take over.
> 
> The registry is 'alive' at http://itscj.ipsj.or.jp/ISO-IR/.
> The last addition dates from 2004
> (http://itscj.ipsj.or.jp/ISO-IR/234.pdf).

Where I would have expected it to be.  My concern, which I
didn't express optimally, was only that it is hard to be certain
how a process would work that had not been exercised in ten
years and that is generally considered obsolescent. 

>> If the Unicode Consortium understands and is
>> convinced that this has become a serious problem,
> 
> This has been a well-known problem throughout the development
> of Unicode. There always was "we haven't done X yet" or "we
> don't cover Y yet". The problem is becoming less serious in
> the sense that the overall number of people affected is
> decreasing. The problem is becoming more serious in the sense
> that the easy (read well-documented) targets have been done.

And harder and more serious in the sense that some of us believe
that preserving the languages and writing systems of minority
populations deserves special consideration while, e.g., the
populations who use mostly-undecorated Latin or Han scripts have
a long track record of being able to defend themselves vocally
and economically.   The observation that those "targets" are
also less well documented may be partially an effect of the same
reduced economic and political leverage.

>> perhaps they
>> could start conditionally reserving some blocks for
>> as-yet-uncoded scripts so at least there could be unambiguous
migration paths, perhaps via a new subspecies of compatibility
>> mappings or providing surrogate-like escapes to other code
>> points that would parallel the 2022 system.
> 
> There's the private use area(s) for experiments. It's big
> enough that escapes are not necessary at all (thank goodness).
> But as somebody else has already written, you are strictly
> on your own.

And "strictly on your own" makes it a non-solution for anyone
who is concerned about overlapping allocations (private or
otherwise) with the same label/identifiers.
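To make the overlap concrete, here is a small illustrative Python
sketch (the ranges come from the Unicode Standard; the scenario is
hypothetical):

```python
import unicodedata

# Unicode private use ranges (per the Unicode Standard):
#   BMP PUA:          U+E000..U+F8FF
#   Plane 15 (PUA-A): U+F0000..U+FFFFD
#   Plane 16 (PUA-B): U+100000..U+10FFFD
ch = "\uE000"  # first BMP private-use code point
print(unicodedata.category(ch))  # 'Co' -- Other, Private Use

# Two independent projects can both assign U+E000 to different
# characters for different scripts; nothing in the data stream or
# in the character database distinguishes them.  That is the
# "strictly on your own" problem: identical code points, identical
# labels, incompatible meanings.
```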

>> If the official Unicode Consortium position were really
>> "people should just wait to use their languages until we get
>> around to assigning code points and we reserve the right to
>> take as many years as we like" and the official WHATWG (much
>> less W3C) position were really "if your language and script
>> don't have officially assigned Unicode code points, you don't
>> get to be on the web" then it is probably time for the
>> broader community to do something about those groups.
>> Fortunately I haven't heard anyone who can reasonably claim
>> to speak for any of those bodies say anything like that.  If
>> you have, references would be welcome.
> 
> I'd guess the Unicode position is something like "If you want
> to use Unicode, then you should wait until we manage to assign
> code points. We'll try hard but it will take time." Whether
> you think that this position is the same or different from the
> above depends on whether you care about intent or not.

I have never seen any convincing evidence of ill-intent by
anyone in this area and have a huge amount of sympathy for
"let's take our time and get it right" strategies.  At the same
time, the international community and the evolution of the web
put a huge amount of pressure on populations to get themselves
and their culturally and linguistically relevant materials online
lest they end up even more disadvantaged than they have been.  I
strongly dislike character coding systems that are based on
identification and designation of code pages (for all of the
reasons I trust everyone reading this understands), but their one
advantage over Unicode is that it is feasible to create a page,
register a label, and then deprecate (and eventually abandon) it
and replace it with another one if the coding principles turn out
to have been wrong.  Because Unicode is a single, integrated
system with very strong stability requirements, it is much more
important to get things right the first time; a second attempt is
likely to be extremely painful, if it is even possible.

> As for the WHATWG position, I'd like to remind you that the
> IETF, e.g. for IDNA, in essence has the same policy. You also
> need officially assigned Unicode code points.

Hmm.   I wonder what, exactly, you are referring to.   Certainly
IDNA requires officially assigned Unicode code points.   But the
IDNA criteria include an extremely strong requirement for
stability and predictable string comparisons, far more than is
normally associated with running text and content material more
generally.   Moreover, some artificial (IMO) excitement around
the name-selling community notwithstanding, names are a lot less
important than the content itself -- if one cannot name a page
or other resource optimally, it doesn't prevent having the
content.  But, without content that is usable by the relevant
populations, names are pretty useless.  It seems to me that,
discussions of next-generation URLs aside, HTML5, Encoding
specs, allowed character sets and labels, etc., are much closer
to content.
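For what it's worth, the assigned-and-stable requirement is visible
even in stock tooling.  A small illustration using Python's built-in
"idna" codec (which implements the older IDNA2003 rules, so details
differ from current IDNA, but the principle is the same):

```python
# The ASCII form of an internationalized label must be a
# deterministic, byte-comparable mapping from assigned code points.
label = "bücher"
print(label.encode("idna"))  # b'xn--bcher-kva'

# Code points that nameprep prohibits (e.g. private-use characters)
# are rejected outright, so a label cannot contain them at all --
# there is no "use it now, regularize it later" path for names.
try:
    "\uE000".encode("idna")
except UnicodeError as e:
    print("rejected:", e)
```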

Similarly, while I hope it will be less important in the future
than it has been in the past, the IANA Charset Registry is
intended to be an open resource, with new CCSs and Encodings
able to be registered on the basis of a good description of what
was/is being done, not on passing some test of goodness, whether
goodness is defined by reference to alternate ways of coding what
are more or less the same things or by the number of
current-generation web browsers that support it.

  best,
    john


> Regards,   Martin.

Received on Tuesday, 2 September 2014 19:08:26 UTC