Re: [Encoding] false statement [I18N-ACTION-328][I18N-ISSUE-374] from Andrew Cunningham on 2014-08-29 (www-international@w3.org from July to September 2014)

From: Andrew Cunningham <lang.support@gmail.com>
Date: Fri, 29 Aug 2014 11:17:08 +1000
To: John C Klensin <john+w3c@jck.com>
Cc: wwwintl <www-international@w3.org>, Larry Masinter <masinter@adobe.com>, "Phillips, Addison" <addison@lab126.com>, Richard Ishida <ishida@w3.org>
Message-ID: <CAGJ7U-WwcdEfj++PA7HtXUk1Y=rWP2wEnDK-r+hLfuw55CRaaw@mail.gmail.com>
Hi John, in essence I agree with you. The problem is that we had a huge mess in
the past, we still have a mess, and work still needs to be done. The
encoding spec is one piece of the puzzle, but by itself it will do nothing;
other aspects need to be addressed.

On 29 August 2014 08:07, John C Klensin <john+w3c@jck.com> wrote:

>
>
> --On Friday, August 29, 2014 05:18 +1000 Andrew Cunningham
> <lang.support@gmail.com> wrote:
>
>
> > For some languages, a sizeable amount of content is in this
> > category.
>
> First of all, just to clarify for those reading this (including,
> to some extent, me), does that "pseudo-Unicode" differ from
>
>  -- the Standard UTF-8 encoding, i.e., what a
>         standard-conforming UTF-8 encoder would produce given a
>         list of code points (assigned or not),  or
>
>
as far as I know, this does not happen


>  -- the Unicode code point assignments, i.e., it uses
>         private code space and/or "squats" on unassigned code
>         points, perhaps in so-far completely unused or sparsely
>         populated planes, or
>
>
this sometimes happens



>  -- established Unicode conventions by combining existing
>         and standardized Unicode points using conventions (and
>         perhaps special font support) about how those sequences
>         are interpreted that are not part of the Unicode
>         Standard.
>
>

This is the most common approach, but there are two subcategories within
it:

1) the font uses OpenType features combined with a visual approach, so
most of the usage is as defined in Unicode, except with respect to
reordering.

2) a visual character model (reminiscent of old legacy encodings) is
superimposed over a Unicode block; this often involves redefining existing
code points.
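
To make the reordering point concrete, here is a minimal Python sketch. It
assumes a simplified, hypothetical visual-order spelling of one syllable
(KA plus vowel sign E) and is not taken from any particular font. The names
LOGICAL, VISUAL, is_valid_utf8 and starts_with_vowel_sign_e are mine, for
illustration only. It shows that both spellings are perfectly valid UTF-8, so
a conforming decoder cannot tell them apart; only script-specific checks can
hint at the difference.

```python
# Assumption: two spellings of the same syllable, for illustration only.
# Standard Unicode stores the vowel sign after the consonant (logical order);
# a visual-order model stores it first, the way it appears on screen.
LOGICAL = "\u1000\u1031"  # MYANMAR LETTER KA, then MYANMAR VOWEL SIGN E
VISUAL = "\u1031\u1000"   # same code points, stored in display order


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def starts_with_vowel_sign_e(text: str) -> bool:
    """Crude hint only: some whitespace-separated chunk begins with U+1031."""
    return any(word.startswith("\u1031") for word in text.split())


if __name__ == "__main__":
    for label, text in (("logical", LOGICAL), ("visual", VISUAL)):
        encoded = text.encode("utf-8")
        print(label, encoded.hex(), is_valid_utf8(encoded),
              starts_with_vowel_sign_e(text))
```

The check in starts_with_vowel_sign_e is deliberately naive. Real content
would need far more context than a single code point to be classified
reliably.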


> It may not be your intent, but my inference from your note is
> that you don't expect those who are using their own
> pseudo-Unicode mechanisms today to ever be interested in
> converting to something more standard when  more appropriate
> code points are assigned.  If we can extrapolate from history,
> that just is not correct:
>

For instance, with respect to Burmese (I will use this as an example since
it is the use case I am most familiar with), there are three distinct camps
within the Burmese IT community. One advocates Unicode; two advocate
differing non-standard implementations, all delivered using UTF-8 as the
character encoding.

One of the non-standard approaches has traditionally been dominant. The
reasons for this are historical, and basically involve the lag time in
Microsoft and Apple adding Myanmar support to their products.

Standard Unicode approaches have been gaining ground among Burmese
developers. But with the changing environment, especially the growing
importance of cheap mobile technologies over laptop or workstation access
to the internet in countries like Myanmar, the pendulum is swinging back in
favour of non-standard approaches. This has been deepened by a number of
handset manufacturers and telecommunications companies adding non-standard
font and input mechanisms to mobile devices, entrenching non-standard
approaches even further.

To complicate the situation, each of the other key languages using Myanmar
script has its own non-standard approaches, each incompatible with the
others.

Part of the problem is that some developers within Myanmar have been
advocating and campaigning for standard approaches, usually without outside
support, and the key people have either become too jaded or no longer have
the time for such activities.

> But, if we can't set a target of Standard-conforming Unicode
> encoded in Standard-conforming UTF-8 and start figuring out how
> to label, transform, and/or develop compatibility mechanisms
> with interim solutions (viewed as such) as needed, then we will
> find ourselves back in the state we were in before either
> "charset" labels or ISO 2022 designations using standardized
> designators.   I'm old enough to have lived through that period
> -- it was not a lot of fun, there were always ambiguities about
> what one was looking at, and, well, it just is not sustainable
> except in very homogeneous communities who don't care about
> communicating with others.
>
> The _huge_ advantage of Standard Unicode and Standard UTF-8 in
> that regard is that it provides a single interoperability
> target, not the chaos of needing N-squared solutions.
>
>
I heartily agree ... but the crux of the problem is a technical issue, not
a standards issue. Without appropriate rendering engines on devices, uptake
of standards-based approaches can be difficult, if not impossible in some
cases.


> > To add to the problem some handset (mobile/cell phone) and
> > tablet manufacturers have baked in pseudo-Unicode for specific
> > languages.
>
> And so?  When, in the next generation or the one after that,
> they discover that the economics of international marketing and
> manufacturing create a strong bias toward common platforms,
> those same economics will drive them toward global standards.
> The conversions and backward-compatibility issues, whether
> sorted out on the newer phones or, more likely, on servers that
> deal with the older ones, will probably be painful, especially
> so if they have identified what they are doing simply as
> "utf-8", but the conversions will occur or those who are
> sticking to the pseudo-Unicode conventions will find themselves
> at a competitive disadvantage, eventually one severe enough to
> focus the attention.
>
>
in theory, but that is at least 5 years away.

I agree with you: migration to and enablement of standards-compliant Unicode
needs to occur, the sooner the better. But I would argue that developers
need to be proactive in fostering this. It is not just a question of
encodings and migrating to standards-based Unicode implementations; it is
also a question of access to appropriate fonts, font rendering technologies,
and font fallback approaches.

For instance, there is not a single Unicode font that supports all of the
Myanmar script adequately. A pan-Myanmar font would need to support a
minimum of six OpenType language systems (including DFLT). Browsers would
then need to expose these in their UIs, allowing users to tailor their
default fonts to their needs.

Even UTN 11 appears to have major gaps and deficiencies for some of the
languages it covers.

> From the perspective of the reasons the IANA charset registry
> was created in the first place (and, yes, I'm among the guilty
> parties), one of its advantages is that, if one finds oneself
> using pseudo-Unicode in Lower Slobbovia, one could fairly easily
> register and use "utf-8-LowerSlobbovian" and at least tell
> others (and oneself in the future) what character coding and
> encoding form was being used.  Borrowing a note or two from
> Larry, if I were involved in the pseudo-Unicode community, I'd
> be pushing for IETF action to define a naming convention and
> subcategory of charsets so that, e.g., someone receiving
> "utf-8-LowerSlobbovian" could know that what they were getting
> was basically UTF-8 and Unicode except in specific areas or code
> point ranges.  But maybe this generation of systems needs to
> experience the chaos of unidentified character repertoires and
> encodings before that will be considered practical.
>
> > As the expression goes, that ship has already sailed.
>
> If the ship that has sailed is taking us into the sea in which
> moderately precise identification of repertoires, coding, and
> encoding forms is impossible, I suggest that it will take on
> water rapidly enough that we'd better start thinking about
> either salvage operations or alternate ships.
>
> As the other expression goes, been there, done that, didn't
> enjoy it much.
>

Ditto. One just has to look at how much legacy content was labelled as
iso-8859-1 or windows-125, when it wasn't, in order to get web content to
display in lots of different languages. And this is still being done for
some languages.
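
As a small illustration of why such labels prove nothing, here is a Python
sketch, assuming Python's own "iso-8859-1" codec (browsers following the
Encoding spec treat that label as windows-1252 instead). The helper name
decodes_cleanly and the sample string are mine. A single-byte label accepts
any byte sequence without complaint, so mislabelled content never raises an
error; it just silently becomes something else.

```python
def decodes_cleanly(data: bytes, label: str) -> bool:
    """Return True if the bytes decode under the given label without error."""
    try:
        data.decode(label)
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value, in order.
EVERY_BYTE = bytes(range(256))

# Assumption: a two-character Myanmar sample, chosen only for illustration.
SAMPLE = "\u1000\u1031"
SAMPLE_UTF8 = SAMPLE.encode("utf-8")

# Decoding UTF-8 bytes under the wrong label succeeds, one character per byte.
MISLABELLED = SAMPLE_UTF8.decode("iso-8859-1")

if __name__ == "__main__":
    print(decodes_cleanly(EVERY_BYTE, "iso-8859-1"))
    print(decodes_cleanly(EVERY_BYTE, "utf-8"))
    print(len(SAMPLE), len(MISLABELLED))
    # The damage is reversible only if you know what the real encoding was.
    print(MISLABELLED.encode("iso-8859-1").decode("utf-8") == SAMPLE)
```

The last line is the important one: recovery depends entirely on knowing the
real encoding out of band, which is exactly the information a wrong label
throws away.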

I heartily endorse Unicode. It is all I use, and has been for a long time.

Unfortunately there is still work that needs to be done.



-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham@slv.vic.gov.au
          lang.support@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
Received on Friday, 29 August 2014 01:17:37 UTC