- From: John C Klensin <john+w3c@jck.com>
- Date: Thu, 28 Aug 2014 18:07:28 -0400
- To: Andrew Cunningham <lang.support@gmail.com>
- cc: wwwintl <www-international@w3.org>, Larry Masinter <masinter@adobe.com>, "Phillips, Addison" <addison@lab126.com>, Richard Ishida <ishida@w3.org>
--On Friday, August 29, 2014 05:18 +1000 Andrew Cunningham <lang.support@gmail.com> wrote:

> On 29/08/2014 4:10 AM, "John C Klensin" <john+w3c@jck.com>
> wrote:
>
>> The one solace here and the one I hope all involved can agree
>> on (or have already) is that, with the exception of writing
>> systems whose scripts have not yet been encoded in Unicode,
>> everyone ought to be moving away from historical encodings
>> and toward UTF-8 as soon as possible. That is the real
>> solution to the problem of different definitions and the
>> issues they can cause: just move forward to Standard UTF-8 to
>> get away from them and consider the present mess as added
>> incentive.
>
> Unfortunately, that ship has already sailed. UTF-8 already
> suffers from the same problem. The term some of us use for it
> is pseudo-Unicode.

I have been somewhat aware of the problem; it is why I said "Standard UTF-8". Obviously (at least from my perspective), for the cases you are talking about, a migration path is needed for pseudo-Unicode too.

> For some languages, a sizeable amount of content is in this
> category.

First of all, just to clarify for those reading this (including, to some extent, me), does that "pseudo-Unicode" differ from

-- the Standard UTF-8 encoding, i.e., what a standard-conforming UTF-8 encoder would produce given a list of code points (assigned or not), or

-- the Unicode code point assignments, i.e., it uses private code space and/or "squats" on unassigned code points, perhaps in so-far completely unused or sparsely populated planes, or

-- established Unicode conventions, i.e., it combines existing and standardized Unicode points using conventions (and perhaps special font support) about how those sequences are interpreted that are not part of the Unicode Standard?

It seems to me that the implications of those three types of deviations (to use the most neutral term I can think of) are quite different. In particular, if the practices are common enough, it is possible to imagine workarounds, including transitional and backward-compatibility models such as careful application of specialized canonical decomposition and/or compatibility relationships, that would smooth over a lot of the interoperability problems (a minimal sketch of one such mapping appears below). If I were part of UTC, I might find some of those mechanisms a lot more (or less) attractive than others, but anyone willing to adopt unstandardized (and I assume inconsistent) pseudo-Unicode models is probably going to be willing to adopt equally unstandardized normalization or other mapping methods if necessary.

It may not be your intent, but my inference from your note is that you don't expect those who are using their own pseudo-Unicode mechanisms today to ever be interested in converting to something more standard when more appropriate code points are assigned. If we can extrapolate from history, that just is not correct: while I expect that we will have to deal with, e.g., ASCII and ISO 8859-1 web pages as legacy / historical issues on the web for many years to come, the reality is that most of the community is migrating off of them, especially for new materials, just as it/we migrated a generation ago from ISO 646 national character positions and then from ISO 2022 designation of various code pages. The migration is obviously not complete: I hope the web is in better shape, but I've received multiple email messages coded in what the IANA Charset Registry calls ISO-2022-JP and GB2312 within the last couple of hours.
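To make that concrete: below is a minimal, purely hypothetical sketch (in Python, with every code point and name invented for illustration) of the kind of transitional mapping mentioned above, for the second sort of deviation -- private-use or squatted code points carried in otherwise Standard UTF-8.

    # Hypothetical transitional mapping: pseudo-Unicode code points of the
    # second kind above (private-use points a community adopted before
    # standard assignments existed) rewritten to the standard code points.
    # All values here are invented for illustration.
    PSEUDO_TO_STANDARD = {
        0xE000: 0x1000,  # hypothetical pre-assignment PUA use
        0xE001: 0x1001,  # hypothetical
    }

    def migrate_pseudo_unicode(text: str) -> str:
        """Map invented pseudo-Unicode code points to standard equivalents."""
        return text.translate(PSEUDO_TO_STANDARD)

    # The bytes below are perfectly valid Standard UTF-8; the deviation is
    # in the code point usage, not in the encoding form.
    raw = b"\xee\x80\x80\xee\x80\x81"            # U+E000 U+E001
    migrated = migrate_pseudo_unicode(raw.decode("utf-8"))
    print([hex(ord(c)) for c in migrated])       # ['0x1000', '0x1001']

A mapping for the third sort of deviation (standard code points combined under non-standard conventions) would have to rewrite sequences rather than single code points, which is roughly where the specialized decomposition and compatibility machinery mentioned above would come in.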
But, if we can't set a target of Standard-conforming Unicode encoded in Standard-conforming UTF-8 and start figuring out how to label, transform, and/or develop compatibility mechanisms with interim solutions (viewed as such) as needed, then we will find ourselves back in the state we were in before either "charset" labels or ISO 2022 designations using standardized designators. I'm old enough to have lived through that period -- it was not a lot of fun, there were always ambiguities about what one was looking at, and, well, it just is not sustainable except in very homogeneous communities who don't care about communicating with others. The _huge_ advantage of Standard Unicode and Standard UTF-8 in that regard is that it provides a single interoperability target, not the chaos of needing N-squared solutions.

> To add to the problem some handset (mobile/cell phone) and
> tablet manufacturers have baked in pseudo-Unicode for specific
> languages.

And so? When, in the next generation or the one after that, they discover that the economics of international marketing and manufacturing create a strong bias toward common platforms, those same economics will drive them toward global standards. The conversions and backward-compatibility issues, whether sorted out on the newer phones or, more likely, on servers that deal with the older ones, will probably be painful, especially so if they have identified what they are doing simply as "utf-8", but the conversions will occur, or those who are sticking to the pseudo-Unicode conventions will find themselves at a competitive disadvantage, eventually one severe enough to focus the attention.

From the perspective of the reasons the IANA charset registry was created in the first place (and, yes, I'm among the guilty parties), one of its advantages is that, if one finds oneself using pseudo-Unicode in Lower Slobbovia, one could fairly easily register and use "utf-8-LowerSlobbovian" and at least tell others (and oneself in the future) what character coding and encoding form was being used. Borrowing a note or two from Larry, if I were involved in the pseudo-Unicode community, I'd be pushing for IETF action to define a naming convention and subcategory of charsets so that, e.g., someone receiving "utf-8-LowerSlobbovian" could know that what they were getting was basically UTF-8 and Unicode except in specific areas or code point ranges (a sketch of how a receiver might act on such a label follows at the end of this message). But maybe this generation of systems needs to experience the chaos of unidentified character repertoires and encodings before that will be considered practical.

> As the expression goes, that ship has already sailed.

If the ship that has sailed is taking us into the sea in which moderately precise identification of repertoires, coding, and encoding forms is impossible, I suggest that it will take on water rapidly enough that we'd better start thinking about either salvage operations or alternate ships. As the other expression goes, been there, done that, didn't enjoy it much.

   john
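To make the labeling idea concrete, here is a minimal sketch of how a receiver might act on a hypothetical "utf-8-<variant>" charset name. The variant name, the code point ranges, and the lookup table are invented for illustration; no such charsets are registered today.

    # Hypothetical handling of a "utf-8-<variant>" label: the suffix tells
    # the receiver that the bytes decode as Standard UTF-8 while naming the
    # code point ranges whose interpretation follows a local convention.
    HYPOTHETICAL_VARIANTS = {
        # variant suffix -> ranges carrying locally defined semantics
        "lowerslobbovian": [(0xE000, 0xE0FF)],
    }

    def decode_labeled(payload: bytes, charset: str):
        """Decode bytes labeled "utf-8" or a hypothetical "utf-8-<variant>".

        Returns the decoded text plus the code point ranges the receiver
        should treat as locally defined rather than as standard Unicode.
        """
        name = charset.lower()
        if name == "utf-8":
            return payload.decode("utf-8"), []
        if name.startswith("utf-8-"):
            variant = name[len("utf-8-"):]
            ranges = HYPOTHETICAL_VARIANTS.get(variant, [])
            # Still plain UTF-8 at the byte level; only the semantics differ.
            return payload.decode("utf-8"), ranges
        raise LookupError("unsupported charset: " + charset)

    text, special = decode_labeled(b"\xee\x80\x81", "utf-8-LowerSlobbovian")
    print(hex(ord(text)), special)   # 0xe001 [(57344, 57599)]

The point of such a label is only that software can fall back to ordinary UTF-8 handling while knowing exactly which ranges deserve extra care.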
Received on Thursday, 28 August 2014 22:07:59 UTC