Questions re I18N from Lee Anne Phillips on 1999-10-20 (www-xml-schema-comments@w3.org from October to December 1999)

From: Lee Anne Phillips <leeanne@leeanne.com>
Date: Wed, 20 Oct 1999 11:19:26 -0700
To: www-xml-schema-comments@w3.org
Message-Id: <4.2.0.58.19991020091122.00a0d370@shell16.ba.best.com>

Dear Working Group,

I apologize if these questions have been raised before and resolved.

Is it your intention to force the exclusive use of Western European so-called Arabic numeral glyphs for digits:

[36] digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

despite the fact that XML 1.0 defines digits far more broadly to include the glyphs used by all major language writing systems?
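
To make the distinction concrete, here is a minimal Python sketch. It assumes that Unicode's decimal-digit property is an acceptable stand-in for the broader digit set; that set differs in detail from the XML 1.0 Digit production, and the function names are purely illustrative. It contrasts an ASCII-only check, as in production [36], with a conversion that accepts digits from other scripts.

```python
import re
import unicodedata

# Production [36] as written: only the ten ASCII digits are allowed.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_digits(text: str) -> bool:
    """Return True only if text consists solely of '0'..'9'."""
    return ASCII_DIGITS.fullmatch(text) is not None


def to_ascii_digits(text: str) -> str:
    """Map any Unicode decimal digits to their ASCII equivalents.

    Assumption: Unicode's decimal property approximates the broader
    digit set; it is not identical to the XML 1.0 Digit production.
    """
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        if value is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        out.append(str(value))
    return "".join(out)


# Arabic-Indic digits for 1999, written with escapes for portability.
arabic_indic_1999 = "\u0661\u0669\u0669\u0669"
print(is_ascii_digits(arabic_indic_1999))   # rejected by the ASCII-only rule
print(to_ascii_digits(arabic_indic_1999))   # normalized for interchange
```

Under this arrangement the lexical form stays open to local digit glyphs while the stored value remains a single canonical form.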

Also, although you do list internationalization as an explicit issue, do you contemplate using the Gregorian calendar as defined in ISO 8601 exclusively? It seems to me that using a standard designed as an interchange format to govern display and entry validation is a misuse of that standard.

As others have pointed out, there are many uses for non-Gregorian dates, including dates outside the range covered by the ISO standard. I would think that one is meant to convert local dates and times, in whatever convention is used locally, into a standard form for interchange and possibly storage in order to assure compatibility and consistency when transmitting or retrieving the data.

Scaliger's Julian Day might seem more appropriate for storage and interchange of historical dates but who can remember them or type them in? The existence of the ISO standard should not force people to enter dates and times in a foreign language using conventions set by a committee. Does it?

[41] dateLiteral ::= CCYYMMDD
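
For illustration only, the following Python sketch parses a CCYYMMDD literal of the kind production [41] describes and converts it to a Julian Day Number. The function names are hypothetical, and the sketch assumes the proleptic Gregorian calendar that Python's `datetime` module uses, which covers only years 1 through 9999 and so shares the range limitation raised above.

```python
from datetime import date, datetime

# Julian Day Number minus Python's proleptic Gregorian ordinal.
# For example, 2000-01-01 has ordinal 730120 and Julian Day Number 2451545.
JDN_OFFSET = 1721425


def parse_date_literal(text: str) -> date:
    """Parse a CCYYMMDD literal such as '19991020'."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a CCYYMMDD literal: {text!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(text, "%Y%m%d").date()


def julian_day_number(d: date) -> int:
    """Return the Julian Day Number (the day beginning at noon UTC)."""
    return d.toordinal() + JDN_OFFSET


d = parse_date_literal("19991020")
print(d.isoformat(), julian_day_number(d))
```

The two forms carry the same information; the difference is only in which one a person can reasonably be asked to type.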

Also, the full ISO standard offers several alternative interchange formats, including year, week number, and weekday number. The alternatives are ignored although they might well be more appropriate for some situations. It's difficult to understand whether these things are just placeholders until you fix them or whether you plan to leave them in place and work on "more important" things. You don't really expound on these philosophical details in the two working drafts.
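
As a small sketch of one such alternative, the ISO 8601 week-date form (year, week number, weekday number) can be produced and read back with Python's standard library. The helper names are illustrative, the extended `YYYY-Www-D` spelling is assumed, and `date.fromisocalendar` requires Python 3.8 or later.

```python
from datetime import date


def to_week_date(d: date) -> str:
    """Format a date as an ISO 8601 week date, e.g. '1999-W42-3'."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return f"{iso_year:04d}-W{iso_week:02d}-{iso_weekday}"


def from_week_date(text: str) -> date:
    """Parse an extended-format ISO 8601 week date (YYYY-Www-D)."""
    year_part, week_part, day_part = text.split("-")
    if not week_part.startswith("W"):
        raise ValueError(f"not a week date: {text!r}")
    return date.fromisocalendar(int(year_part), int(week_part[1:]), int(day_part))


sent = date(1999, 10, 20)          # the Wednesday this message was sent
print(to_week_date(sent))          # 1999-W42-3
print(from_week_date("1999-W42-3") == sent)
```

Both spellings denote the same day, so a schema could in principle accept either lexical form and map it to one value.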

I can imagine forcing user input and display to be converted from and to these internal representations by some process in the user agent, which makes some sense from the standpoint of programming the validation routines, but in other places you seem to envision users entering the information directly and without intervening steps.

It seems to me that some sort of locale-specific processing is inevitable in an international community. In spite of Curt Arnold's thoughtful idea of locale-specific choices for postal codes, for example, postal codes vary in their preferred location by country, before or after the city, after the state or province, with or without an intervening comma, on a line by themselves, and so on. State or city names may be required to be in all capital letters locally or there may be no political division between the city and the country at all. And of course a local address should leave off the country for local destinations while sending a letter outside one's country requires that the destination country be included.

The same problem exists for telephone numbers which, although fairly regular in North America, may vary in length within many other countries and may include DTMF hex digits A, B, C, and D in some military PBX and internal Telco applications, even in the US. It's possible to use DTMF * and # "digits" as well in some applications. They are by no means decimal integers globally and are defined by the Telcos as hex digits. They only *look* like decimal numbers to most of us.
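
A brief Python sketch of why an integer type is a poor fit: it treats a dialable string as a sequence over the sixteen DTMF symbols (0-9, A-D, * and #) rather than as a decimal number. The names are hypothetical, and the decision to accept lowercase letters is an assumption of the sketch.

```python
# The sixteen DTMF symbols: ten digits, four letters, star and pound.
DTMF_SYMBOLS = frozenset("0123456789ABCD*#")


def is_dtmf_sequence(text: str) -> bool:
    """Return True if every character is a DTMF symbol (case-insensitive)."""
    return bool(text) and all(ch in DTMF_SYMBOLS for ch in text.upper())


def fits_decimal_integer(text: str) -> bool:
    """Return True if the string could be stored as a plain decimal integer."""
    return text.isascii() and text.isdigit()


for sample in ("5551234", "*67#", "12AB"):
    print(sample, is_dtmf_sequence(sample), fits_decimal_integer(sample))
```

Even for all-digit numbers, storing them as integers would silently drop leading zeros, which is a further reason to treat them as strings.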

Without a very complex and interlocking scheme of variations, which may change at any moment (e.g. the change in many US cities (including mine) from 7-digit local numbers to mandatory 11-digit dialing for all numbers), and an equally complex method of translating or transmitting these variations to and from an underlying canonical database representation, the user agent is left to figure these things out for itself. And quite properly so, since updating a large database to accommodate the whims of the telephone company seems excessive. In fact, the mandatory 11-digit migration was prompted partly by reluctance to force even more widespread database conversions through an area code split.

Canonical representations with local conversions make sense in many situations. For example, rather than storing explicit Zip codes, which could conceivably be replaced by "Wing codes" if the post office took a mind to it and thought they could save money by the change, one can use a generic postal code bin with local processing. That allows the US Post Office to migrate (as they have) from Zip to Zip+4 (Carrier Route), to Zip+4+2 (Delivery Point Bar Code), or eventually to DNS-style addressing of individuals no matter where they move in the world, without necessarily forcing database redesign. Delivery Point codes should only be used as bar codes and are never (at least for now) represented on the written address; Zip and Zip+4 appear in one format on the alphanumeric address and another format in the bar code. How many variants do we need? One can account for local processing requirements with a locale or usage attribute and leave the data alone.
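
One way to picture the locale-attribute idea is the Python sketch below. Everything in it is an assumption made for illustration: the `PostalCode` type, the locale tags, and the two display rules (a hyphen in nine-digit US codes, a space in six-character Canadian codes). The stored value is left untouched and only the presentation varies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostalCode:
    """A generic postal code bin: an opaque value plus a locale tag."""
    value: str    # stored without separators, never rewritten
    locale: str   # hypothetical tag such as "US" or "CA"


def _format_us(value: str) -> str:
    # Nine digits are shown as Zip+4; anything else is shown as stored.
    return f"{value[:5]}-{value[5:]}" if len(value) == 9 else value


def _format_ca(value: str) -> str:
    # Six characters are shown in two groups of three.
    return f"{value[:3]} {value[3:]}" if len(value) == 6 else value


# Local processing lives here and can change without touching the data.
FORMATTERS = {"US": _format_us, "CA": _format_ca}


def display(code: PostalCode) -> str:
    """Render a stored code for its locale, falling back to the raw value."""
    formatter = FORMATTERS.get(code.locale)
    return formatter(code.value) if formatter else code.value


print(display(PostalCode("941101234", "US")))
print(display(PostalCode("K1A0B1", "CA")))
```

A change in postal convention then means adding or replacing one formatter rather than converting the database.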

Surely the Y2K debacle has alerted us to the hidden costs of forcing data into Procrustean beds.

Thank you for your kind attention,

Lee Anne

Received on Wednesday, 20 October 1999 15:23:14 UTC