[xmlschema] Multilingual lookup tables from Jack Lindsey on 2003-04-12 (xmlschema-dev@w3.org from April 2003)

From: Jack Lindsey <Jack@Ottawa.com>
Date: Sat, 12 Apr 2003 00:29:12 -0400
To: "Jeni Tennison" <jeni@jenitennison.com>
Cc: <xmlschema-dev@w3.org>
Message-ID: <000a01c300ac$11f5c440$dfa27018@slnt.phub.net.cable.rogers.com>
Jeni:
Many thanks for the advice and the ref to UBL's "Garden of Eden" which
really seems to suit our purposes. Besides, it's always nice when there's a
name for what you want to do ;-)

OK, multilingual code tables:

As a general rule, the XML world seems to favour the transmission of code
value meanings rather than just the code, e.g.
 
<State>New York</State> not <State>NY</State> 

to minimize the chance of misinterpretation or avoid the necessity for
smarts at the receiving end, I suppose. 

Anyway, Canada is an officially bilingual country where federal systems
always support output in the language of your choice. A language preference
is chosen on a signon screen or web page. Many systems even support language
switching from any screen in the application. As a result you can see field
labels, static text and coded information in the language of your choice,
since in many locations anglophones and francophones work side-by-side. You
are stuck with the language of data entry for user-keyed text, but that may
be a good thing based on my experiences with Altavista or L&H, unless you
are looking to start a random international incident. Apart from English and
French, interoperability with the US and NAFTA might entail Spanish, and
Inuktitut in Nunavut, though that would require a Unicode encoding such as
UTF-8 or UTF-16. 

To support this mandatory requirement for multilingual output from XML data
exchange across our diverse community, we are thinking in terms of the
traditional database code table approach, where a language-independent code,
typically a dumb number except for some international codes, would be
transmitted and expanded into the language of choice on output, e.g. 

1 = Nova Scotia or Nouvelle-Écosse 
2 = British Columbia or Colombie-Britannique 
3 = New Brunswick or Nouveau-Brunswick 

(Just an example because actually our schemas will use ISO Country
Subdivision for this particular one, i.e. CA-BC, US-NY, MX-DUR) 

Obviously this entails significant smarts at either end and centrally
maintained code tables, but we don't see any alternative.
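
To make the idea concrete, here is a minimal sketch in Python of expanding a
language-independent code on output. The table name, the function name and
the fallback-to-English behaviour are assumptions for illustration only, not
anything we have built:

```python
# Hypothetical in-memory code table: code -> label per language.
# The codes and labels are the illustrative ones from the example above.
PROVINCE_LABELS = {
    "1": {"en": "Nova Scotia", "fr": "Nouvelle-Écosse"},
    "2": {"en": "British Columbia", "fr": "Colombie-Britannique"},
    "3": {"en": "New Brunswick", "fr": "Nouveau-Brunswick"},
}


def expand_code(code, lang, table=PROVINCE_LABELS, fallback_lang="en"):
    """Return the label for `code` in `lang`.

    Falls back to `fallback_lang` when the table has no label in the
    requested language (an assumed policy). Raises KeyError for an
    unknown code rather than guessing.
    """
    labels = table[code]
    return labels.get(lang, labels[fallback_lang])


if __name__ == "__main__":
    # The same transmitted code rendered for two different users.
    print(expand_code("1", "en"))
    print(expand_code("1", "fr"))
```

Only the code travels in the XML; the receiving end needs a current copy of
the table, which is the "smarts at either end" cost mentioned above.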
 
??? Comments, experiences, good examples??? 

OK. Schema design and performance considerations when:
(a) value sets can vary from a handful to hundreds, and occasionally a
couple of thousand;
(b) we will have around 100 different code tables to manage;
(c) we want an approach that can be viably used in XSLT, DOM and SAX.

1. XML Schema Elements or Attributes? 

An old chestnut but a critical choice here. In the lookup table examples in
your XSLT books, you and Michael Kay always use attributes for codes.  Some
people say always use attributes for "meta data".  Why? Certainly it's
very attractive because it's compact and more efficient.  But some folk seem
very anti-attribute, maybe because of additional limitations in DTD and
earlier associated tools?  Then there's always the fear with attributes that
changing requirements will force their conversion to elements resulting in
unnecessary disruption in the community, so elements are a safer strategy.
For instance, in the example below, the need for multiple medical
conditions and disabilities might not have been anticipated first time
around.

Attribute Style Example:

<Being beingSerialNumber="0002345" genderID="3" speciesID="42"> 
  <BeingBirthDate>1980-04-13</BeingBirthDate> 
  <MedicalConditionNarrative medicalConditionID="23">Not responding to treatment.</MedicalConditionNarrative> 
  <MedicalConditionNarrative medicalConditionID="16">Clearing up nicely.</MedicalConditionNarrative> 
  <Disability disabilityID="17"/> 
  <Disability disabilityID="24"/> 
</Being> 

By their nature, beingSerialNumber, genderID, speciesID could never occur
multiply per Being instance.  A Being can have multiple medical conditions
requiring a code plus text.  But Disabilities can occur multiply per Being
with no data except the code, resulting in the mixed empty element-attribute
approach which is the worst performer, according to Scott Bonneau (XML
Design Handbook, p. 43), although in our model the number of occurrences in
these situations would usually be very low. 

Element Style Example:

<Being>
  <BeingSerialNumber>0002345</BeingSerialNumber> 
  <GenderID>3</GenderID> 
  <SpeciesID>42</SpeciesID> 
  <BeingBirthDate>1980-04-13</BeingBirthDate> 
  <MedicalCondition>
    <MedicalConditionID>23</MedicalConditionID>
    <MedicalConditionNarrative>Not responding to treatment.</MedicalConditionNarrative> 
  </MedicalCondition>
  <MedicalCondition>
    <MedicalConditionID>16</MedicalConditionID>
    <MedicalConditionNarrative>Clearing up nicely.</MedicalConditionNarrative> 
  </MedicalCondition>
  <Disability>17</Disability>
  <Disability>24</Disability>  
</Being> 
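
For what it's worth, a rough sketch of how a DOM-ish consumer would read the
two styles, using Python's standard xml.etree.ElementTree. The two samples
are trimmed versions of the examples above, and the function names and the
shape of the returned dictionary are made up for illustration:

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_STYLE = """
<Being beingSerialNumber="0002345" genderID="3" speciesID="42">
  <BeingBirthDate>1980-04-13</BeingBirthDate>
  <Disability disabilityID="17"/>
  <Disability disabilityID="24"/>
</Being>
"""

ELEMENT_STYLE = """
<Being>
  <BeingSerialNumber>0002345</BeingSerialNumber>
  <GenderID>3</GenderID>
  <SpeciesID>42</SpeciesID>
  <BeingBirthDate>1980-04-13</BeingBirthDate>
  <Disability>17</Disability>
  <Disability>24</Disability>
</Being>
"""


def read_attribute_style(xml_text):
    """Codes live in attributes; repeated codes need empty elements."""
    being = ET.fromstring(xml_text.strip())
    return {
        "serial": being.get("beingSerialNumber"),
        "gender": being.get("genderID"),
        "species": being.get("speciesID"),
        "disabilities": [d.get("disabilityID")
                         for d in being.findall("Disability")],
    }


def read_element_style(xml_text):
    """Codes live in element content; repetition needs no special case."""
    being = ET.fromstring(xml_text.strip())
    return {
        "serial": being.findtext("BeingSerialNumber"),
        "gender": being.findtext("GenderID"),
        "species": being.findtext("SpeciesID"),
        "disabilities": [d.text for d in being.findall("Disability")],
    }


if __name__ == "__main__":
    print(read_attribute_style(ATTRIBUTE_STYLE))
    print(read_element_style(ELEMENT_STYLE))
```

Both readers recover the same information, so the choice looks to be about
size, evolution and tooling rather than expressive power. The sketch says
nothing about relative performance.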

My personal XML schema style started off very attribute-heavy, swung to
almost attribute-free, and now I am looking for a rational equilibrium!

??? Comments, opinions??? 

2. Code Table Design considerations - which make sense?
(a) Use XML files for max reuse and update independence, i.e. to minimize 
schema / XSLT changes, etc. 
(b) One language per XML file to halve their size because most users rarely
switch to the other official language, especially in mid-stream. 
(c) One XML file per code table to keep the size down (I assume the whole
document would have to be represented in memory in XSLT?)
OR the opposite:
(d) Combine lots of commonly used tables in the same XML file to avoid the
overhead of opening lots of files per transaction.
(e) Where a table is accessed typically once, and no more than 10 times per
transaction, XSLT key constructs would be a waste of time???
(f) Instead of matching strings, use sequential, dumb number code values to
directly reference the required entry in the code table by position. Old
entries could not be removed and an attribute would be needed to indicate a
status of retired, but would this offer a significant performance benefit with 
large code tables??? 
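
To pin down what (b), (c), (e) and (f) would mean in practice, here is a
small Python sketch using xml.etree.ElementTree. The CodeTable/Entry layout,
the "status" attribute and the function names are all hypothetical; the
dictionary index stands in for what an XSLT key would do, and the positional
lookup is option (f):

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout: one code table, one language (options b and c).
# Entry 2 is marked retired purely to illustrate option (f).
PROVINCE_TABLE_FR = """
<CodeTable name="Province" lang="fr">
  <Entry code="1">Nouvelle-Écosse</Entry>
  <Entry code="2" status="retired">Colombie-Britannique</Entry>
  <Entry code="3">Nouveau-Brunswick</Entry>
</CodeTable>
"""


def lookup_by_scan(table, code):
    """Linear search: no setup cost, may be fine for a few lookups (e)."""
    for entry in table.iter("Entry"):
        if entry.get("code") == code:
            return entry.text
    return None


def build_index(table):
    """Build a code -> label map once, roughly what an XSLT key buys you."""
    return {entry.get("code"): entry.text for entry in table.iter("Entry")}


def lookup_by_position(table, code):
    """Option (f): code N is the Nth entry (1-based).

    Only valid if entries are never removed or reordered, which is why
    retired entries stay in the file with a status attribute.
    """
    entries = list(table)
    return entries[int(code) - 1].text


if __name__ == "__main__":
    table = ET.fromstring(PROVINCE_TABLE_FR.strip())
    index = build_index(table)
    print(lookup_by_scan(table, "3"))
    print(index["1"])
    print(lookup_by_position(table, "2"))
```

Whether the index or the positional trick pays off with a couple of thousand
entries and at most ten lookups per transaction is exactly the open question;
this only shows the three access patterns side by side and would need
measuring with real tables.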

??? comments, other suggestions ???

Cheers Jack

-------------------------------
"Smart data structures and dumb code work a lot better than the other way around."

-- Eric S. Raymond, "The Cathedral and the Bazaar"
Received on Saturday, 12 April 2003 00:32:49 UTC