- From: Jack Lindsey <Jack@Ottawa.com>
- Date: Sat, 12 Apr 2003 00:29:12 -0400
- To: "Jeni Tennison" <jeni@jenitennison.com>
- Cc: <xmlschema-dev@w3.org>
- Message-ID: <000a01c300ac$11f5c440$dfa27018@slnt.phub.net.cable.rogers.com>
Jeni:

Many thanks for the advice and the ref to UBL's "Garden of Eden", which really seems to suit our purposes. Besides, it's always nice when there's a name for what you want to do ;-)

OK, multilingual code tables:

As a general rule, the XML world seems to favour transmitting code value meanings rather than just the code, e.g. <State>New York</State> not <State>NY</State>, to minimize the chance of misinterpretation or to avoid the need for smarts at the receiving end, I suppose.

Anyway, Canada is an officially bilingual country where federal systems always support output in the language of your choice. A language preference is chosen on a sign-on screen or web page. Many systems even support language switching from any screen in the application. As a result you can see field labels, static text and coded information in the language of your choice, since in many locations anglophones and francophones work side by side. You are stuck with the language of data entry for user-keyed text, but that may be a good thing based on my experiences with Altavista or L&H, unless you are looking to start a random international incident. Apart from English and French, interoperability with the US and NAFTA might entail Spanish, and Inuktitut in Nunavut, though that would require a Unicode encoding such as UTF-8 or UTF-16.

To support this mandatory requirement for multilingual output from XML data exchange across our diverse community, we are thinking in terms of the traditional database code table approach, where a language-independent code, typically a dumb number except for some international codes, would be transmitted and expanded into the language of choice on output, e.g.

  1 = Nova Scotia or Nouvelle-Écosse
  2 = British Columbia or Colombie-Britannique
  3 = New Brunswick or Nouveau-Brunswick

(Just an example because actually our schemas will use ISO Country Subdivision codes (ISO 3166-2) for this particular one, i.e.
CA-BC, US-NY, MX-DUR)

Obviously this entails significant smarts at either end and centrally maintained code tables, but we don't see any alternative.

??? Comments, experiences, good examples ???

OK. Schema design and performance considerations when:

(a) value sets can vary from a handful to hundreds, and occasionally a couple of thousand;
(b) we will have around 100 different code tables to manage;
(c) we want an approach that can be viably used in XSLT, DOM and SAX.

1. XML Schema Elements or Attributes?

An old chestnut but a critical choice here. In the lookup table examples in your XSLT books, you and Michael Kay always use attributes for codes. Some people say always use attributes for "meta data". Why? Certainly it's very attractive because it's compact and more efficient. But some folk seem very anti-attribute, maybe because of additional limitations in DTDs and earlier associated tools? Then there's always the fear with attributes that changing requirements will force their conversion to elements, resulting in unnecessary disruption in the community, so elements are a safer strategy. For instance, in the example below, the need for multiple medical conditions and disabilities might not have been anticipated first time around.

Attribute Style Example:

  <Being beingSerialNumber="0002345" genderID="3" speciesID="42">
    <BeingBirthDate>1980-04-13</BeingBirthDate>
    <MedicalConditionNarrative medicalConditionID="23">Not responding to treatment.</MedicalConditionNarrative>
    <MedicalConditionNarrative medicalConditionID="16">Clearing up nicely.</MedicalConditionNarrative>
    <Disability disabilityID="17"/>
    <Disability disabilityID="24"/>
  </Being>

By their nature, beingSerialNumber, genderID and speciesID could never occur more than once per Being instance. A Being can have multiple medical conditions, each requiring a code plus text.
But Disabilities can occur multiple times per Being with no data except the code, resulting in the mixed empty-element-plus-attribute approach, which is the worst performer according to Scott Bonneau (XML Design Handbook, p. 43), although in our model the number of occurrences in these situations would usually be very low.

Element Style Example:

  <Being>
    <BeingSerialNumber>0002345</BeingSerialNumber>
    <GenderID>3</GenderID>
    <SpeciesID>42</SpeciesID>
    <BeingBirthDate>1980-04-13</BeingBirthDate>
    <MedicalCondition>
      <MedicalConditionID>23</MedicalConditionID>
      <MedicalConditionNarrative>Not responding to treatment.</MedicalConditionNarrative>
    </MedicalCondition>
    <MedicalCondition>
      <MedicalConditionID>16</MedicalConditionID>
      <MedicalConditionNarrative>Clearing up nicely.</MedicalConditionNarrative>
    </MedicalCondition>
    <Disability>17</Disability>
    <Disability>24</Disability>
  </Being>

My personal XML schema style started off very attribute-heavy, swung to almost attribute-free, and now I am looking for a rational equilibrium!

??? Comments, opinions ???

2. Code Table Design considerations - which make sense?

(a) Use XML files for max reuse and update independence, i.e. to minimize schema / XSLT changes, etc.

(b) One language per XML file to halve their size, because most users rarely switch to the other official language, especially in mid-stream.

(c) One XML file per code table to keep the size down (I assume the whole document would have to be represented in memory in XSLT???)

OR the opposite:

(d) Combine lots of commonly used tables in the same XML file to avoid the overhead of opening lots of files per transaction.

(e) Where a table is accessed typically once, and no more than 10 times per transaction, XSLT key constructs would be a waste of time???

(f) Instead of matching strings, use sequential, dumb number code values to directly reference the required entry in the code table by position.
Old entries could not be removed, and an attribute would be needed to indicate a status of retired, but would this offer a significant performance benefit with large code tables???

??? Comments, other suggestions ???

Cheers
Jack

-------------------------------
"Smart data structures and dumb code work a lot better than the other way around."
-- Eric S. Raymond, "The Cathedral and the Bazaar"
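
As a rough illustration of the code table approach described in this message (a language-independent code is transmitted and then expanded into the reader's language on output, with one small XML document per table and language as in options (b) and (c)), here is a minimal Python sketch. Everything named in it is an assumption made for illustration: the CodeTable and Code element names, the id and lang attributes, and the load_table and expand helpers are hypothetical, not part of any schema discussed in this thread. The dictionary built by load_table plays roughly the role an xsl:key would play in XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical code table documents: one table per language per document.
TABLES = {
    "en": """<CodeTable name="Province" lang="en">
  <Code id="1">Nova Scotia</Code>
  <Code id="2">British Columbia</Code>
  <Code id="3">New Brunswick</Code>
</CodeTable>""",
    "fr": """<CodeTable name="Province" lang="fr">
  <Code id="1">Nouvelle-Écosse</Code>
  <Code id="2">Colombie-Britannique</Code>
  <Code id="3">Nouveau-Brunswick</Code>
</CodeTable>""",
}


def load_table(xml_text):
    """Parse one code table and index its entries by code value."""
    root = ET.fromstring(xml_text)
    return {code.get("id"): code.text for code in root.findall("Code")}


def expand(code, lang, tables):
    """Return the label for a code in the given language.

    Falls back to the bare code when the table has no matching entry.
    """
    return tables[lang].get(code, code)


# Build every index once, then reuse it for each lookup.
tables = {lang: load_table(text) for lang, text in TABLES.items()}

print(expand("1", "fr", tables))
print(expand("2", "en", tables))
```

Returning the bare code for an unknown value is only one possible policy; a receiving system might prefer to reject the document instead.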
Received on Saturday, 12 April 2003 00:32:49 UTC