Re: mapping of XML names into programming language from Paul.V.Biron@kp.org on 2006-02-02 (public-i18n-core@w3.org from January to March 2006)

From: <Paul.V.Biron@kp.org>
Date: Thu, 2 Feb 2006 13:26:40 -0800
To: duerst@it.aoyama.ac.jp
Cc: paul.downey@bt.com, public-i18n-core@w3.org, public-xsd-databinding@w3.org, public-xsd-databinding-request@w3.org
Message-Id: <OF53975D46.E8352786-ON88257109.0072BACE-88257109.0075CC4D@KP.ORG>

> Conversions such as the one you mention from Kanji to Romaji
> have the advantage that the result is still fairly legible,
> but there are various disadvantages:
> - large dictionary needed
> - not deterministic (there is often more than one way to
>    pronounce a Kanji or Kanji combination)
> - language-specific, which means a different solution for
>    each language is needed

To provide context for this question from the databinding WG, our goal is 
to provide guidance to implementors of databinding toolkits: tools that 
take a schema and produce a set of programming language bindings, e.g., 
Java classes, that know how to manipulate instances conforming to the 
schema.  Most binding tools do something like the following.  Given this 
schema document fragment

<xs:complexType name='MyType'>
        <xs:sequence>
                <xs:element name='child1' type='xs:string'/>
                <xs:element name='child2' type='xs:string'
                            maxOccurs='unbounded'/>
        </xs:sequence>
</xs:complexType>

they will produce a class such as:

import java.util.List;

class MyType
{
        String child1;        // from <xs:element name='child1'/>
        List<String> child2;  // maxOccurs='unbounded' becomes a list
}

where the element and type names have become names in the programming 
language (Java in this case).

The range of characters that are legal for XML names is much wider than 
that supported by many programming languages.  The question is: what 
guidance should we give binding tool implementors about what they should 
do in the face of XML names that contain characters that aren't legal in 
that programming language?

One option is: replace "bad" characters with punctuation, etc.
Another option is: for languages that have something resembling a kanji 
to romaji mapping, automate the mapping (if possible/reasonable).  If 
such automation is not possible/reasonable, perhaps the tool could provide 
a configuration option to allow the user to "manually" specify the mapping 
for the particular names used in the schema.
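
As a rough illustration of how these two options could be combined, here is 
a minimal sketch in Java.  The class NameMapper and its methods are 
hypothetical and not taken from any existing binding tool.  It assumes that 
the replacement character is an underscore and that a user-supplied override 
always wins over the automatic replacement.  It does not handle names that 
collide after replacement or that happen to be Java keywords.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any existing binding toolkit.
class NameMapper {
    // User-supplied overrides: XML name -> identifier chosen by hand.
    private final Map<String, String> overrides = new HashMap<>();

    void addOverride(String xmlName, String identifier) {
        overrides.put(xmlName, identifier);
    }

    // Maps an XML name to a Java identifier. A manual override is used
    // if one exists; otherwise every code point that Java does not
    // accept at that position is replaced with '_'.
    // Keywords and name collisions are NOT handled in this sketch.
    String toIdentifier(String xmlName) {
        String manual = overrides.get(xmlName);
        if (manual != null) {
            return manual;
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < xmlName.length()) {
            int cp = xmlName.codePointAt(i);
            boolean legal = (out.length() == 0)
                    ? Character.isJavaIdentifierStart(cp)
                    : Character.isJavaIdentifierPart(cp);
            if (legal) {
                out.appendCodePoint(cp);
            } else {
                out.append('_');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, an XML name such as 'my-child.1' would come out as 
my_child_1.  Note that Java itself accepts many non-ASCII letters in 
identifiers, so for Java the automatic step leaves a kanji name untouched 
and only the manual override would turn it into something like romaji; a 
target language with a narrower identifier syntax would need a stricter 
legality check.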

We were wondering if i18n had any other options they could recommend or 
any advice in general about this problem.

One question I had was whether languages other than CJK have something 
similar to kanji -> romaji?  For instance, do Hebrew, Greek, Thai, etc. 
have this concept?

> - not reversible (there are many Kanji or Kanji combinations
>    that lead to the same Romaji)

That should not be a problem since the binding tool can store the original 
XML name as metadata for each name in the language binding for use in 
serializing instances.
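
To make the metadata idea concrete, here is one way it might look in Java, 
using a runtime annotation.  This is only a sketch: the annotation 
OriginalXmlName, the generated class My_type and the lookup helper XmlNames 
are all hypothetical names chosen for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical annotation that records the name used in the schema.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.FIELD})
@interface OriginalXmlName {
    String value();
}

// What a tool might generate for a type named 'my-type' with a child
// element named 'child-1' (both illegal as Java identifiers).
@OriginalXmlName("my-type")
class My_type {
    @OriginalXmlName("child-1")
    String child_1;
}

// Hypothetical helper a serializer could use to recover the XML name.
class XmlNames {
    static String of(Class<?> type, String fieldName) {
        try {
            Field f = type.getDeclaredField(fieldName);
            OriginalXmlName n = f.getAnnotation(OriginalXmlName.class);
            // Fall back to the Java name when no metadata was recorded.
            return n != null ? n.value() : f.getName();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field: " + fieldName, e);
        }
    }
}
```

A serializer would then call XmlNames.of(My_type.class, "child_1") to get 
'child-1' back, so the mapping from XML name to identifier does not itself 
need to be reversible.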

pvb

Received on Thursday, 2 February 2006 21:27:07 UTC