Re: mapping of XML names into programming language from Felix Sasaki on 2006-02-08 (public-xsd-databinding@w3.org from February 2006)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 08 Feb 2006 11:55:20 +0900
To: "Mark Davis" <mark.davis@icu-project.org>
Cc: Paul.V.Biron@kp.org, duerst@it.aoyama.ac.jp, paul.downey@bt.com, public-i18n-core@w3.org, public-xsd-databinding@w3.org, public-xsd-databinding-request@w3.org
Message-ID: <op.s4mvqispx1753t@ibm-60d333fc0ec.mag.keio.ac.jp>

On Wed, 08 Feb 2006 11:47:53 +0900, Mark Davis  
<mark.davis@icu-project.org> wrote:

>
>> For all these languages you have transliteration schemes which describe  
>> how to convert a string in the original script to a version which uses  
>> only latin letters. I think nearly for one of these languages there is  
>> a "standardized", totally accepted scheme. But it seems that for your  
>> purpose it should be enough to choose just one scheme.
> This is not really the case; most non-Latin to Latin transliterations  
> vary quite widely.
>
> Путин ↔ Putin, Poutine, ...
> Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov,  
> Gorbatsov, Gorbatschow, ...

Sorry, [I think nearly for one of these] should have been [I think nearly  
for *n*one of these].

Felix

>
> Mark
>
> Felix Sasaki wrote:
>>
>> Hi Paul,
>>
>> Sorry for the late follow-up. Just a remark to your question below.
>>
>> On Fri, 03 Feb 2006 06:26:40 +0900, <Paul.V.Biron@kp.org> wrote:
>>
>>>
>>>> Conversions such as the one you mention from Kanji to Romaji
>>>> have the advantage that the result is still fairly legible,
>>>> but there are various disadvantages:
>>>> - large dictionary needed
>>>> - not deterministic (there is often more than one way to
>>>>    pronounce a Kanji or Kanji combination)
>>>> - language-specific, which means a different solution for
>>>>    each language is needed
>>>
>>> To provide context for this question from the databinding WG, our goal
>>> is to provide guidance to implementors of databinding toolkits: tools
>>> that take a schema and produce a set of programming language bindings,
>>> e.g., Java classes, that know how to manipulate instances conforming to
>>> the schema.  Most binding tools do something like the following.  Given
>>> this schema document fragment
>>>
>>> <xs:complexType name='MyType'>
>>>         <xs:sequence>
>>>                 <xs:element name='child1' type='xs:string'/>
>>>                 <xs:element name='child2' type='xs:string'
>>>                             maxOccurs='unbounded'/>
>>>         </xs:sequence>
>>> </xs:complexType>
>>>
>>> they will produce a class such as:
>>>
>>> import java.util.List;
>>>
>>> class MyType
>>> {
>>>         String child1 ;
>>>         List<String> child2 ;   // maxOccurs='unbounded' becomes a list
>>> }
>>>
>>> where the element and type names have become names in the programming
>>> language (Java in this case).
>>>
>>> The range of characters that are legal for XML names is much wider than
>>> that supported by many programming languages.  The question is: what
>>> guidance should we give binding tool implementors about what they
>>> should do in the face of XML names that contain characters that aren't
>>> legal in that programming language?
>>>
>>> One option is: replace "bad" characters with punctuation, etc.
>>> Another option is: for languages that have something resembling a
>>> kanji to romaji mapping, automate the mapping (if possible/reasonable).
>>> If such automation is not possible/reasonable, perhaps the tool could
>>> provide a configuration option to allow the user to "manually" specify
>>> the mapping for the particular names used in the schema.
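
The two options above (replace "bad" characters, with a manual mapping as a fallback) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class `XmlNameMapper`, the method `toJavaIdentifier`, and the `overrides` map are hypothetical names, not part of any existing binding toolkit. Note that Java itself accepts most Unicode letters in identifiers, so here the "bad" characters are mainly ones such as '-' and '.'; a target language with a narrower identifier syntax would need a stricter test. Reserved words (e.g. an element named "class") are not handled.

```java
import java.util.Map;

class XmlNameMapper {
    /**
     * Hypothetical helper: maps an XML name to a legal Java identifier.
     * Assumption: 'overrides' holds the user's manual mappings, keyed by XML name.
     */
    static String toJavaIdentifier(String xmlName, Map<String, String> overrides) {
        // A manually configured mapping wins over the automatic rule.
        String manual = overrides.get(xmlName);
        if (manual != null) {
            return manual;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < xmlName.length(); ) {
            int cp = xmlName.codePointAt(i);
            i += Character.charCount(cp);
            boolean first = out.length() == 0;
            boolean legal = first
                    ? Character.isJavaIdentifierStart(cp)
                    : Character.isJavaIdentifierPart(cp);
            if (legal) {
                out.appendCodePoint(cp);
            } else if (first && Character.isJavaIdentifierPart(cp)) {
                // Legal inside an identifier but not at its start: prefix it.
                out.append('_').appendCodePoint(cp);
            } else {
                // "Bad" character: replace it with an underscore.
                out.append('_');
            }
        }
        return out.toString();
    }
}
```

With an empty override map, "order-id" becomes "order_id"; with an override from "東京" to "tokyo", the configured name is used as-is. The automatic rule is not reversible ("a-b" and "a.b" both map to "a_b"), so a real tool would also need some collision handling.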
>>>
>>> We were wondering if i18n had any other options they could recommend or
>>> any advice in general about this problem.
>>>
>>> One question I had was whether languages other than CJK have something
>>> similar to kanji -> romaji?  For instance, do Hebrew, Greek, Thai,
>>> etc. have this concept?
>>
>> For all these languages you have transliteration schemes which describe  
>> how to convert a string in the original script to a version which uses  
>> only latin letters. I think nearly for one of these languages there is  
>> a "standardized", totally accepted scheme. But it seems that for your  
>> purpose it should be enough to choose just one scheme.
>>
>> -- Felix
>>
>>>
>>>> - not reversible (there are many Kanji or Kanji combinations
>>>>    that lead to the same Romaji)
>>>
>>> That should not be a problem since the binding tool can store the
>>> original XML name as metadata for each name in the language binding
>>> for use in serializing instances.
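
One way to picture "store the original XML name as metadata" is a runtime annotation on each generated field, which the serializer reads back when writing an instance. The sketch below is an assumption about how a tool might do this, not a description of any particular toolkit: `OriginalXmlName`, `Address`, `jusho`, and `BindingSerializer` are all made-up names, and the serializer deliberately skips XML escaping and namespaces.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

/** Hypothetical annotation recording the XML name a field was generated from. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface OriginalXmlName {
    String value();
}

/** Example of what a generated binding class might look like. */
class Address {
    // The field uses a romanized name; the XML name is kept as metadata.
    @OriginalXmlName("住所")
    String jusho = "Tokyo";
}

class BindingSerializer {
    /** Writes one field as an element, using the stored XML name if present. */
    static String serializeField(Object bean, String fieldName) {
        try {
            Field f = bean.getClass().getDeclaredField(fieldName);
            f.setAccessible(true);
            OriginalXmlName meta = f.getAnnotation(OriginalXmlName.class);
            // Fall back to the field name when no metadata was recorded.
            String tag = (meta != null) ? meta.value() : f.getName();
            return "<" + tag + ">" + f.get(bean) + "</" + tag + ">";
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot serialize " + fieldName, e);
        }
    }
}
```

Calling `BindingSerializer.serializeField(new Address(), "jusho")` yields `<住所>Tokyo</住所>`, so the lossy identifier mapping never has to be reversed.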
>>>
>>> pvb
>>>

Received on Wednesday, 8 February 2006 02:55:39 UTC