- From: Mark Davis <mark.davis@icu-project.org>
- Date: Tue, 07 Feb 2006 18:47:53 -0800
- To: Felix Sasaki <fsasaki@w3.org>
- CC: Paul.V.Biron@kp.org, duerst@it.aoyama.ac.jp, paul.downey@bt.com, public-i18n-core@w3.org, public-xsd-databinding@w3.org, public-xsd-databinding-request@w3.org
> For all these languages you have transliteration schemes which
> describe how to convert a string in the original script to a version
> which uses only Latin letters. I think nearly for one of these
> languages there is a "standardized", totally accepted scheme. But it
> seems that for your purpose it should be enough to choose just one
> scheme.

This is not really the case; most non-Latin to Latin transliterations vary quite widely.

Путин ↔ Putin, Poutine, ...
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow, ...

Mark

Felix Sasaki wrote:
>
> Hi Paul,
>
> Sorry for the late follow-up. Just a remark to your question below.
>
> On Fri, 03 Feb 2006 06:26:40 +0900, <Paul.V.Biron@kp.org> wrote:
>
>>
>>> Conversions such as the one you mention from Kanji to Romaji
>>> have the advantage that the result is still fairly legible,
>>> but there are various disadvantages:
>>> - large dictionary needed
>>> - not deterministic (there is often more than one way to
>>>   pronounce a Kanji or Kanji combination)
>>> - language-specific, which means a different solution for
>>>   each language is needed
>>
>> To provide context for this question from the databinding WG, our
>> goal is to provide guidance to implementors of databinding toolkits:
>> tools that take a schema and produce a set of programming language
>> bindings, e.g., Java classes, that know how to manipulate instances
>> conforming to the schema. Most binding tools do something like the
>> following. Given this schema document fragment
>>
>> <xs:complexType name='MyType'>
>>   <xs:sequence>
>>     <xs:element name='child1' type='xs:string'/>
>>     <xs:element name='child2' type='xs:string'
>>                 maxOccurs='unbounded'/>
>>   </xs:sequence>
>> </xs:complexType>
>>
>> they will produce a class such as:
>>
>> class MyType
>> {
>>     String child1 ;
>>     List<String> child2 ;
>> }
>>
>> where the element and type names have become names in the programming
>> language (Java in this case).
>>
>> The range of characters that are legal for XML names is much wider than
>> that supported by many programming languages. The question is: what
>> guidance should we give binding tool implementors about what they should
>> do in the face of XML names that contain characters that aren't legal in
>> that programming language?
>>
>> One option is: replace "bad" characters with punctuation, etc.
>> Another option is: for languages that have something resembling a kanji
>> to romaji mapping, automate the mapping (if possible/reasonable). If
>> such automation is not possible/reasonable, perhaps the tool could
>> provide a configuration option to allow the user to "manually" specify
>> the mapping for the particular names used in the schema.
>>
>> We were wondering if i18n had any other options they could recommend or
>> any advice in general about this problem.
>>
>> One question I had was whether languages other than CJK have something
>> similar to kanji -> romaji? For instance, do Hebrew, Greek, Thai, etc.
>> have this concept?
>
> For all these languages you have transliteration schemes which
> describe how to convert a string in the original script to a version
> which uses only Latin letters. I think nearly for one of these
> languages there is a "standardized", totally accepted scheme. But it
> seems that for your purpose it should be enough to choose just one
> scheme.
>
> -- Felix
>
>>
>>> - not reversible (there are many Kanji or Kanji combinations
>>>   that lead to the same Romaji)
>>
>> That should not be a problem since the binding tool can store the
>> original XML name as metadata for each name in the language binding
>> for use in serializing instances.
>>
>> pvb
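
For illustration only, the options raised in this thread (escape characters the target language cannot use, let the user override individual names by hand, and keep the original XML name as metadata) might be combined roughly as follows. This is a minimal sketch, not something any binding tool is known to do: the class NameBinding, its methods, the "_uXXXX" escape format, the accent-stripping step, and the "_2" collision suffix are all hypothetical, and the target is assumed to accept only ASCII letters, digits, and underscores.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps XML names to ASCII-only identifiers. */
final class NameBinding {
    // identifier -> original XML name, kept as metadata for serialization
    private final Map<String, String> originals = new LinkedHashMap<>();
    // user-supplied manual mappings (XML name -> identifier)
    private final Map<String, String> overrides;

    NameBinding(Map<String, String> overrides) {
        this.overrides = overrides;
    }

    /** Returns an identifier for xmlName and records the original name. */
    String bind(String xmlName) {
        String base = overrides.getOrDefault(xmlName, escape(xmlName));
        String id = base;
        int n = 2;
        // The mapping is not reversible, so two XML names may collide;
        // disambiguate with a numeric suffix (an assumption of this sketch).
        while (originals.containsKey(id) && !originals.get(id).equals(xmlName)) {
            id = base + "_" + n++;
        }
        originals.put(id, xmlName);
        return id;
    }

    /** Looks up the XML name an identifier was generated from. */
    String originalName(String identifier) {
        return originals.get(identifier);
    }

    /** Deterministic fallback: no dictionary, no language-specific rules. */
    static String escape(String xmlName) {
        // NFD splits accented letters into base letter + combining mark.
        String decomposed = Normalizer.normalize(xmlName, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        for (int cp : decomposed.codePoints().toArray()) {
            if (cp < 128 && (Character.isLetterOrDigit(cp) || cp == '_')) {
                sb.appendCodePoint(cp);               // already legal
            } else if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                // drop combining marks (lossy; the original is stored in bind)
            } else {
                sb.append(String.format("_u%04X", cp)); // escape everything else
            }
        }
        if (sb.length() == 0 || Character.isDigit(sb.charAt(0))) {
            sb.insert(0, '_');                        // must not start with a digit
        }
        return sb.toString();
    }
}
```

With a manual override from "Путин" to "Putin", bind would return "Putin" and originalName("Putin") would give back the Cyrillic name for serialization; without an override, the same name falls back to the escaped form, which is legal but not legible. That is the trade-off the thread describes.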
Received on Wednesday, 8 February 2006 02:47:58 UTC