- From: John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>
- Date: Wed, 10 Sep 2014 15:36:13 +0200
- To: uros.milosevic@pupin.rs
- Cc: public-bpmlod@w3.org
- Message-ID: <CAC5njqpoU_is4vOY5yJ66r4C3XeeyOxgWOkKESBUwxqL5Ospfg@mail.gmail.com>
Hi,

"Redundant" means that "sr-Latn" and "sr-Cyrl" used to be registered as whole language tags, and both are now expressed as the combination of the language subtag "sr" and the script subtag "Latn" or "Cyrl", which gives the same "sr-Latn" and "sr-Cyrl" (see 2.7 of RFC 4645 for an explanation).

You are right about transliteration not being the same as transcription, although I will note that an unambiguous, one-to-one mapping is not really the general case. For example, in Chinese every traditional character (as used in HK and Taiwan) maps to a single simplified character (as used in PRC), but the reverse is not true. As such, looking into both true transliteration and transcription may be interesting, and drawing a boundary between the two may be challenging.

Regards,
John

On Wed, Sep 10, 2014 at 12:53 PM, Uros Milosevic <uros.milosevic@pupin.rs> wrote:

> Hi John,
>
> Thanks for looking into the issue, as well as for all your comments and suggestions.
>
> Regarding the language tag solution you mentioned - the IANA language subtag registry [1] says the two are "redundant" (not sure what this means, exactly, to be honest), and I can imagine that establishing the correct use of such tags as a best practice throughout the Linked Data Web would be quite a challenge. Moreover, that's also under the assumption that we have direct control over all those data sources, which is, unfortunately, not the case, and there's already so much information out there that relies on the "simple" form of the tag, i.e. "sr" (or similar, for other digraphic languages).
>
> Now, the real problem here is not the SPARQL or language experts, who are aware of the problem and know exactly what to look for, nor is it the case of querying a dataset encoded in a single alphabet/language. The problem is best illustrated in the case of "common" (i.e. Internet) users, who are presented with nothing but a search box that's supposed to query a large multilingual graph, or multiple graphs/datasets.
> Another example is presenting a user with an annotated piece of text (e.g. provided through Named Entity Recognition), then matching those words and phrases against a *custom* list of knowledge sources (e.g. SPARQL endpoints; a scenario which I, actually, encountered just recently).
>
> Finally, I think you're confusing transliteration with transcription. Unlike transcription, "transliteration is not concerned with representing the phonemics of the original: it only strives to represent the characters accurately" [2]. In Serbian, transliteration works in accordance with a very clear set of rules, whereas transcription is, well, open to interpretation. That's why there would be no real "computational cost", as there should be only one possible transliteration output (for Serbian, at least). So, transliterating "John" from your example would result in "Јохн", but the general idea, i.e. the filter function you proposed, is exactly what I believe Linked Data needs. Also, although transliteration is possible for any language, I think only the languages that are officially digraphic are in need of such a mechanism (Serbian is the only European language that's officially digraphic, meaning you'll find entire documents, books, websites etc. in either of the two scripts).
>
> I'm really willing to look into the issue and, even, provide that first implementation of the transliteration mechanism (for Virtuoso, for instance), but I wouldn't want to be alone in this. That's why I was hoping to spark some community discussion, engagement and support. Thanks, again!
>
> Best,
> Uroš
>
> [1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> [2] http://en.wikipedia.org/wiki/Transliteration
>
> On 10.9.2014 11:21, John P. McCrae wrote:
>
> Hi Uros,
>
> Some thoughts on the issue...
>
> Digraphia is an interesting challenge, but the solutions mostly boil down to treating the two scripts as two separate languages. In this case RDF works fine, assuming you tag the literals correctly as sr (or sr-Cyrl) for Serbian in Cyrillic and sr-Latn for Serbian in Latin; you can, and generally do, omit the script tag for the official script, which in Serbia is Cyrillic. In this case, the problem of querying is now no harder than retrieving a literal by keywords in English and French.
>
> Transliteration, while as you noted deployed in many existing web search systems, still has several challenges, namely computational cost*, accuracy and availability for all languages. As such it seems unlikely that it would be something that could be built into SPARQL systems in general. It is however quite possible that it could be introduced to some systems as an extra function, e.g.,
>
> SELECT ?person {
>   ?person foaf:name ?name .
>   FILTER transliteration(?name, "John", "Latn")
> }
>
> Could return:
>
> <> foaf:name "Джон"@sr
>
> It could be an interesting project to implement such a thing as a specific system. I suspect such a function could not be standardized by the SPARQL WG, but if there were a determined group of people in BPM-LOD willing to provide reference implementations, we could in the context of the group attempt to provide an advisory on the implementation of such a function.
>
> Regards,
> John McCrae
>
> * By computational cost I refer not so much to the cost of transliteration, but the generation of multiple possible transliteration candidates that would need to be checked in the database. For example, in Greek the letters Eta, Iota and Upsilon are all mapped to the Latin letter "i" in modern Greek transliteration.
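
The candidate blow-up described in the footnote above can be sketched roughly as follows. This is only an illustration: the `REVERSE` table and the `candidates` function are hypothetical names, and apart from the "i" entry (the Eta/Iota/Upsilon case mentioned in the footnote) the mapping is a deliberately tiny assumption, not a real Greek romanization scheme.

```python
from itertools import product
from typing import List

# Hypothetical, deliberately tiny reverse mapping (Latin -> possible Greek letters).
# Only the "i" ambiguity comes from the footnote; the other entries are
# illustrative assumptions, and accents are ignored entirely.
REVERSE = {
    "i": ["η", "ι", "υ"],
    "n": ["ν"],
    "k": ["κ"],
}

def candidates(latin: str) -> List[str]:
    """Return every Greek spelling the Latin string could stand for."""
    # Letters missing from the table are passed through unchanged.
    options = [REVERSE.get(ch, [ch]) for ch in latin.lower()]
    # The number of candidates is the product of the per-letter choices.
    return ["".join(combo) for combo in product(*options)]

print(len(candidates("niki")))  # 9 candidates: 1 * 3 * 1 * 3
```

Each ambiguous letter multiplies the number of strings that would have to be checked against the store, which is the cost the footnote refers to.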
>
> On Tue, Sep 2, 2014 at 3:40 PM, Uroš Milošević <uros.milosevic@pupin.rs> wrote:
>
>> Hi all,
>>
>> Perhaps now, with the summer holiday season officially over, my message will get through to the right people. :) I don’t mind trying elsewhere, I just thought that a W3C community group dealing with “Best Practices for Multilingual Linked Open Data” would be a good place to start.
>> Thanks, again.
>>
>> Best,
>> Uroš Milošević
>>
>> *From:* Uroš Milošević [mailto:uros.milosevic@pupin.rs]
>> *Sent:* Tuesday, August 19, 2014 3:50 PM
>> *To:* public-bpmlod@w3.org
>> *Subject:* Digraphia on the Linked Data Web
>>
>> Hi all,
>>
>> Summing up my experiences after three years of work on LOD2 (EU FP7 project), and some time spent with the DBpedia extraction framework, I’ve come to some conclusions related to Linked Data and digraphic languages (i.e. those that use multiple writing systems) [1,2] I would like to share with you.
>>
>> As some of you may (or may not) know, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing information in any modern knowledge base, but can often be a major obstacle for information retrieval.
>>
>> For instance, most Serbs rely on the Latin alphabet for colloquial communication/interaction on the Web. That means a large portion of the information is (and, often, expected to be) encoded in Latin-2. And, yet, most of the information on the Serbian Wikipedia is encoded in Serbian Cyrillic (the alphabets are considered equal in Serbian, so the choice is only a matter of preference). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e.
>> vice-versa) on-the-fly, at retrieval time (Wikipedia does it), many attempts at information extraction will be doomed to fail (imagine searching for “parser” when that particular string in your knowledge store is encoded only in Cyrillic, as “парсер”).
>>
>> This is not that big of a problem when working with local databases. However, on the Linked Data Web, where, most of the time, you don’t know the alphabet the information you need is stored in, this is a huge pitfall. This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.
>>
>> After noticing this problem while working on the Serbian version of DBpedia, my original idea was to improve the DBpedia Extraction Framework. Now that I know the actual scale and possible repercussions of the problem (for example, Hindustani alone, which is also digraphic, is spoken by some 500 million people worldwide), I see it needs to be addressed on an entirely different level.
>>
>> Therefore, I’m wondering what it would take to take this to the right people and, ultimately, implement a global (i.e. SPARQL) mechanism for dealing with digraphia. I’m not sure if this is the right place to start, but if not, then, hopefully, some of you will point me in the right direction. Thanks in advance!
>>
>> Best,
>> Uroš Milošević
>>
>> ------------------------------
>> Institute Mihajlo Pupin
>> 15 Volgina, 11060 Belgrade, Serbia
>> Phone: +381 65 3220223, Fax: +381 11 6776583
>>
>> [1] http://en.wikipedia.org/wiki/Digraphia
>> [2] http://www.omniglot.com/language/articles/digraphia/
>>
>> ------------------------------
>> This email is free from viruses and malware because avast! Antivirus <http://www.avast.com/> protection is active.
>> >> >> >> >> ------------------------------ >> <http://www.avast.com/> >> >> This email is free from viruses and malware because avast! Antivirus >> <http://www.avast.com/> protection is active. >> > >
Attachments
- image/gif attachment: blocked.gif
Received on Wednesday, 10 September 2014 13:36:47 UTC