- From: Uros Milosevic <uros.milosevic@pupin.rs>
- Date: Wed, 10 Sep 2014 12:53:22 +0200
- To: "John P. McCrae" <jmccrae@cit-ec.uni-bielefeld.de>
- Cc: public-bpmlod@w3.org, johnmccrae@gmail.com
- Message-ID: <fb508a5d3f8b0e76109a02ae8578dbcb@pupin.rs>
Hi John,

Thanks for looking into the issue, as well as for all your comments and suggestions.

Regarding the language tag solution you mentioned - the IANA language subtag registry [1] says the two are "redundant" (not sure what this means, exactly, to be honest), and I can imagine that establishing the correct use of such tags as a best practice throughout the Linked Data Web would be quite a challenge. Moreover, that's also under the assumption that we have direct control over all those data sources, which is, unfortunately, not the case, and there's already so much information out there that relies on the "simple" form of the tag, i.e. "sr" (or similar, for other digraphic languages).

Now, the real problem here is not the SPARQL or language experts, who are aware of the problem and know exactly what to look for, nor is it the case of querying a dataset encoded in a single alphabet/language. The problem is best illustrated in the case of "common" (i.e. Internet) users, who are presented with nothing but a search box that's supposed to query a large multilingual graph, or multiple graphs/datasets. Another example is presenting a user with an annotated piece of text (e.g. provided through Named Entity Recognition), then matching those words and phrases against a *custom* list of knowledge sources (e.g. SPARQL endpoints) - a scenario which I actually encountered just recently.

Finally, I think you're confusing transliteration with transcription. Unlike transcription, "transliteration is not concerned with representing the phonemics of the original: it only strives to represent the characters accurately" [2]. In Serbian, transliteration works in accordance with a very clear set of rules, whereas transcription is, well, open to interpretation. That's why there would be no real "computational cost", as there should be only one possible transliteration output (for Serbian, at least). So, transliterating "John" from your example would result in "Јохн", but the general idea, i.e. the filter function you proposed, is exactly what I believe Linked Data needs.

Also, although transliteration is possible for any language, I think only the languages that are officially digraphic are in need of such a mechanism (Serbian is the only European language that's officially digraphic, meaning you'll find entire documents, books, websites etc. in either of the two scripts).

I'm really willing to look into the issue and even provide that first implementation of the transliteration mechanism (for Virtuoso, for instance), but I wouldn't want to be alone in this. That's why I was hoping to spark some community discussion, engagement and support.

Thanks, again!

Best,
Uroš

[1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
[2] http://en.wikipedia.org/wiki/Transliteration

On 10.9.2014 11:21, John P. McCrae wrote:

> Hi Uros,
>
> Some thoughts on the issue...
>
> Digraphia is an interesting challenge, but the solutions mostly boil down to treating the two scripts as two separate languages. In this case RDF works fine, assuming you tag the literals correctly as sr (or sr-Cyrl) for Serbian in Cyrillic and sr-Latn for Serbian in Latin; you can, and generally do, omit the script subtag for the official script, which in Serbia is Cyrillic. In this case, the problem of querying is now no harder than retrieving a literal by keywords in English and French.
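To make the tagging concrete, a single resource carrying its label in both scripts might be recorded like this in Turtle (a minimal sketch; the resource IRI and labels are purely illustrative):

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Hypothetical resource with the same label in both scripts.
    # "sr-Cyrl" could equally be written as plain "sr", since Cyrillic is
    # the official script and its script subtag is generally omitted.
    <http://example.org/resource/Beograd>
        rdfs:label "Београд"@sr-Cyrl ,
                   "Beograd"@sr-Latn .

With tags like these, a query can pick one script by filtering on the language tag (e.g. FILTER langMatches(lang(?label), "sr-Latn")), or cover both scripts by matching against the bare "sr" range.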
> Transliteration, while as you noted deployed in many existing web search systems, still has several challenges, namely computational cost*, accuracy and availability for all languages. As such it seems unlikely that it would be something that could be built into SPARQL systems in general. It is, however, quite possible that it could be introduced to some systems as an extra function, e.g.,
>
>   SELECT ?person {
>     ?person foaf:name ?name .
>     FILTER transliteration(?name, "John", "Latn")
>   }
>
> Could return:
>
>   <> foaf:name "Джон"@sr
>
> It could be an interesting project to implement such a thing as a specific system. I suspect such a function could not be standardized by the SPARQL WG, but if there were a determined group of people in BPM-LOD willing to provide reference implementations, we could, in the context of the group, attempt to provide an advisory on the implementation of such a function.
>
> Regards,
> John McCrae
>
> * By computational cost I refer not so much to the cost of transliteration as to the generation of multiple possible transliteration candidates that would need to be checked in the database. For example, in Greek the letters Eta, Iota and Upsilon are all mapped to the Latin letter "i" in modern Greek transliteration.
>
> On Tue, Sep 2, 2014 at 3:40 PM, Uroš Milošević <uros.milosevic@pupin.rs> wrote:
>
>> Hi all,
>>
>> Perhaps now, with the summer holiday season officially over, my message will get through to the right people. :) I don't mind trying elsewhere, I just thought that a W3C community group dealing with "Best Practices for Multilingual Linked Open Data" would be a good place to start. Thanks, again.
>>
>> Best,
>> Uroš Milošević
>>
>> From: Uroš Milošević [mailto:uros.milosevic@pupin.rs]
>> Sent: Tuesday, August 19, 2014 3:50 PM
>> To: public-bpmlod@w3.org
>> Subject: Digraphia on the Linked Data Web
>>
>> Hi all,
>>
>> Summing up my experiences after three years of work on LOD2 (an EU FP7 project), and some time spent with the DBpedia extraction framework, I've come to some conclusions related to Linked Data and digraphic languages (i.e. those that use multiple writing systems) [1,2] that I would like to share with you.
>>
>> As some of you may (or may not) know, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both the (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing information in any modern knowledge base, but can often be a major obstacle for information retrieval.
>>
>> For instance, most Serbs rely on the Latin alphabet for colloquial communication/interaction on the Web. That means a large portion of the information is (and, often, expected to be) encoded in Latin-2. And yet, most of the information on the Serbian Wikipedia is encoded in Serbian Cyrillic (the alphabets are considered equal in Serbian, so the choice is only a matter of preference). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa) on the fly, at retrieval time (Wikipedia does it), many attempts at information extraction will be doomed to fail (imagine searching for "parser" when that particular string in your knowledge store is encoded only in Cyrillic, as "парсер").
>>
>> This is not that big of a problem when working with local databases. However, on the Linked Data Web, where, most of the time, you don't know the alphabet the information you need is stored in, this is a huge pitfall.
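As a concrete illustration of the retrieval miss described above: against a hypothetical store whose only label is "парсер"@sr, a plain lookup in the other script returns nothing, and without a transliteration function the query has to spell out both scripts by hand (prefixes omitted, as in the sketch above):

    # Finds nothing if the store only holds  ?s rdfs:label "парсер"@sr
    SELECT ?s WHERE { ?s rdfs:label "parser"@sr-Latn }

    # Manual workaround: enumerate both scripts explicitly
    SELECT ?s WHERE {
      { ?s rdfs:label "parser"@sr-Latn }
      UNION
      { ?s rdfs:label "парсер"@sr }
    }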
>> This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.
>>
>> After noticing this problem while working on the Serbian version of DBpedia, my original idea was to improve the DBpedia Extraction Framework. Now that I know the actual scale and possible repercussions of the problem (for example, Hindustani alone, which is also digraphic, is spoken by some 500 million people worldwide), I see it needs to be addressed on an entirely different level.
>>
>> Therefore, I'm wondering what it would take to bring this to the right people and, ultimately, implement a global (i.e. SPARQL) mechanism for dealing with digraphia. I'm not sure if this is the right place to start, but if not, then, hopefully, some of you will point me in the right direction. Thanks in advance!
>>
>> Best,
>>
>> Uroš Milošević
>>
>> ------------------------------
>> Institute Mihajlo Pupin
>> 15 Volgina, 11060 Belgrade, Serbia
>> Phone: +381 65 3220223, Fax: +381 11 6776583
>>
>> [1] http://en.wikipedia.org/wiki/Digraphia
>> [2] http://www.omniglot.com/language/articles/digraphia/
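The determinism discussed earlier in the thread (a single possible transliteration output for Serbian) comes from the fact that each Serbian Cyrillic letter corresponds to exactly one letter or digraph in Gaj's Latin alphabet. A minimal Python sketch of that character-by-character romanization step, for illustration only; the function name and structure are assumptions, not an existing implementation:

    # Serbian Cyrillic -> Gaj's Latin: a fixed, rule-based mapping with a
    # single possible output (illustrative sketch, not a finished library).
    CYR_TO_LAT = {
        "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
        "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
        "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
        "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
        "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
    }

    def romanize(text):
        """Convert Serbian Cyrillic text to Gaj's Latin, letter by letter."""
        out = []
        for ch in text:
            latin = CYR_TO_LAT.get(ch.lower())
            if latin is None:            # not a Serbian Cyrillic letter
                out.append(ch)
            elif ch.isupper():           # preserve capitalization ("Б" -> "B")
                out.append(latin.capitalize())
            else:
                out.append(latin)
        return "".join(out)

    print(romanize("парсер"))   # -> parser
    print(romanize("Београд"))  # -> Beograd

The reverse (Latin-to-Cyrillic) direction is slightly trickier, since letter pairs such as "lj", "nj" and "dž" can, in rare words, stand for two separate Cyrillic letters rather than one.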
Received on Wednesday, 10 September 2014 10:56:44 UTC