- From: Uros Milosevic <uros.milosevic@pupin.rs>
- Date: Thu, 04 Sep 2014 16:45:09 +0200
- To: public-sparql-dev@w3.org
- Message-ID: <d0af35326b349aaaae2a33286c93bb43@pupin.rs>
Hi all,

Summing up my experiences after three years of work on LOD2 (EU FP7 project [1]), and some time spent with the DBpedia Extraction Framework, I've come to a conclusion on the subject of Linked Data and digraphic languages (i.e. those that use multiple writing systems) [2,3] that I find important and would like to share with all of you.

As some of you may (or may not) know, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both the (Serbian) Cyrillic and (Gaj's) Latin alphabets. This is absolutely fine for storing information in any modern knowledge base, but can often be a major obstacle for information retrieval.

For instance, most Serbs rely on the Latin alphabet for colloquial communication/interaction on the Web. That means a large portion of the information is (and is often expected to be) encoded in Latin-2. And yet, most of the information on the Serbian Wikipedia is encoded in Serbian Cyrillic (the alphabets are considered equal in Serbian, so the choice is only a matter of preference). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa) on the fly, at retrieval time (Wikipedia does it), many attempts at information extraction will be doomed to fail.

Example: Imagine searching for "parser" when that particular string in your knowledge store is encoded only in Cyrillic, as "парсер".

You see, this is not that big of a problem when working with local databases. However, on the Linked Data Web, where, most of the time, you don't know which alphabet the information you need is stored in, this is a huge pitfall.

Example: Imagine querying multiple/multilingual graphs.

This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.

After noticing this problem while working on the Serbian version of DBpedia, my original idea was to improve the DBpedia Extraction Framework.
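
To make the "parser" vs. "парсер" example concrete, here is a minimal Python sketch of on-the-fly romanization at retrieval time. The function names (romanize, script_insensitive_match) are hypothetical and chosen only for illustration; this is not how Wikipedia or the DBpedia Extraction Framework implements it, just one possible approach under the assumption that both sides of a comparison can be normalized to Gaj's Latin alphabet.

```python
# Serbian Cyrillic -> Gaj's Latin mapping (lowercase letters only).
# Assumption: a plain per-character table is enough for illustration.
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Derive the uppercase entries. Digraphs come out in title case
# (Љ -> Lj), which is imperfect for words written in all capitals.
_CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(_CYR_TO_LAT.items())}
)


def romanize(text: str) -> str:
    """Convert Serbian Cyrillic characters to Latin; leave everything else as is."""
    return "".join(_CYR_TO_LAT.get(ch, ch) for ch in text)


def script_insensitive_match(keyword: str, label: str) -> bool:
    """Compare two strings after normalizing both to lowercase Latin."""
    return romanize(keyword).casefold() == romanize(label).casefold()
```

With this, script_insensitive_match("parser", "парсер") holds even though the two strings share no characters. Romanization is the easier direction: going from Latin back to Cyrillic is ambiguous, because "lj", "nj" and "dž" can each be either one Cyrillic letter or two, so normalizing towards Latin is the simpler choice for comparison.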
Now that I know the actual scale and possible repercussions of the problem (for example, Hindustani alone, which is also digraphic, is spoken by some 500 million (!) people worldwide), I see it needs to be addressed on an entirely different level.

Therefore, I'm wondering what it would take to bring this to the right people and, ultimately, implement a global (i.e. SPARQL) mechanism for dealing with digraphia. I'm not sure if this is the right place to start, but if not, then, hopefully, some of you will point me in the right direction.

Thanks in advance!

Best,
Uroš Milošević
------------------------------
Institute Mihajlo Pupin
15 Volgina, 11060 Belgrade, Serbia
Phone: +381 11 6771398, Fax: +381 11 6776583

[1] http://lod2.eu
[2] http://en.wikipedia.org/wiki/Digraphia
[3] http://www.omniglot.com/language/articles/digraphia/
Received on Thursday, 4 September 2014 14:48:13 UTC