Digraphia on the Linked Data Web from Uroš Milošević on 2014-08-19 (public-bpmlod@w3.org from August 2014)

From: Uroš Milošević <uros.milosevic@pupin.rs>
Date: Tue, 19 Aug 2014 15:49:48 +0200
To: <public-bpmlod@w3.org>
Message-ID: <025001cfbbb4$7159c650$540d52f0$@milosevic@pupin.rs>

Hi all,

Summing up my experience after three years of work on LOD2 (an EU FP7 project), and some time spent with the DBpedia Extraction Framework, I’ve reached some conclusions about Linked Data and digraphic languages (i.e. those that use multiple writing systems) [1,2] that I would like to share with you.

As some of you may (or may not) know, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both the (Serbian) Cyrillic and (Gaj's) Latin alphabets. This is absolutely fine for storing information in any modern knowledge base, but it can often be a major obstacle for information retrieval.

For instance, most Serbs rely on the Latin alphabet for colloquial communication/interaction on the Web. That means a large portion of the information is (and is often expected to be) encoded in Latin-2. Yet most of the information on the Serbian Wikipedia is written in Serbian Cyrillic (the alphabets are considered equal in Serbian, so the choice is only a matter of preference). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. the reverse) on the fly, at retrieval time (Wikipedia does it), many attempts at information extraction are doomed to fail. Imagine searching for “parser” when that particular string in your knowledge store exists only in Cyrillic, as “парсер”.
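
To make the romanization step concrete, here is a minimal Python sketch of a character-by-character conversion from Serbian Cyrillic to Gaj's Latin. The names CYR_TO_LAT and romanize are made up for illustration and do not come from any existing library. The sketch also assumes that title-casing the digraphs (Љ as "Lj") is good enough, which is not true for text written entirely in capitals.

```python
# Illustrative sketch only: CYR_TO_LAT and romanize are hypothetical names.
# The table covers the 30 lowercase letters of the Serbian Cyrillic alphabet.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add uppercase letters. Digraphs are title-cased ("Љ" becomes "Lj"),
# a simplification that is wrong inside all-caps words.
CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(CYR_TO_LAT.items())}
)


def romanize(text: str) -> str:
    """Replace Serbian Cyrillic letters with Latin ones; keep everything else."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(romanize("парсер"))
```

Romanization is the easy direction, because every Cyrillic letter maps to exactly one Latin letter or digraph. The reverse direction has to decide whether a Latin "nj" stands for њ or for н followed by ј, which a plain lookup table cannot do.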

This is not much of a problem when working with local databases. However, on the Linked Data Web, where, most of the time, you don’t know which alphabet the information you need is stored in, it is a huge pitfall. It directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.
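
As an illustration of the label-based querying case, the following Python sketch expands a Latin keyword into both scripts on the client side and builds a SPARQL 1.1 query that accepts either form. This is only a workaround, not the global mechanism I am asking about. The names cyrillize and label_query are hypothetical, the greedy digraph matching is a known simplification (it turns the "nj" in "injekcija" into њ, although the correct spelling is инјекција), and matching on rdfs:label with STR() is an assumption about how a given dataset stores its labels.

```python
# Illustrative sketch only: all names below are hypothetical.
LAT_TO_CYR = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д",
    "đ": "ђ", "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и",
    "j": "ј", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "š": "ш", "t": "т", "u": "у",
    "v": "в", "z": "з", "ž": "ж",
}
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}


def cyrillize(text: str) -> str:
    """Convert Gaj's Latin to Serbian Cyrillic, matching digraphs greedily."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2].lower()
        if pair in DIGRAPHS:
            cyr = DIGRAPHS[pair]
            step = 2
        else:
            # Characters outside the table (digits, spaces, q, w, ...) pass through.
            cyr = LAT_TO_CYR.get(text[i].lower(), text[i])
            step = 1
        out.append(cyr.upper() if text[i].isupper() else cyr)
        i += step
    return "".join(out)


def sparql_string(value: str) -> str:
    """Quote a Python string as a SPARQL string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def label_query(keyword: str) -> str:
    """Build a query matching the keyword as a label in either script."""
    variants = sorted({keyword, cyrillize(keyword)})
    options = ", ".join(sparql_string(v) for v in variants)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?label WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        f"  FILTER (STR(?label) IN ({options}))\n"
        "}"
    )


if __name__ == "__main__":
    print(label_query("parser"))
```

Comparing STR(?label) rather than the literal itself sidesteps the question of which language tag (if any) a dataset attaches to labels in each script. Every client still has to carry its own transliteration table, though, which is exactly the duplication a shared mechanism would avoid.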

After noticing this problem while working on the Serbian version of DBpedia, my original idea was to improve the DBpedia Extraction Framework. Now that I know the actual scale and possible repercussions of the problem (for example, Hindustani alone, which is also digraphic, is spoken by some 500 million people worldwide), I see it needs to be addressed on an entirely different level.

Therefore, I’m wondering what it would take to bring this to the right people and, ultimately, implement a global (i.e. SPARQL-level) mechanism for dealing with digraphia. I’m not sure whether this is the right place to start, but if not, then, hopefully, some of you will point me in the right direction. Thanks in advance!

Best,

Uroš Milošević

------------------------------

Institute Mihajlo Pupin

15 Volgina, 11060 Belgrade, Serbia

Phone: +381 65 3220223, Fax: +381 11 6776583

[1] http://en.wikipedia.org/wiki/Digraphia

[2] http://www.omniglot.com/language/articles/digraphia/

Received on Wednesday, 20 August 2014 07:02:09 UTC