Digraphia on the Linked Data Web

 

Hi all,

Summing up my experiences after three years of work on LOD2 (EU FP7
project [1]), and some time spent with the DBpedia extraction framework,
I've come to a conclusion on the subject of Linked Data and digraphic
languages (i.e. those that use multiple writing systems) [2,3] that I
find important and would like to share with all of you.

As some of you may (or may not) know, Serbian, unlike most other
languages in Europe, is digraphic in nature, officially supporting both
the (Serbian) Cyrillic and (Gaj's) Latin alphabets. This is absolutely
fine for storing information in any modern knowledge base, but can often
be a major obstacle for information retrieval.

For instance, most Serbs rely on the Latin alphabet for colloquial
communication/interaction on the Web. That means a large portion of the
information is (and is often expected to be) written in the Latin
script. And yet, most of the information on the Serbian Wikipedia is
written in Serbian Cyrillic (the alphabets are considered equal in
Serbian, so the choice is only a matter of preference). So, unless your
software performs romanization (i.e. converts Cyrillic to Latin) or
cyrillization (i.e. the reverse) on the fly, at retrieval time
(Wikipedia does it), many attempts at information retrieval will be
doomed to fail.

Example: Imagine searching for "parser" when that particular string in
your knowledge store is encoded only in Cyrillic, as "парсер".
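
To make the example above concrete, here is a minimal Python sketch of
script-insensitive matching. It romanizes both the query and the stored
labels before comparing them, since Cyrillic-to-Latin is unambiguous for
Serbian. The names romanize, script_key and find_labels are my own
illustrative helpers, not part of DBpedia or any existing library.

```python
# Serbian Cyrillic -> Gaj's Latin, lowercase letters only (30 letters).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def romanize(text: str) -> str:
    """Convert Serbian Cyrillic letters to Latin; leave everything else as is."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # not a Serbian Cyrillic letter
        else:
            # "Љ" becomes "Lj": only the first Latin letter is capitalized.
            out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


def script_key(text: str) -> str:
    """Comparison key that ignores both script and letter case."""
    return romanize(text).casefold()


def find_labels(query: str, labels: list[str]) -> list[str]:
    """Return the labels that contain the query in either script."""
    key = script_key(query)
    return [label for label in labels if key in script_key(label)]


if __name__ == "__main__":
    print(find_labels("parser", ["парсер", "компајлер", "Parser"]))
```

With this, a Latin query for "parser" also finds the Cyrillic label, and
a Cyrillic query finds Latin labels, because both sides are reduced to
the same Latin key.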

You see, this is not that big of a problem when working with local
databases. However, on the Linked Data Web, where, most of the time, you
don't know the alphabet the information you need is stored in, this is a
huge pitfall. 

Example: Imagine querying multiple (possibly multilingual) graphs whose
labels may be written in either script.

This directly affects common tasks such as keyword search, label-based
SPARQL querying, named entity recognition, etc.
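
For label-based SPARQL querying, one client-side workaround is to expand
the keyword into both scripts before sending the query. The Python
sketch below is only an illustration under my own assumptions: the
helper names (cyrillize, romanize, label_query) are hypothetical, the
query uses standard SPARQL 1.1 functions (LCASE, STR, IN), and the
Latin-to-Cyrillic step is a naive greedy conversion that gets words
wrong where "lj", "nj" or "dž" are two separate letters (e.g.
"injekcija", which should be "инјекција").

```python
# Gaj's Latin -> Serbian Cyrillic; the three digraphs are listed first.
LAT_TO_CYR = {
    "lj": "љ", "nj": "њ", "dž": "џ",
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}
CYR_TO_LAT = {cyr: lat for lat, cyr in LAT_TO_CYR.items()}


def cyrillize(text: str) -> str:
    """Greedy Latin -> Cyrillic conversion that tries digraphs first."""
    out, i = [], 0
    while i < len(text):
        for width in (2, 1):
            chunk = text[i:i + width]
            cyr = LAT_TO_CYR.get(chunk.lower())
            if cyr is not None:
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += len(chunk)
                break
        else:
            out.append(text[i])  # not a Serbian Latin letter
            i += 1
    return "".join(out)


def romanize(text: str) -> str:
    """Cyrillic -> Latin conversion, one letter at a time (unambiguous)."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)
        else:
            out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


def label_query(keyword: str) -> str:
    """Build a SPARQL query matching rdfs:label in either script."""
    variants = sorted({romanize(keyword).lower(), cyrillize(keyword).lower()})
    literals = ", ".join(
        '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for v in variants
    )
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?label WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        f"  FILTER (LCASE(STR(?label)) IN ({literals}))\n"
        "}"
    )


if __name__ == "__main__":
    print(label_query("parser"))
```

This pushes the burden onto every client and scans all labels, which is
exactly why I think the problem deserves a solution at a lower level.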

After noticing this problem while working on the Serbian version of
DBpedia, my original idea was to improve the DBpedia Extraction
Framework. Now that I know the actual scale and possible repercussions
of the problem (for example, Hindustani alone, which is also digraphic,
is spoken by some 500 million (!) people worldwide), I see it needs to
be addressed on an entirely different level.

Therefore, I'm wondering what it would take to bring this to the right
people and, ultimately, implement a global (i.e. SPARQL-level) mechanism
for dealing with digraphia. I'm not sure if this is the right place to
start, but if not, then, hopefully, some of you will point me in the
right direction. Thanks in advance!

Best,
Uroš Milošević
------------------------------
Institute Mihajlo Pupin
15 Volgina, 11060 Belgrade, Serbia
Phone: +381 11 6771398, Fax: +381 11 6776583 

[1] http://lod2.eu
[2] http://en.wikipedia.org/wiki/Digraphia
[3] http://www.omniglot.com/language/articles/digraphia/


Received on Thursday, 4 September 2014 14:48:13 UTC