- From: Jorge Gracia <jgracia@fi.upm.es>
- Date: Mon, 8 Sep 2014 13:16:11 +0200
- To: Uroš Milošević <uros.milosevic@pupin.rs>
- Cc: "public-bpmlod@w3.org" <public-bpmlod@w3.org>, public-ld4lt@w3.org
- Message-ID: <CANzuSaO1ERrzSBuLKTA+TF3MvLQeCimxbjYMRO_6eqxvMBYBGg@mail.gmail.com>
Dear Uroš,

Thanks for your email. I am forwarding it to the "Linked Data for Language Technologies" community group as well, since they are collecting use cases related to linguistics, LD, and content analytics.

Indeed, your use case (related to digraphia on the Web of data) seems a very interesting one and touches on knowledge recovery and knowledge representation issues. Of course, if from your experience you think that some best practices or recommendations can be identified, they could be treated in the BPMLOD group as well (we could introduce the topic in a future teleconference if you wish).

My first feeling (I have not studied the problem in detail, though) is that richer formats such as lemon [1] for representing lexical information could play an important role in representing/recovering digraphic data.

Regards,
Jorge

[1] http://lemon-model.net/

2014-09-02 15:40 GMT+02:00 Uroš Milošević <uros.milosevic@pupin.rs>:

> Hi all,
>
> Perhaps now, with the summer holiday season officially over, my message
> will get through to the right people. :) I don’t mind trying elsewhere, I
> just thought that a W3C community group dealing with “Best Practices for
> Multilingual Linked Open Data” would be a good place to start.
> Thanks, again.
>
> Best,
> Uroš Milošević
>
> *From:* Uroš Milošević [mailto:uros.milosevic@pupin.rs]
> *Sent:* Tuesday, August 19, 2014 3:50 PM
> *To:* public-bpmlod@w3.org
> *Subject:* Digraphia on the Linked Data Web
>
> Hi all,
>
> Summing up my experiences after three years of work on LOD2 (EU FP7
> project), and some time spent with the DBpedia extraction framework, I’ve
> come to some conclusions related to Linked Data and digraphic languages
> (i.e. those that use multiple writing systems) [1,2] I would like to share
> with you.
>
> As some of you may (or may not) know, Serbian, unlike any other language
> in Europe, is digraphic in nature, officially supporting both (Serbian)
> Cyrillic and (Gaj's) Latin alphabet.
> This is absolutely fine for storing
> information in any modern knowledge base, but can often be a major obstacle
> for information retrieval.
>
> For instance, most Serbs rely on the Latin alphabet for colloquial
> communication/interaction on the Web. That means a large portion of the
> information is (and, often, expected to be) encoded in Latin-2. And, yet,
> most of the information on the Serbian Wikipedia is encoded in Serbian
> Cyrillic (the alphabets are considered equal in Serbian, so the choice is
> only a matter of preference). So, unless your software performs
> romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e.
> vice-versa) on-the-fly, at retrieval time (Wikipedia does it), many
> attempts at information extraction will be doomed to fail (imagine
> searching for “parser” when that particular string in your knowledge store
> is encoded only in Cyrillic, as “парсер”).
>
> This is not that big of a problem when working with local databases.
> However, on the Linked Data Web, where, most of the time, you don’t know
> the alphabet the information you need is stored in, this is a huge pitfall.
> This directly affects common tasks such as keyword search, label-based
> SPARQL querying, named entity recognition, etc.
>
> After noticing this problem while working on the Serbian version of
> DBpedia, my original idea was to improve the DBpedia Extraction Framework.
> Now that I know the actual scale and possible repercussions of the problem
> (for example, Hindustani alone, which is also digraphic, is spoken by some
> 500 million people worldwide), I see it needs to be addressed on an
> entirely different level.
>
> Therefore, I’m wondering what it would take to take this to the right
> people and, ultimately, implement a global (i.e. SPARQL) mechanism for
> dealing with digraphia. I’m not sure if this is the right place to start,
> but if not, then, hopefully, some of you will point me in the right
> direction. Thanks in advance!
>
> Best,
>
> Uroš Milošević
>
> ------------------------------
>
> Institute Mihajlo Pupin
> 15 Volgina, 11060 Belgrade, Serbia
> Phone: +381 65 3220223, Fax: +381 11 6776583
>
> [1] http://en.wikipedia.org/wiki/Digraphia
> [2] http://www.omniglot.com/language/articles/digraphia/

--
Jorge Gracia, PhD
Ontology Engineering Group
Artificial Intelligence Department
Universidad Politécnica de Madrid
http://delicias.dia.fi.upm.es/~jgracia/
Received on Monday, 8 September 2014 11:17:03 UTC
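
The on-the-fly romanization described in the quoted message (converting Serbian Cyrillic to Gaj's Latin at retrieval time, so that "parser" can match "парсер") can be illustrated with a small Python sketch. It is not taken from Wikipedia, DBpedia, or any SPARQL implementation; the names `CYR_TO_LAT`, `romanize`, `search_key`, and `labels_match` are hypothetical, and mapping uppercase Љ, Њ, Џ to title-case "Lj", "Nj", "Dž" is a simplifying assumption.

```python
# Serbian Cyrillic -> Gaj's Latin, one entry per lowercase letter (30 letters).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Assumption: uppercase letters map to title-case Latin ("Љ" -> "Lj").
# All-caps words would need extra handling that is omitted here.
CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(CYR_TO_LAT.items())}
)


def romanize(text: str) -> str:
    """Replace each Serbian Cyrillic letter; leave all other characters as-is."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def search_key(text: str) -> str:
    """Script-insensitive, case-insensitive key for comparing strings."""
    return romanize(text).casefold()


def labels_match(query: str, label: str) -> bool:
    """True if the query occurs in the label, regardless of script."""
    return search_key(query) in search_key(label)


if __name__ == "__main__":
    print(romanize("парсер"))                # -> parser
    print(labels_match("parser", "парсер"))  # -> True
```

Romanization is the easier direction because each Cyrillic letter maps to a fixed Latin string. The reverse direction (cyrillization) is ambiguous for the digraphs: for example, "nj" in "injekcija" corresponds to the two letters "нј" rather than "њ", so a plain reverse table would not be enough.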