
Re: Digraphia on the Linked Data Web

From: John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>
Date: Wed, 10 Sep 2014 11:21:42 +0200
Message-ID: <CAC5njqqNGH_UAJ8AeheMeKYYC-XoV9x26NGpTvympjCikMbAkg@mail.gmail.com>
To: Uroš Milošević <uros.milosevic@pupin.rs>
Cc: public-bpmlod@w3.org
Hi Uros,

Some thoughts on the issue...

Digraphia is an interesting challenge, but the solutions mostly boil down
to treating the two scripts as two separate languages. In that case RDF
works fine, assuming you tag the literals correctly: sr (or sr-Cyrl) for
Serbian in Cyrillic and sr-Latn for Serbian in Latin. You can, and
generally do, omit the script subtag for the official script, which in
Serbia is Cyrillic. The problem of querying is then no harder than
retrieving a literal by keywords in English and French.
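To make the "two separate languages" idea concrete, here is a minimal
sketch in plain Python (no RDF library) of how script-tagged literals
behave at query time. The lang_matches function mimics SPARQL's
langMatches() basic filtering; the labels are hypothetical examples:

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Basic filtering as in langMatches(): a range matches a tag if it
    equals the tag, or is a prefix of it at a "-" boundary."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")

# Hypothetical labels for one resource, tagged as suggested above:
labels = [("Београд", "sr"), ("Beograd", "sr-Latn")]

# Select only Latin-script labels, as a SPARQL FILTER would:
latin = [text for text, tag in labels if lang_matches(tag, "sr-Latn")]

# The range "sr" matches both tags, so it selects all Serbian labels
# regardless of script:
serbian = [text for text, tag in labels if lang_matches(tag, "sr")]
```

Note the asymmetry: filtering on "sr-Latn" excludes the bare sr-tagged
(Cyrillic) literal, while filtering on "sr" returns both.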

Transliteration, while deployed in many existing web search systems as
you noted, still faces several challenges, namely computational cost*,
accuracy and availability for all languages. As such it seems unlikely
that it could be built into SPARQL systems in general. It is, however,
quite possible that it could be introduced to some systems as an extra
function, e.g.,

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person {
  ?person foaf:name ?name .
  FILTER transliteration(?name, "John", "Latn")
}

could return:

<> foaf:name "Џон"@sr
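As a rough illustration of what such a filter function might do
internally, here is a naive Python sketch for the Serbian case. The
function name and semantics are my own assumption, not an existing API;
the per-character map covers the 30 lowercase Serbian Cyrillic letters
(real code would also handle case, normalization and other scripts):

```python
# Naive romanization table: Serbian Cyrillic -> Gaj's Latin alphabet.
# Each Cyrillic letter is a single character, so a per-character dict
# suffices (љ, њ, џ map to the Latin digraphs lj, nj, dž).
SR_CYRL_TO_LATN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def romanize(text: str) -> str:
    """Map Serbian Cyrillic characters to Latin, passing others through."""
    return "".join(SR_CYRL_TO_LATN.get(ch, ch) for ch in text.lower())

def transliteration_match(literal: str, query: str) -> bool:
    """True if the stored literal, once romanized, equals the romanized
    query (a hypothetical stand-in for the FILTER function above)."""
    return romanize(literal) == romanize(query)
```

With this sketch, transliteration_match("парсер", "parser") holds, so a
Latin-script query could retrieve a Cyrillic-only literal. Note that
"Џон" romanizes to "Džon" rather than "John", which is exactly the
accuracy problem mentioned above: romanization is not the same as
recovering the original foreign spelling.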

It could be an interesting project to implement such a thing in a
specific system. I suspect such a function could not be standardized by
the SPARQL WG, but if there were a determined group of people in BPM-LOD
willing to provide reference implementations, we could, in the context
of the group, attempt to provide an advisory on the implementation of
such a function.

Regards,
John McCrae

* By computational cost I refer not so much to the cost of
transliteration itself as to the generation of multiple possible
transliteration candidates that would need to be checked against the
database. For example, in modern Greek transliteration the letters Eta,
Iota and Upsilon are all mapped to the Latin letter "i".
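A small sketch of that candidate explosion, going from a Latin query
back to possible Greek spellings. The reverse map below is deliberately
incomplete and illustrative only; the point is that the number of
candidate strings to check grows multiplicatively with each ambiguous
letter:

```python
from itertools import product

# Hypothetical partial reverse map: each Latin letter maps to every
# Greek letter that romanizes to it.
REVERSE = {
    "i": ["η", "ι", "υ"],  # eta, iota and upsilon all romanize to "i"
    "o": ["ο", "ω"],       # omicron and omega both romanize to "o"
}

def candidates(query: str) -> list:
    """All Greek spellings consistent with the Latin query under REVERSE;
    unmapped characters are kept as-is."""
    options = [REVERSE.get(ch, [ch]) for ch in query]
    return ["".join(combo) for combo in product(*options)]
```

A two-letter query like "ii" already yields 3 × 3 = 9 candidate strings,
each of which would have to be looked up in the database, which is why
doing this inside a generic SPARQL engine would be costly.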



On Tue, Sep 2, 2014 at 3:40 PM, Uroš Milošević <uros.milosevic@pupin.rs>
wrote:

> Hi all,
>
> Perhaps now, with the summer holiday season officially over, my message
> will get through to the right people. :) I don’t mind trying elsewhere, I
> just thought that a W3C community group dealing with “Best Practices for
> Multilingual Linked Open Data” would be a good place to start.
> Thanks, again.
>
> Best,
> Uroš Milošević
>
>
>
> *From:* Uroš Milošević [mailto:uros.milosevic@pupin.rs]
> *Sent:* Tuesday, August 19, 2014 3:50 PM
> *To:* public-bpmlod@w3.org
> *Subject:* Digraphia on the Linked Data Web
>
>
>
> Hi all,
>
> Summing up my experiences after three years of work on LOD2 (an EU FP7
> project), and some time spent with the DBpedia extraction framework, I’ve
> come to some conclusions related to Linked Data and digraphic languages
> (i.e. those that use multiple writing systems) [1,2] that I would like to
> share with you.
>
> As some of you may (or may not) know, Serbian, unlike any other language
> in Europe, is digraphic in nature, officially supporting both (Serbian)
> Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing
> information in any modern knowledge base, but can often be a major obstacle
> for information retrieval.
>
> For instance, most Serbs rely on the Latin alphabet for colloquial
> communication/interaction on the Web. That means a large portion of the
> information is (and, often, expected to be) encoded in Latin-2. And, yet,
> most of the information on the Serbian Wikipedia is encoded in Serbian
> Cyrillic (the alphabets are considered equal in Serbian, so the choice is
> only a matter of preference). So, unless your software performs
> romanization (i.e. converts Cyrillic to Latin) or cyrillization (the
> reverse) on the fly at retrieval time, as Wikipedia does, many attempts
> at information extraction are doomed to fail (imagine searching for
> “parser” when that particular string in your knowledge store is encoded
> only in Cyrillic, as “парсер”).
>
> This is not that big of a problem when working with local databases.
> However, on the Linked Data Web, where, most of the time, you don’t know
> the alphabet the information you need is stored in, this is a huge pitfall.
> This directly affects common tasks such as keyword search, label-based
> SPARQL querying, named entity recognition, etc.
>
> After noticing this problem while working on the Serbian version of
> DBpedia, my original idea was to improve the DBpedia Extraction Framework.
> Now that I know the actual scale and possible repercussions of the problem
> (for example, Hindustani alone, which is also digraphic, is spoken by some
> 500 million people worldwide), I see it needs to be addressed on an
> entirely different level.
>
> Therefore, I’m wondering what it would take to bring this to the right
> people and, ultimately, implement a global (i.e. SPARQL-level) mechanism
> for dealing with digraphia. I’m not sure if this is the right place to
> start, but if not, then hopefully some of you will point me in the right
> direction. Thanks in advance!
>
>
>
> Best,
>
> Uroš Milošević
>
> ------------------------------
>
> Institute Mihajlo Pupin
>
> 15 Volgina, 11060 Belgrade, Serbia
>
> Phone: +381 65 3220223, Fax: +381 11 6776583
>
>
>
> [1] http://en.wikipedia.org/wiki/Digraphia
>
> [2] http://www.omniglot.com/language/articles/digraphia/
>
>
Received on Wednesday, 10 September 2014 09:22:15 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 19:45:37 UTC