Re: Digraphia on the Linked Data Web

 

Hi John,

Thanks for looking into the issue, as well as for all your comments and
suggestions.

Regarding the language tag solution you mentioned - the IANA language
subtag registry [1] lists both sr-Cyrl and sr-Latn as "redundant" (not
sure what this means, exactly, to be honest), and I can imagine that
establishing the correct use of such tags as a best practice throughout
the Linked Data Web would be quite a challenge. Moreover, it assumes
that we have direct control over all those data sources, which,
unfortunately, is not the case: there's already so much information out
there that relies on the "simple" form of the tag, i.e. "sr" (or
similar, for other digraphic languages).

Now, the real problem here is not the SPARQL or language experts, who
are aware of the issue and know exactly what to look for, nor is it
the case of querying a dataset encoded in a single alphabet/language.
The problem is best illustrated in the case of "common" (i.e. Internet)
users, who are presented with nothing but a search box that's supposed
to query a large multilingual graph, or multiple graphs/datasets.
Another example is presenting a user with an annotated piece of text
(e.g. provided through Named Entity Recognition), then matching those
words and phrases against a *custom* list of knowledge sources (e.g.
SPARQL endpoints; a scenario which I, actually, encountered just
recently).
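
To make the scenario a bit more concrete: since a bare "sr" tag says
nothing about the script, a client would first have to infer the script
from the characters themselves. Below is a minimal Python sketch of that
step, assuming only Cyrillic and Latin need to be told apart. The names
detect_script and refine_tag are hypothetical helpers of mine, not part
of any existing library or endpoint.

```python
import unicodedata


def detect_script(text):
    """Return 'Cyrl', 'Latn', 'Mixed', or None if there are no such letters."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        # Unicode character names start with the script name,
        # e.g. "CYRILLIC SMALL LETTER PE" or "LATIN SMALL LETTER P".
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrl")
        elif name.startswith("LATIN"):
            scripts.add("Latn")
    if len(scripts) == 1:
        return scripts.pop()
    return "Mixed" if scripts else None


def refine_tag(lang, text):
    """Append a script subtag to a bare 'sr' tag when the script is unambiguous."""
    script = detect_script(text)
    if lang.lower() == "sr" and script in ("Cyrl", "Latn"):
        return f"sr-{script}"
    return lang  # leave every other tag untouched
```

With this, refine_tag("sr", "parser") would give "sr-Latn", while a
literal tagged with any other language is left alone. It only tells you
which script you are looking at; it does not solve the matching problem
itself.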

Finally, I think you're confusing transliteration with transcription.
Unlike transcription, "transliteration is not concerned with
representing the phonemics of the original: it only strives to represent
the characters accurately" [2]. In Serbian, transliteration works in
accordance with a very clear set of rules, whereas transcription is,
well, open to interpretation. That's why there would be no real
"computational cost", as there should be only one possible
transliteration output (for Serbian, at least). So, transliterating
"John" from your example would result in "Јохн". Still, the general
idea, i.e. the filter function you proposed, is exactly what I believe
Linked Data needs. Also, although transliteration is possible for any
language, I think only the languages that are officially digraphic are
in need of such a mechanism (Serbian is the only European language
that's officially digraphic, meaning you'll find entire documents,
books, websites etc. in either of the two scripts).
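
To show how mechanical this is, here is a rough Python sketch of the
letter-for-letter mapping between Serbian Cyrillic and Gaj's Latin. It
is only an illustration under a few assumptions: the function names
romanize and cyrillize are mine, the Latin digraphs lj, nj and dž are
matched greedily, and the single-code-point Unicode ligatures for those
digraphs are not handled.

```python
# Serbian Cyrillic -> Gaj's Latin, lowercase letters only;
# capitals are derived from these entries at lookup time.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def romanize(text):
    """Cyrillic -> Latin; characters outside the alphabet pass through."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)
        else:
            # "Љ" becomes "Lj": only the first Latin letter is capitalized.
            out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


def cyrillize(text):
    """Latin -> Cyrillic; tries the digraphs (lj, nj, dž) before single letters."""
    out = []
    i = 0
    while i < len(text):
        for width in (2, 1):
            chunk = text[i:i + width]
            cyr = LAT_TO_CYR.get(chunk.lower())
            if cyr is not None and len(chunk) == width:
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += width
                break
        else:
            # Not a Serbian Latin letter (e.g. q, w, x, y, digits): keep as-is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Here cyrillize("John") gives "Јохн", and romanize("парсер") gives
"parser". One caveat on the Latin-to-Cyrillic direction: the greedy
digraph rule is a simplification, since in a handful of words (e.g.
"injekcija", "nadživeti") the n+j or d+ž are two separate letters, so a
real implementation would likely need a small exception list.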

I'm really willing to look into the issue and even provide that first
implementation of the transliteration mechanism (for Virtuoso, for
instance), but I wouldn't want to be alone in this. That's why I was
hoping to spark some community discussion, engagement and support.
Thanks, again! 

Best,
Uroš 

[1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
[2] http://en.wikipedia.org/wiki/Transliteration 

On 10.9.2014 11:21, John P. McCrae wrote: 

> Hi Uros,
> 
> Some thoughts on the issue... 
> Digraphia is an interesting challenge, but the solutions mostly boil down to treating the two scripts as two separate languages. In this case RDF works fine, assuming you tag the literals correctly as sr (or sr-Cyrl) for Serbian in Cyrillic and sr-Latn for Serbian in Latin; you can, and generally do, omit the script tag for the official script, which in Serbia is Cyrillic. The problem of querying is then no harder than retrieving a literal by keywords in English and French.
> 
> Transliteration, while deployed in many existing web search systems, as you noted, still has several challenges, namely computational cost*, accuracy and availability for all languages. As such, it seems unlikely that it could be built into SPARQL systems in general. It is, however, quite possible that it could be introduced to some systems as an extra function, e.g.,
> 
> SELECT ?person { ?person foaf:name ?name . FILTER transliteration(?name, "John", "Latn")
> }
> 
> Could return: 
> <> foaf:name "Джон"@sr
> 
> It could be an interesting project to implement such a thing as a specific system. I suspect such a function could not be standardized by the SPARQL WG, but if there were a determined group of people in BPM-LOD willing to provide reference implementations, we could in the context of the group attempt to provide an advisory on the implementation of such a function.
> 
> Regards,
> John McCrae
> 
> * By computational cost I refer not so much to the cost of transliteration, but the generation of multiple possible transliteration candidates that would need to be checked in the database. For example, in Greek the letters Eta, Iota and Upsilon are all mapped to the Latin letter "i" in modern Greek transliteration.
> 
> On Tue, Sep 2, 2014 at 3:40 PM, Uroš Milošević <uros.milosevic@pupin.rs> wrote:
> 
>> Hi all,
>> 
>> Perhaps now, with the summer holiday season officially over, my message will get through to the right people. :) I don't mind trying elsewhere, I just thought that a W3C community group dealing with "Best Practices for Multilingual Linked Open Data" would be a good place to start.
>> Thanks, again.
>> 
>> Best,
>> Uroš Milošević 
>> 
>> FROM: Uroš Milošević [mailto:uros.milosevic@pupin.rs] 
>> SENT: Tuesday, August 19, 2014 3:50 PM
>> TO: public-bpmlod@w3.org
>> SUBJECT: Digraphia on the Linked Data Web 
>> 
>> Hi all,
>> 
>> Summing up my experiences after three years of work on LOD2 (EU FP7 project), and some time spent with the DBpedia extraction framework, I've come to some conclusions related to Linked Data and digraphic languages (i.e. those that use multiple writing systems) [1,2] I would like to share with you.
>> 
>> As some of you may (or may not) know, Serbian, unlike any other language in Europe, is digraphic in nature, officially supporting both (Serbian) Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing information in any modern knowledge base, but can often be a major obstacle for information retrieval.
>> 
>> For instance, most Serbs rely on the Latin alphabet for colloquial communication/interaction on the Web. That means a large portion of the information is (and, often, expected to be) encoded in Latin-2. And, yet, most of the information on the Serbian Wikipedia is encoded in Serbian Cyrillic (the alphabets are considered equal in Serbian, so the choice is only a matter of preference). So, unless your software performs romanization (i.e. converts Cyrillic to Latin) or cyrillization (i.e. vice versa) on the fly, at retrieval time (Wikipedia does it), many attempts at information extraction will be doomed to fail (imagine searching for "parser" when that particular string in your knowledge store is encoded only in Cyrillic, as "парсер").
>> 
>> This is not that big of a problem when working with local databases. However, on the Linked Data Web, where, most of the time, you don't know the alphabet the information you need is stored in, this is a huge pitfall. This directly affects common tasks such as keyword search, label-based SPARQL querying, named entity recognition, etc.
>> 
>> After noticing this problem while working on the Serbian version of DBpedia, my original idea was to improve the DBpedia Extraction Framework. Now that I know the actual scale and possible repercussions of the problem (for example, Hindustani alone, which is also digraphic, is spoken by some 500 million people worldwide), I see it needs to be addressed on an entirely different level.
>> 
>> Therefore, I'm wondering what it would take to bring this to the right people and, ultimately, implement a global (i.e. SPARQL) mechanism for dealing with digraphia. I'm not sure if this is the right place to start, but if not, then, hopefully, some of you will point me in the right direction. Thanks in advance! 
>> 
>> Best, 
>> 
>> Uroš Milošević 
>> 
>> ------------------------------ 
>> 
>> Institute Mihajlo Pupin 
>> 
>> 15 Volgina, 11060 Belgrade, Serbia 
>> 
>> Phone: +381 65 3220223, Fax: +381 11 6776583 
>> 
>> [1] http://en.wikipedia.org/wiki/Digraphia 
>> 
>> [2] http://www.omniglot.com/language/articles/digraphia/ 
>> 
>> -------------------------
>> 
>> This email is free from viruses and malware because avast! Antivirus protection is active.

Received on Wednesday, 10 September 2014 10:56:44 UTC