
Re: Digraphia on the Linked Data Web

From: John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>
Date: Wed, 10 Sep 2014 15:36:13 +0200
Message-ID: <CAC5njqpoU_is4vOY5yJ66r4C3XeeyOxgWOkKESBUwxqL5Ospfg@mail.gmail.com>
To: uros.milosevic@pupin.rs
Cc: public-bpmlod@w3.org
Hi,

"Redundant" means that there used to be two primary language subtags
"sr-Latn" and "sr-Cyrl" both of which are now replaced with the combination
of the language subtag "sr" and the script subtag "Latn" or "Cyrl" to give
"sr-Latn" and "sr-Cyrl"! (see 2.7 of RFC 4645 for an explanation)

You are right that transliteration is not the same as transcription,
although I will note that a clean separation does not hold in general.
In Chinese, for example, every traditional character (as used in Hong
Kong and Taiwan) maps to a single simplified character (as used in the
PRC), but the reverse is not true. As such, looking into both true
transliteration and transcription may be interesting, and drawing a
boundary between the two may be challenging.
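To make the many-to-one point concrete, here is a minimal sketch (a tiny
illustrative table, not a real conversion library):

```python
# Sketch of the traditional -> simplified mapping described above: the
# forward direction is a function, the reverse is one-to-many.
TRAD_TO_SIMP = {
    "發": "发",  # fa1, "to emit/develop"
    "髮": "发",  # fa4, "hair" -- collides with 發 in simplified script
    "們": "们",
    "見": "见",
}

def to_simplified(text: str) -> str:
    """Deterministic: every traditional character has one simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

def traditional_candidates(ch: str) -> set:
    """Reverse lookup: all traditional characters behind one simplified one."""
    return {t for t, s in TRAD_TO_SIMP.items() if s == ch}

print(to_simplified("髮"))           # 发
print(traditional_candidates("发"))  # both 發 and 髮 -- ambiguous
```

The reverse lookup is exactly where the "multiple candidates" cost comes
from: a simplified-script query may need to be expanded into several
traditional-script strings before matching.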

Regards,
John

On Wed, Sep 10, 2014 at 12:53 PM, Uros Milosevic <uros.milosevic@pupin.rs>
wrote:

>  Hi John,
>
> Thanks for looking into the issue, as well as for all your comments and
> suggestions.
>
> Regarding the language tag solution you mentioned - the IANA language
> subtag registry [1] says the two are "redundant" (not sure what this means,
> exactly, to be honest), and I can imagine that establishing the correct
> use of such tags as a best practice throughout the Linked Data Web would be
> quite a challenge. Moreover, that's also under the assumption that we have
> direct control over all those data sources, which is, unfortunately, not
> the case, and there's already so much information out there that relies on
> the "simple" form of the tag, i.e. "sr" (or similar, for other digraphic
> languages).
>
> Now, the real problem here is not the SPARQL or language experts, who are
> aware of the problem and know exactly what to look for, nor is it the case
> of querying a dataset encoded in a single alphabet/language. The problem is
> best illustrated in the case of "common" (i.e. Internet) users, who are
> presented with nothing but a search box that's supposed to query a large
> multilingual graph, or multiple graphs/datasets. Another example is
> presenting a user with an annotated piece of text (e.g. provided through
> Named Entity Recognition), then matching those words and phrases against a
> *custom* list of knowledge sources (e.g. SPARQL endpoints; a scenario which
> I actually encountered just recently).
>
> Finally, I think you're confusing transliteration with transcription.
> Unlike transcription, "transliteration is not concerned with representing
> the phonemics of the original: it only strives to represent the characters
> accurately" [2]. In Serbian, transliteration works in accordance with a
> very clear set of rules, whereas transcription is, well, open to
> interpretation. That's why there would be no real "computational cost", as
> there should be only one possible transliteration output (for Serbian, at
> least). So, transliterating "John" from your example would result in
> "Јохн", but the general idea, i.e. the filter function you proposed is
> exactly what I believe Linked Data needs. Also, although transliteration is
> possible for any language, I think only the languages that are officially
> digraphic are in need of such a mechanism (Serbian is the only European
> language that's officially digraphic, meaning you'll find entire documents,
> books, websites etc. in any of the two scripts).
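The deterministic, rule-based mapping described above can be sketched as
follows (a hypothetical helper with an abbreviated lowercase table and
simple case handling, not a production romanizer):

```python
# Sketch of Serbian Cyrillic -> Latin romanization. Each Cyrillic letter
# has exactly one Latin counterpart, so the output is unique.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def romanize(text: str) -> str:
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower(), ch)
        # Preserve the capitalisation of the original character.
        out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)

print(romanize("парсер"))  # parser
print(romanize("Јохн"))    # John
```

Because the table is one-to-one per character, there is only one possible
output, which is why there is no candidate-generation cost for Serbian.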
>
> I'm really willing to look into the issue and even provide a first
> implementation of the transliteration mechanism (for Virtuoso, for
> instance), but I wouldn't want to be alone in this. That's why I was hoping
> to spark some community discussion, engagement and support. Thanks, again!
>
> Best,
> Uroš
>
>
>
> [1]
> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>
> [2] http://en.wikipedia.org/wiki/Transliteration
>
>
>
> On 10.9.2014 11:21, John P. McCrae wrote:
>
>      Hi Uros,
>
> Some thoughts on the issue...
>
> Digraphia is an interesting challenge, but the solutions mostly boil down
> to treating the two scripts as two separate languages. In this case RDF
> works fine, assuming you tag the literals correctly: sr (or sr-Cyrl) for
> Serbian in Cyrillic and sr-Latn for Serbian in Latin. You can, and
> generally do, omit the script subtag for the official script, which for
> Serbian is Cyrillic. The problem of querying is then no harder than
> retrieving a literal by keywords in English and French.
>
> Transliteration, while deployed in many existing web search systems as you
> noted, still faces several challenges, namely computational cost*, accuracy
> and availability for all languages. As such, it seems unlikely that it
> would be something that could be built into SPARQL systems in general. It
> is, however, quite possible that it could be introduced to some systems as
> an extra function, e.g.,
>
> SELECT ?person {
>   ?person foaf:name ?name .
>   FILTER transliteration(?name, "John", "Latn")
> }
>
> Could return:
>
> <> foaf:name "Джон"@sr
>
> It could be an interesting project to implement such a thing in a specific
> system. I suspect such a function could not be standardized by the SPARQL
> WG, but if there were a determined group of people in BPM-LOD willing to
> provide reference implementations, we could, in the context of the group,
> attempt to provide an advisory on the implementation of such a function.
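Before touching a SPARQL engine, such a filter could be prototyped
client-side over query results. A minimal sketch, with hypothetical helper
names and a deliberately abbreviated romanization table:

```python
# Client-side sketch of the proposed transliteration filter: fold both the
# literal and the query keyword to lowercase Latin, then compare.
# The mapping is abbreviated for illustration only.
CYR_TO_LAT = {"ј": "j", "о": "o", "х": "h", "н": "n", "п": "p",
              "а": "a", "р": "r", "с": "s", "е": "e"}

def fold_to_latin(text: str) -> str:
    return "".join(CYR_TO_LAT.get(ch.lower(), ch.lower()) for ch in text)

def transliteration_filter(literal: str, query: str) -> bool:
    """Mimics the proposed FILTER transliteration(?name, "John", "Latn")."""
    return fold_to_latin(literal) == fold_to_latin(query)

print(transliteration_filter("Јохн", "John"))      # True
print(transliteration_filter("парсер", "parser"))  # True
```

A real implementation would hook this predicate into the engine's
extensible filter-function mechanism rather than post-process bindings.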
>
> Regards,
> John McCrae
>
> * By computational cost I refer not so much to the cost of a single
> transliteration, but to the generation of multiple possible transliteration
> candidates that would need to be checked in the database. For example, in
> Greek, the letters Eta, Iota and Upsilon are all mapped to the Latin letter
> "i" in modern Greek transliteration.
>
>
>
> On Tue, Sep 2, 2014 at 3:40 PM, Uroš Milošević <uros.milosevic@pupin.rs>
> wrote:
>
>>  Hi all,
>>
>> Perhaps now, with the summer holiday season officially over, my message
>> will get through to the right people. :) I don’t mind trying elsewhere, I
>> just thought that a W3C community group dealing with “Best Practices for
>> Multilingual Linked Open Data” would be a good place to start.
>> Thanks, again.
>>
>> Best,
>> Uroš Milošević
>>
>>
>>
>> *From:* Uroš Milošević [mailto:uros.milosevic@pupin.rs]
>> *Sent:* Tuesday, August 19, 2014 3:50 PM
>> *To:* public-bpmlod@w3.org
>> *Subject:* Digraphia on the Linked Data Web
>>
>>
>>
>> Hi all,
>>
>> Summing up my experiences after three years of work on LOD2 (an EU FP7
>> project), and some time spent with the DBpedia extraction framework, I’ve
>> come to some conclusions related to Linked Data and digraphic languages
>> (i.e. those that use multiple writing systems) [1,2] that I would like to
>> share with you.
>>
>> As some of you may (or may not) know, Serbian, unlike any other language
>> in Europe, is digraphic in nature, officially supporting both (Serbian)
>> Cyrillic and (Gaj's) Latin alphabet. This is absolutely fine for storing
>> information in any modern knowledge base, but can often be a major obstacle
>> for information retrieval.
>>
>> For instance, most Serbs rely on the Latin alphabet for colloquial
>> communication/interaction on the Web. That means a large portion of the
>> information is (and, often, expected to be) encoded in Latin-2. And, yet,
>> most of the information on the Serbian Wikipedia is encoded in Serbian
>> Cyrillic (the alphabets are considered equal in Serbian, so the choice is
>> only a matter of preference). So, unless your software performs
>> romanization (i.e. converts Cyrillic to Latin) or cyrillization (the
>> reverse) on the fly, at retrieval time (Wikipedia does this), many
>> attempts at information extraction will be doomed to fail (imagine
>> searching for “parser” when that particular string in your knowledge store
>> is encoded only in Cyrillic, as “парсер”).
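One way to perform that conversion at query time is greedy longest-match
cyrillization of the keyword, so that the Latin digraphs lj, nj and dž map
to single Cyrillic letters. A minimal sketch with an abbreviated,
hypothetical mapping:

```python
# Sketch of Latin -> Cyrillic query expansion for the "parser" example.
# Digraphs must be matched before single letters (longest match first).
LAT_TO_CYR = {"dž": "џ", "lj": "љ", "nj": "њ",
              "p": "п", "a": "а", "r": "р", "s": "с", "e": "е",
              "d": "д", "ž": "ж", "l": "л", "n": "н", "j": "ј"}

def cyrillize(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        two = text[i:i + 2]
        if two in LAT_TO_CYR:  # try the two-letter digraph first
            out.append(LAT_TO_CYR[two])
            i += 2
        else:
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(cyrillize("parser"))  # парсер
```

A search layer could then issue the query in both scripts, e.g. matching
"parser" as well as "парсер" against the store.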
>>
>> This is not that big of a problem when working with local databases.
>> However, on the Linked Data Web, where, most of the time, you don’t know
>> the alphabet the information you need is stored in, this is a huge pitfall.
>> This directly affects common tasks such as keyword search, label-based
>> SPARQL querying, named entity recognition, etc.
>>
>> After noticing this problem while working on the Serbian version of
>> DBpedia, my original idea was to improve the DBpedia Extraction Framework.
>> Now that I know the actual scale and possible repercussions of the problem
>> (for example, Hindustani alone, which is also digraphic, is spoken by some
>> 500 million people worldwide), I see it needs to be addressed on an
>> entirely different level.
>>
>> Therefore, I’m wondering what it would take to bring this to the right
>> people and, ultimately, implement a global (i.e. SPARQL) mechanism for
>> dealing with digraphia. I’m not sure if this is the right place to start,
>> but if not, then, hopefully, some of you will point me in the right
>> direction. Thanks in advance!
>>
>>
>>
>> Best,
>>
>> Uroš Milošević
>>
>> ------------------------------
>>
>> Institute Mihajlo Pupin
>>
>> 15 Volgina, 11060 Belgrade, Serbia
>>
>> Phone: +381 65 3220223, Fax: +381 11 6776583
>>
>>
>>
>> [1] http://en.wikipedia.org/wiki/Digraphia
>>
>> [2] http://www.omniglot.com/language/articles/digraphia/
>>
>>
>>
>
>



Received on Wednesday, 10 September 2014 13:36:47 UTC
