Re: [bpmlod] Guidelines for converting BabelNet as Linguistic Linked Data from dave.lewis@cs.tcd.ie on 2014-05-30 (public-ontolex@w3.org from May 2014)

From: <dave.lewis@cs.tcd.ie>
Date: Fri, 30 May 2014 15:37:56 +0100
To: "Philipp Cimiano" <cimiano@cit-ec.uni-bielefeld.de>
Cc: public-ontolex@w3.org
Message-ID: <304b7df2660847aa46d656a2d2956429.squirrel@webmail.scss.tcd.ie>
Hi Philipp,

I talked a bit more to Jorge about this after his talk at LREC, and it's
clearer to me now that there are two distinct use cases here:

Use Case 1) where two terms already exist in the system and you are
drawing the 'cross-lingual variation' link you mention between them. Here
a confidence score on the reified annotation indicates the confidence that
you have made that annotation correctly. Incidentally, this term is much
better in this case than 'translation' or 'translation variation', which
would clash with how those terms are understood in the translation
industry and defined in ISO DCR.

Use Case 2) where the target is produced by a process of translating the
source. Here the confidence score is some measure that the translation is
correct, typically in context, and is therefore sensitive to the process
that generated it. This would most commonly be recognised as a property of
the target language text, in part because the formats used in the industry
don't consistently reify the translation relationship. Certainly in ITS,
the MT Confidence score is defined as a property of the target string (I
did actually raise annotating the association instead myself, but the
translation tool experts there convinced me otherwise).

Use case (1) is amenable to a reified annotation approach as suggested by
Jorge, though perhaps this should be aligned with the model of the Open
Annotation group, rather than defining something completely new:
http://www.w3.org/community/openannotation/
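
As a rough sketch only of what such an alignment might look like in
Turtle: oa:Annotation and oa:hasTarget come from the Open Annotation
model, while ex:CrossLingualVariation, ex:confidence and the two sense
URIs are hypothetical names made up here for illustration, not anything
agreed in ontolex:

    @prefix oa:  <http://www.w3.org/ns/oa#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <http://example.org/> .

    # The annotation links two senses that already exist in the resource.
    # The confidence says how sure we are that the link itself is right.
    ex:variation1 a oa:Annotation, ex:CrossLingualVariation ;
        oa:hasTarget ex:sense_house_en, ex:sense_casa_es ;
        ex:confidence "0.9"^^xsd:decimal .

Neither sense is marked as source or target here, which seems to fit a
case where no translation process is implied.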

Use case (2) is, as you point out, more process oriented, and we've found
that the provenance-based pattern makes a lot of sense here, as I mentioned
earlier, because the usage of the translation is so sensitive to the
parameters of the process.
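
A minimal sketch of that pattern, adapted from the example quoted below:
prov:Entity, prov:Activity, prov:entity and prov:hadActivity are PROV-O
terms, the ontolexTrans: terms are the proposed specialisations from my
earlier mail, and the ontolexTrans: and ex: namespace IRIs are
placeholders I have assumed for illustration:

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix its:  <http://www.w3.org/2005/11/its/rdf#> .
    @prefix ontolexTrans: <http://example.org/ontolexTrans#> .  # placeholder
    @prefix ex:   <http://example.org/> .

    # The target entry carries the MT confidence, since the score
    # describes the text that the translation process produced.
    ex:34678es a prov:Entity ;
        ontolexTrans:wasTranslatedFrom ex:34678en ;
        its:mtConfidence "0.5" ;
        ontolexTrans:qualifiedTranslation [
            a ontolexTrans:Translation ;
            prov:entity ex:34678en ;
            prov:hadActivity ex:ExMachineTranslation
        ] .

    # Process parameters (engine, settings, agent) would hang off this.
    ex:ExMachineTranslation a prov:Activity .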

While use case (1) seems quite specific to lexicons (to me at least, but
are there other use cases?), use case (2) is more general and could be
applied to other types of LR, such as bitext or annotations that may have
been translated, or indeed the meta-share modelling which sparked this
thread.

We can see already from examples like BabelNet that we would need to
support both use cases, potentially in the same corpus. It would be
interesting to know how many such relationships within BabelNet are links
between existing terms and how many are fresh translations.

Given the above, perhaps the questions we need to ask are:
A) should we include Jorge's reified annotation solution for use case
(1) in ontolex, ideally not naming it 'translation'?

B) should we include the provenance-based approach for the 'translation'
process use case (2) in ontolex, or would that be better captured
elsewhere? This is in scope of the modelling underway in FALCON, so we
could host it there with some input from ontolex and ld4lt?

cheers,
Dave


> Hi Dave,
>
> thanks a lot for your input. Most of your comments concern Translation
> as viewed from the perspective of a process.
>
> So far, in the ontolex group we have regarded "translation" as a special
> case of "cross-lingual variation", abstracting from the process by which
> the actual translation was produced.
>
> So the reified relation "Translation" means rather that two Lexical
> Senses stand to each other in a relation of translation, independently
> of how this translation was obtained.
>
> We might rename "Translation" as "TranslationVariant" to make this
> clearer.
>
> On your example:
>
> ex:34678es a lemon:LexicalEntry;
>   a prov:Entity;
>   lemon:form [ lemon:writtenRep "casa"@es ];
>   ontolexTrans:wasTranslatedFrom ex:34678en;
>   its:mtConfidence "0.5";
>   ontolexTrans:qualifiedTranslation [
>      a ontolex:Translation;
>      prov:hadActivity ex:ExMachineTranslation;
>   ].
>
> I am not fully convinced here as this example attaches the confidence
> and other properties to the lexical entry. The confidence however should
> be attached to the relation of being a translation of each other IMHO
> rather than to the lexical entries / lexical senses.
>
> So we could certainly attach provenance information to the
> "TranslationVariant" object, but I would not add the prov. information
> to the lexical entries standing in the relation of being a translation
> of each other.
>
> In fact, the confidence is not a property of any lexical entry, it is
> the confidence in the fact that X is the (correct) translation of Y, so
> it should be attached to an object reifying this relation rather than to
> one of the lexical entries or lexical senses involved.
>
> So yes, we could recommend using the Prov-O vocabulary to make the
> provenance information of a "TranslationVariant" explicit.
>
> Does that make sense?
>
> Regards,
>
> Philipp.
>
> Am 27.05.14 03:23, schrieb Dave Lewis:
>> Hi Jorge, guys,
>> Thanks for these pointers, I had not been following this as closely as
>> I should, so I have some comments below that are relevant to both the
>> meta-share RDF model and your translation model in ontolex, so I've
>> copied them also.
>>
>> You are quite correct to reify the translation relationship. Deriving
>> an authoritative translation is rarely straightforward and may involve
>> different inputs at different times from different sources, e.g.
>> BabelNet has professionally curated translations, translations from
>> Wikipedia and MT outputs.
>>
>> So in many cases you are dealing with the current status of a
>> provisional translation rather than a 'final' authoritative one.
>>
>> Also, there is some potential confusion in naming the reifying class
>> 'Translation' since in many situations this refers to the string in
>> the target language rather than the entity linking a target language
>> string to a source language string.
>>
>> In [1] we proposed an approach to handle this by specialising from the
>> W3C Provenance vocabulary [2].
>>
>> This means treating the source and targets of translation
>> (LexicalEntry, LexicalSense) as prov:Entity classes so that their
>> provenance can be tracked using other classes and properties from that
>> model.
>>
>> Specifically we propose specialising the provenance property:
>> http://www.w3.org/TR/prov-o/#wasDerivedFrom
>>
>> i.e.
>> ontolexTrans:wasTranslatedFrom  rdfs:subPropertyOf
>>        prov:wasDerivedFrom.
>>
>> PROV-O also enables reification by defining a class:
>> http://www.w3.org/TR/prov-o/#Derivation
>>
>> which is in the range of:
>> http://www.w3.org/TR/prov-o/#qualifiedDerivation
>>
>> So similarly we can define
>> ontolexTrans:Translation rdfs:subClassOf prov:Derivation.
>>
>> and
>>
>> ontolexTrans:qualifiedTranslation rdfs:subPropertyOf
>>       prov:qualifiedDerivation.
>>
>> To flesh this out with an example:
>>
>> ex:34678en a lemon:LexicalEntry;
>>  a prov:Entity;
>>  lemon:form [ lemon:writtenRep "house"@en ] .
>>
>> ex:34678es a lemon:LexicalEntry;
>>  a prov:Entity;
>>  lemon:form [ lemon:writtenRep "casa"@es ];
>>  ontolexTrans:wasTranslatedFrom ex:34678en;
>>  its:mtConfidence "0.5";
>>  ontolexTrans:qualifiedTranslation [
>>     a ontolex:Translation;
>>     prov:hadActivity ex:ExMachineTranslation;
>>  ].
>>
>> Note in the above the its:mtConfidence is more accurately used to
>> annotate the LexicalEntry rather than the Translation, as it is a
>> property of the text resulting from the translation, rather than a
>> reification of the translation.
>>
>> Thoughts welcome.
>>
>> cheers,
>> Dave
>>
>>
>>
>>
>>
>>
>>
>> [1] http://www.lrec-conf.org/proceedings/lrec2012/pdf/636_Paper.pdf
>> [2] http://www.w3.org/TR/prov-o/
>> On 23/05/2014 14:48, Jorge Gracia wrote:
>>> Dear Tiziano, Roberto
>>>
>>> You could also consider using the lemon translation module to
>>> represent explicit translations as linked data. This is currently
>>> under development in the ONTOLEX group but there is a lemon-based
>>> version already available, that I will present at LREC next week [1].
>>> The idea is reifying the translation relation so you can attach
>>> additional information to it (source, target, confidence, provenance,
>>> etc.) [2]
>>>
>>> Regards,
>>>
>>> Jorge
>>>
>>> [1]
>>> http://ra.cps.unizar.es:8080/PUBLICATIONS/attachedFiles/document/LREC2014_translations_V11.pdf
>>> [2] http://purl.org/net/translation#
>>>
>>>
>>>
>>>
>>> 2014-05-23 11:58 GMT+02:00 Dave Lewis <dave.lewis@cs.tcd.ie
>>> <mailto:dave.lewis@cs.tcd.ie>>:
>>>
>>>     Roberto, Tiziano,
>>>     Thanks for that.
>>>
>>>     Have you considered already how you might allow third parties to
>>>     QA and perhaps correct those translations? That is, some sort of
>>>     process by which proposed MT translations between senses can be
>>>     promoted to more authoritative, human checked translations, and
>>>     marked as such?
>>>
>>>     The ITS text analytics and/or terminology data categories, which
>>>     also have confidence scores could be useful for annotating such a
>>>     process:
>>>     http://www.w3.org/TR/its20/#textanalysis
>>>     http://www.w3.org/TR/its20/#terminology
>>>
>>>     To enable such checking and progression in the authoritativeness
>>>     of senses in different languages, it is important that you record
>>>     what senses are a translation of what other senses.
>>>
>>>     In relation to the senses that are extracted from Wikipedia
>>>     interlanguage links. Do you consider those 'translations', and in
>>>     particular can you tell from those which is the source and which
>>>     is the target?
>>>
>>>     Interested to hear what you think.
>>>
>>>     cheers,
>>>     Dave
>>>
>>>
>>>
>>>     On 22/05/2014 17:41, Roberto Navigli wrote:
>>>>     Thanks Felix! To answer Dave's comment: translations come from
>>>>     the automatic translations of semantically annotated corpora, as
>>>>     Tiziano said, and we have a confidence for each of these
>>>>     translations together with the source of the original text.
>>>>
>>>>     Best,
>>>>     Roberto
>>>>
>>>>
>>>>     2014-05-22 18:35 GMT+02:00 Tiziano Flati
>>>>     <tiziano.flati@gmail.com <mailto:tiziano.flati@gmail.com>>:
>>>>
>>>>         @Felix:
>>>>
>>>>             I am wondering if ITS 2.0 properties could help here, see
>>>>             https://www.w3.org/International/its/wiki/ITS-RDF_mapping
>>>>             There is mtConfidence which provides the confidence
>>>>             value for machine translation and
>>>>             mtConfidenceAnnotatorsRef  to identify the tool used.
>>>>             Also, there are provenance related properties, starting
>>>>             at  :org, until :revToolRef, that could identify the
>>>>             provenance information you need. The underlying
>>>>             definitions for the two ITS data categories are at
>>>>             http://www.w3.org/TR/its20/#provenance
>>>>             http://www.w3.org/TR/its20/#mtconfidence
>>>>
>>>>         Yes, I think that the ITS 2.0 can definitely be a very good
>>>>         point to explore. At the moment I don't think we need
>>>>         modelling properties more complex than those ones (such as
>>>>         mtConfidenceRule, etc.), so I think this fits well our needs.
>>>>
>>>>         @Lewis:
>>>>
>>>>             Do you currently know the provenance of the translations
>>>>             between senses in BabelNet? Have you produced any of the
>>>>             translations yourself, or do you just take the links
>>>>             where they are present in the source resources, e.g.
>>>>             DBpedia?
>>>>             What is the policy in Babelnet, is some translation
>>>>             better than none, or is there a translation confidence
>>>>             threshold, e.g. based on human checking, Mt confidence
>>>>             or logical inference etc that you employ?
>>>>
>>>>         BabelNet translations can come from explicit resource
>>>>         information (e.g., Wikipedia interlanguage links) or as
>>>>         automatic translations supported by millions of sense-tagged
>>>>         sentences coming from Wikipedia and Semcor.
>>>>         In conclusion, AFAIK, BabelNet *does have* translation
>>>>         quality estimation, so I think that indication about
>>>>         confidence could be also provided. (Roberto, correct me if I
>>>>         am wrong)
>>>>
>>>>         Thank you all for your comments and suggestions :)
>>>>         Tiziano
>>>>
>>>>         2014-05-22 16:07 GMT+02:00 Dave Lewis <dave.lewis@cs.tcd.ie
>>>>         <mailto:dave.lewis@cs.tcd.ie>>:
>>>>
>>>>             Hi Tiziano, Roberto,
>>>>             Do you currently know the provenance of the translations
>>>>             between senses in BabelNet? Have you produced any of the
>>>>             translations yourself, or do you just take the links
>>>>             where they are present in the source resources, e.g.
>>>>             DBpedia?
>>>>
>>>>             In a localization or MT application we look at in CNGL
>>>>             and FALCON, where we may use translation to guide
>>>>             translators or help train MT engines, the provenance is
>>>>             important so some policies can be applied to reduce the
>>>>             propagation of inaccurate translation, or translation
>>>>             that are not appropriate to the context at hand - so
>>>>             those ITS attributes are really important there. To
>>>>             this end, when representing this as linked data, we
>>>>             define 'wasTranslatedFrom' as a subproperty of
>>>>             'prov:wasDerivedFrom' to reify other provenance
>>>>             meta-data -  agents, tools, context etc.
>>>>
>>>>             What is the policy in Babelnet, is some translation
>>>>             better than none, or is there a translation confidence
>>>>             threshold, e.g. based on human checking, Mt confidence
>>>>             or logical inference etc that you employ?
>>>>
>>>>             many thanks,
>>>>             Dave
>>>>
>>>>
>>>>             On 22/05/2014 10:42, Felix Sasaki wrote:
>>>>>             Hi Tiziano,
>>>>>
>>>>>             sorry that I could not make the call due to personal
>>>>>             reasons.
>>>>>
>>>>>             In the draft I saw under „translation“ this issue:
>>>>>
>>>>>             „Issues: Information about translation confidence (was
>>>>>             it humanly or automatically produced? if automatic,
>>>>>             with what confidence score?) and translation provenance
>>>>>             (what text(s) does the translation come from? who
>>>>>             translated and with what tool?).
>>>>>             Another issue concerns whether the
>>>>>             relation lexinfo:translation is essential or not: every
>>>>>             sense in a language within a BabelSynset is, in fact, a
>>>>>             translation of any other sense in another language, so
>>>>>             that this information could actually be derived
>>>>>             (problem of redundancy). However, having data linked
>>>>>             one to each other could also be a benefit, since
>>>>>             the information is explicit in the resource.“
>>>>>
>>>>>             I am wondering if ITS 2.0 properties could help here, see
>>>>>
>>>>>             https://www.w3.org/International/its/wiki/ITS-RDF_mapping
>>>>>
>>>>>             There is mtConfidence which provides the confidence
>>>>>             value for machine translation and
>>>>>             mtConfidenceAnnotatorsRef  to identify the tool used.
>>>>>
>>>>>             Also, there are provenance related properties, starting
>>>>>             at  :org, until :revToolRef, that could identify the
>>>>>             provenance information you need. The underlying
>>>>>             definitions for the two ITS data categories are at
>>>>>             http://www.w3.org/TR/its20/#provenance
>>>>>             http://www.w3.org/TR/its20/#mtconfidence
>>>>>
>>>>>             Best,
>>>>>
>>>>>             Felix
>>>>>
>>>>>             Am 22.05.2014 um 10:12 schrieb Tiziano Flati
>>>>>             <tiziano.flati@gmail.com
>>>>> <mailto:tiziano.flati@gmail.com>>:
>>>>>
>>>>>>             Dear all,
>>>>>>
>>>>>>             we have compiled a first draft of guidelines for the
>>>>>>             conversion of BabelNet as Linguistic Linked Data. The
>>>>>>             initial draft is here
>>>>>>             <https://docs.google.com/document/d/184C_AjY7_PYBSc8SnAFghGLyTo1v312N34dsP9QZokI/edit#>.
>>>>>>
>>>>>>             We can probably integrate this into the BPMLOD
>>>>>>             community report both as a separate document and in
>>>>>>             the form of all our resource-dependent and independent
>>>>>>             details/comments.
>>>>>>             Any feedback and comment is also very appreciated and
>>>>>>             will help us improving the draft.
>>>>>>
>>>>>>             Best regards,
>>>>>>             Tiziano Flati and Roberto Navigli
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>     --
>>>>     =====================================
>>>>     Roberto Navigli
>>>>     Dipartimento di Informatica
>>>>     Sapienza University of Rome
>>>>     Viale Regina Elena 295 (second floor)
>>>>     00161 Roma Italy
>>>>     Phone: +39 0649255161 <tel:%2B39%200649255161> - Fax: +39 06
>>>>     8541842 <tel:%2B39%2006%208541842>
>>>>     Home Page: http://wwwusers.di.uniroma1.it/~navigli
>>>>     <http://wwwusers.di.uniroma1.it/%7Enavigli>
>>>>     =====================================
>>>
>>>
>>>
>>>
>>> --
>>> Jorge Gracia, PhD
>>> Ontology Engineering Group
>>> Artificial Intelligence Department
>>> Universidad Politécnica de Madrid
>>> http://delicias.dia.fi.upm.es/~jgracia/
>>> <http://delicias.dia.fi.upm.es/%7Ejgracia/>
>>
>
>
> --
>
> Prof. Dr. Philipp Cimiano
>
> Phone: +49 521 106 12249
> Fax: +49 521 106 12412
> Mail: cimiano@cit-ec.uni-bielefeld.de
>
> Forschungsbau Intelligente Systeme (FBIIS)
> Raum 2.307
> Universität Bielefeld
> Inspiration 1
> 33619 Bielefeld
>
>
Received on Friday, 30 May 2014 14:38:23 UTC