RE: Multidimensional Quality Metrics and Linguistic Linked Data


I'll have a look and come back to you.


From: Dave Lewis []
Sent: 30 July 2014 18:30
Subject: Re: Multidimensional Quality Metrics and Linguistic Linked Data

Hi Thomas,
Thanks for this input, we'll cover this on the LD4LT call tomorrow (31st).

Also, note we have seeded a best practice guidelines document in the
BP-MLOD group on publishing bi-text, see:

This was in response to the need for clearer linked data bitext
guidelines identified in a document being discussed with GALA, TAUS and
others at the ITS Interest Group in relation to open data management for
public MT services. It is intended as an input to the MLi
specification activities in relation to the planned Connecting Europe
Facility on Automated Translation.

Would you be interested in contributing your existing work on bi-text
publishing to this activity?


On 21/07/2014 14:28, wrote:
> Dave,
> 1. Manual and programmatic processing
> - Manual: applicable to corpora of a few megabytes.
> - Programmatic: mandatory for large corpora, from hundreds of megabytes to terabyte size.
> The most relevant application today is data cleaning for translation memories (TM) and statistical machine translation (SMT). Results (annotations or otherwise) must be properly structured.
> 2.  Record and database
> The specification must be valid both for a single record and for a whole database. For example, an agent (human or machine) could be reading one term in a terminological database or downloading a 1 TB database. For applications such as training SMT engines, the data must be local.
> 3. N-lingual
> The specification must be n-lingual. Programmatic processing of a large n-lingual corpus can yield results that are not possible with other corpora, such as bilingual corpora: one has more data to work with. This field needs a lot of work.
> 4. Terminology
> We should fix the terminology straight away to avoid misunderstandings. Proposals:
> - Corpus: dataset containing linguistic data. Synonym: linguistic corpus. Plural: corpora.
> - Media types of the linguistic data: a subset of the IANA-registered media types; in particular, text, sound and video. A corpus may contain data of several media types, though typically all the data share the same media type.
> - Corpus types according to the number of different languages: monolingual, bilingual and n-lingual. These are the most common cases; n-lingual is the most general, and the others can be considered particular cases.
> - Aligned multilingual parallel texts: synchronised linguistic data.
> - Monolingual corpus: corpus in which all the linguistic data are in the same language.
> - Bilingual corpus: corpus with aligned linguistic data in two languages.
> - N-lingual corpus: corpus with aligned linguistic data in n languages.
> More at "Open architecture for multilingual parallel texts"
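As an editorial aside, the proposed corpus terminology could be sketched as a minimal data model. This is a hypothetical illustration only; the class and field names are my own, not part of any specification:

```python
from dataclasses import dataclass

@dataclass
class Corpus:
    """A dataset containing linguistic data (synonym: linguistic corpus)."""
    languages: list[str]      # language tags, e.g. ["en", "de"]
    media_type: str = "text"  # subset of IANA media types
    aligned: bool = False     # True for synchronised (parallel) data

    @property
    def corpus_type(self) -> str:
        """Classify by number of distinct languages; n-lingual is the
        general case, monolingual and bilingual are particular cases."""
        n = len(set(self.languages))
        return {1: "monolingual", 2: "bilingual"}.get(n, "n-lingual")

# A bilingual corpus: aligned linguistic data in two languages.
bitext = Corpus(languages=["en", "es"], aligned=True)
print(bitext.corpus_type)  # bilingual
```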
> Regards
> Tomas
> -----Original Message-----
> From: Dave Lewis []
> Sent: Thursday, July 17, 2014 5:16 PM
> Subject: Re: Multidimensional Quality Metrics and Linguistic Linked Data
> Hi Manuel,
> As we discussed on the LD4LT call, I completely agree that we need to
> support automated annotations as well as manual ones. The META-SHARE
> ontology has support for this, but there may be requirements for
> capturing more metadata in relation to automated annotation, such as
> confidence scores and the type, version, instance & provenance of the
> automated agent performing the annotation.
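A minimal sketch of what such annotation metadata might capture, purely as an editorial illustration (the field names and example values are assumptions, not drawn from the META-SHARE ontology or ITS2.0):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AutomatedAnnotation:
    """One quality annotation produced by an automated agent."""
    target: str         # IRI of the annotated resource segment
    issue_type: str     # e.g. an MQM issue type such as "mistranslation"
    confidence: float   # agent's confidence score in [0, 1]
    agent_type: str     # kind of tool that produced the annotation
    agent_version: str  # version of that tool
    agent_instance: str # identifier of the specific deployment
    provenance: str     # IRI describing how the annotation was derived
    created: datetime

ann = AutomatedAnnotation(
    target="http://example.org/corpus#seg42",
    issue_type="mistranslation",
    confidence=0.87,
    agent_type="machine-translation-qa",
    agent_version="2.1",
    agent_instance="node-03",
    provenance="http://example.org/prov/run-17",
    created=datetime.now(timezone.utc),
)
```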
> We touched on some of these issues in the MLW-LT working group in
> developing ITS2.0, but I think we'll hit them again when examining MQM
> in more detail as an annotation mechanism for linked data. This already
> includes both manual and automated quality annotations.
> We also need to consider situations where automated annotations are run
> through a manually sampled QA check process, i.e. where annotations are a
> combination of automated and manual processes.
> So any concrete requirements, use cases or examples you can expose from
> your current automated annotation work would be very welcome contributions.
> Kind Regards,
> Dave
> On 10/07/2014 13:19, wrote:
>> Dave,
>> I am working on preparing large amounts of multilingual data programmatically. For smallish amounts of data, one can consider manual annotation; for large amounts it is unrealistic and has to be done programmatically.
>> The same goes for quality. First, one should aim for language-independent quality techniques; for example, statistical techniques for measuring corpus-dependent length ratios among n languages. These could later be further refined with language-dependent techniques.
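A minimal sketch of one such corpus-dependent, language-independent check: filtering segment pairs whose length ratio strays far from the corpus median (the function name, band threshold and example data are illustrative assumptions, not the author's actual programs):

```python
from statistics import median

def length_ratio_filter(bitext, band=2.0):
    """Keep segment pairs whose source/target character-length ratio lies
    within a multiplicative band of the corpus median ratio -- a
    language-independent heuristic for cleaning TM / SMT training data."""
    ratios = [len(src) / max(len(tgt), 1) for src, tgt in bitext]
    m = median(ratios)
    return [pair for pair, r in zip(bitext, ratios)
            if m / band <= r <= m * band]

pairs = [("hello world", "hola mundo"),
         ("good morning", "buenos dias"),
         ("thanks", "gracias"),
         ("a", "this is a very long segment")]  # misaligned outlier
print(len(length_ratio_filter(pairs)))  # 3 (the outlier is dropped)
```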
>> When the programs are more stable, they might be published.
>> Regards
>> Tomas
>> -----Original Message-----
>> From: Dave Lewis []
>> Sent: Wednesday, July 02, 2014 3:46 PM
>> To:
>> Subject: Multidimensional Quality Metrics and Linguistic Linked Data
>> Hi all,
>> One requirement that has come up several times in our discussions about
>> linguistic linked data is how to manage the quality of such data. One
>> obvious advantage of using linked data for language resources is that it
>> becomes easy for third parties to annotate different parts of that
>> resource with quality assessments, which can be used to drive quality
>> management processes.
>> To this end we initiated some discussion with Arle Lommel from DFKI, who
>> has been driving the development of the Multidimensional Quality Metrics
>> (MQM) specification in the QTLaunchPad project:
>> This encompasses a wide range of established quality assessment metrics
>> for translation, and therefore may provide a concrete model that could
>> be used more widely in annotating the linguistic quality of multilingual
>> linked data.
>> There has also been a discussion in the ITS IG about the benefits to the
>> MQM spec itself from an RDF mapping:
>> We plan to cover this in tomorrow's LD4LT call, but please send any
>> additional thoughts you may have on this to the list, e.g. on other existing quality
>> annotation vocabularies in use for linked data.
>> cheers,
>> Dave

Received on Thursday, 31 July 2014 06:40:25 UTC