RE: Multidimensional Quality Metrics and Linguistic Linked Data from Manuel.CARRASCO-BENITEZ@ec.europa.eu on 2014-07-21 (public-ld4lt@w3.org from July 2014)

From: <Manuel.CARRASCO-BENITEZ@ec.europa.eu>
Date: Mon, 21 Jul 2014 13:28:28 +0000
To: <dave.lewis@cs.tcd.ie>, <public-ld4lt@w3.org>
Message-ID: <39DB516E46C0E842A2CFFF1BBB7412F15F7E6294@S-DC-ESTF03-B.net1.cec.eu.int>
Dave,

1. Manual and programmatic processing
- Manual: applicable to corpora of a few megabytes.
- Programmatic: mandatory for large corpora, from hundreds of megabytes to terabyte size.

The most relevant application today is data cleaning for translation memories (TM) and statistical machine translation (SMT). Results (annotations or other) must be properly structured.

2. Record and database
The specification must be valid both for a single record and for a whole database. For example, an agent (human or machine) could be reading one term in a terminological database or downloading a 1 TB database. For applications such as training SMT engines, the data must be local.

3. N-lingual
The specification must be n-lingual. Programmatic processing of a large n-lingual corpus could yield results that are not possible with other corpora, such as bilingual corpora: one has more data to work with. This field needs a lot of work.

4. Terminology
We should fix the terminology straight away to avoid misunderstandings. Proposals:

- Corpus: dataset containing linguistic data. Synonym: linguistic corpus. Plural: corpora.

- Media-types of the linguistic data: a subset of the IANA registered media-types; in particular, text, sound and video. A corpus could contain data with several media-types, though typically all the data should have the same media-type.

- Corpus types according to the number of different languages: monolingual, bilingual and n-lingual. These are the most common cases; n-lingual is the most general and the others can be considered particular cases.

- Aligned multilingual parallel texts: synchronised linguistic data.

- Monolingual corpus: corpus in which all the linguistic data are in the same language.

- Bilingual corpus: corpus with aligned linguistic data in two languages.

- N-lingual corpus: corpus with aligned linguistic data in n languages.
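
As an illustration only, the proposed terms could be sketched as a small data model. The class and field names below (AlignedUnit, Corpus, media_type, corpus_type) are hypothetical and not part of any specification; the sketch assumes text data keyed by language tag and classifies a corpus by the number of distinct languages it contains.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AlignedUnit:
    """One synchronised unit: language tag -> linguistic data in that language."""
    versions: Dict[str, str]


@dataclass
class Corpus:
    """Dataset containing linguistic data (hypothetical model)."""
    media_type: str  # an IANA media type such as "text/plain"
    units: List[AlignedUnit] = field(default_factory=list)

    def languages(self) -> Set[str]:
        """Collect every language tag used in any unit."""
        langs: Set[str] = set()
        for unit in self.units:
            langs.update(unit.versions)
        return langs

    def corpus_type(self) -> str:
        """Classify by number of languages; n-lingual is the general case."""
        n = len(self.languages())
        if n == 0:
            raise ValueError("empty corpus has no type")
        if n == 1:
            return "monolingual"
        if n == 2:
            return "bilingual"
        return "n-lingual"


if __name__ == "__main__":
    corpus = Corpus(
        media_type="text/plain",
        units=[AlignedUnit({"en": "Hello", "fr": "Bonjour", "es": "Hola"})],
    )
    print(corpus.corpus_type())
```

The same model covers a single record (one AlignedUnit) and a whole database (the full list), and treats monolingual and bilingual corpora as particular cases of the n-lingual one.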

More at "Open architecture for multilingual parallel texts"
  http://arxiv.org/pdf/0808.3889


Regards
Tomas

-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie] 
Sent: Thursday, July 17, 2014 5:16 PM
To: CARRASCO BENITEZ Manuel (DGT); public-ld4lt@w3.org
Subject: Re: Multidimensional Quality Metrics and Linguistic Linked Data

Hi Manuel,
As we discussed on the LD4LT call, I completely agree that we need to 
support automated annotations as well as manual ones. The meta-share 
ontology has support for this, but there may be requirements for 
capturing more meta-data in relation to automated annotation, such as 
confidence scores and the type, version, instance & provenance of the 
automated agent performing the annotation.
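
A minimal sketch of what such an annotation record might carry, for discussion only. All names here (AgentInfo, QualityAnnotation, issue_type and so on) are hypothetical and are not taken from the meta-share ontology, ITS 2.0 or MQM; the 0 to 1 confidence range is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentInfo:
    """Description of the automated agent (hypothetical fields)."""
    agent_type: str   # kind of tool, e.g. "length-ratio-checker"
    version: str      # software version of the tool
    instance: str     # identifier of the running instance
    provenance: str   # where the tool or its output came from


@dataclass(frozen=True)
class QualityAnnotation:
    """One quality annotation on part of a resource (hypothetical model)."""
    target: str                        # identifier of the annotated part
    issue_type: str                    # illustrative issue label
    manual: bool                       # True if produced by a human
    confidence: Optional[float] = None  # assumed to lie in [0, 1]
    agent: Optional[AgentInfo] = None

    def __post_init__(self) -> None:
        # Reject confidence scores outside the assumed range.
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # An automated annotation must say which agent produced it.
        if not self.manual and self.agent is None:
            raise ValueError("automated annotation requires agent info")


if __name__ == "__main__":
    agent = AgentInfo("length-ratio-checker", "0.1", "node-1", "local run")
    ann = QualityAnnotation("segment-42", "length", manual=False,
                            confidence=0.8, agent=agent)
    print(ann)
```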

We touched on some of these issues in the MLW-LT working group in 
developing ITS2.0, but I think we'll hit them again when examining MQM 
in more detail as an annotation mechanism for linked data. This already 
includes both manual and automated quality annotations.

We also need to consider situations where automated annotations are run 
through a manual, sampled QA check process, i.e. where annotations are a 
combination of an automated and a manual process.

So any concrete requirements, use cases or examples you can expose from 
your current automated annotation work would be very welcome contributions.

Kind Regards,
Dave

On 10/07/2014 13:19, Manuel.CARRASCO-BENITEZ@ec.europa.eu wrote:
> Dave,
>
> I am working on preparing large amounts of multilingual data programmatically. For smallish amounts of data, one can consider manual annotation; for large amounts this is unrealistic and it has to be done programmatically.
>
> The same goes for quality. First one has to aim for language-independent quality techniques; for example, statistical techniques for measuring corpus-dependent length ratios among n languages. These could later be refined further with language-dependent techniques.
>
> When the programs are more stable, they might be published.
>
> Regards
> Tomas
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Wednesday, July 02, 2014 3:46 PM
> To: public-ld4lt@w3.org
> Subject: Multidimensional Quality Metrics and Linguistic Linked Data
>
> Hi all,
> One requirement that has come up several times in our discussions about
> linguistic linked data is how to manage the quality of such data. One
> obvious advantage of using linked data for language resources is that it
> becomes easy for third parties to annotate different parts of that
> resource with quality assessment which can be used to drive quality
> management processes.
>
> To this end we initiated some discussion with Arle Lommel from DFKI who
> has been driving the development of the Multidimensional Quality Metrics
> (MQM) specification in the QTLaunchPad project:
> http://www.qt21.eu/mqm-definition/definition-2014-06-04.html
>
> This encompasses a wide range of established quality assessment metrics
> for translation, and therefore may provide a concrete model that could
> be used more widely in annotating the linguistic quality of multilingual
> linked data.
>
> There has also been a discussion in the ITS IG about the benefits to the
> MQM spec itself from an RDF mapping:
>
> http://lists.w3.org/Archives/Public/public-i18n-its-ig/2014Jun/0009.html
>
> We plan to cover this in tomorrow's LD4LT call, but please send any
> additional thoughts you may have on this to the list, e.g. on other existing quality
> annotation vocabularies in use for linked data.
>
> cheers,
> Dave
>
>
Received on Monday, 21 July 2014 13:28:59 UTC