RE: [ISSUE-42] Wording for the tool information markup

Hi Dave,

By the third option I mean the situation where the data contains embedded information (the format or tags do not actually matter), let’s say 5 MB of encoded data, which should never be processed by a translation engine, as that would be a useless waste of computational resources (large amounts of such information also sometimes raise stability issues... and require much more intensive development effort to make systems stable enough). If you do process it, declaring it useful context while keeping the translation as is, you are actually asking the MT engine to deal with such possibly vast amounts of data and use it as contextual information. But ... it may not even contain any useful contextual information.

In my opinion, when building a Web access MT system, I would divide all data into three groups: 1) translatable, 2) non-translatable with useful contextual information, 3) non-translatable with no useful contextual information (ignorable).
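
As a rough illustration of the three groups, here is a minimal Python sketch. The names `Handling` and `prepare_for_mt` are hypothetical and not part of ITS or of any MT system; the sketch only assumes that each segment has already been assigned to one of the three groups.

```python
from enum import Enum


class Handling(Enum):
    """Hypothetical labels for the three groups described above."""
    TRANSLATE = 1        # group 1: translatable
    KEEP_AS_CONTEXT = 2  # group 2: non-translatable, useful context
    IGNORE = 3           # group 3: non-translatable, no useful context


def prepare_for_mt(segments):
    """Drop ignorable segments; flag the rest as translate / keep as is.

    Returns (text, should_translate) pairs in document order.
    """
    return [
        (text, handling is Handling.TRANSLATE)
        for text, handling in segments
        if handling is not Handling.IGNORE
    ]


segments = [
    ("Price list", Handling.TRANSLATE),
    ("ACME-3000", Handling.KEEP_AS_CONTEXT),
    ("iVBORw0KGgo...", Handling.IGNORE),  # stands in for a large encoded blob
]
mt_input = prepare_for_mt(segments)
```

Only the first two segments would reach the engine here; the encoded blob is skipped before any processing takes place.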

The question is whether you want ITS to allow MT engines to identify the third category, or whether you think that it is not relevant to ITS. Nowadays, when formats keep changing and get overfilled with embedded information, I think it would be useful to be able to distinguish between all three categories and not just two. Any thoughts?

Best regards,
Mārcis ;o)

From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
Sent: Wednesday, October 10, 2012 2:57 AM
To: public-multilingualweb-lt@w3.org
Subject: Re: [ISSUE-42] Wording for the tool information markup

Hi Mārcis, Felix
I'm not sure I fully understand the use case you are addressing with these translation enumeration extensions.

I know from Declan that with Moses, you can handle no-translates just by marking the text as something to be translated as itself, so it still gets physically processed by the engine, but this is simpler than removing the text (with some loss of context). So annotations designed to prevent 'unnecessary' machine translations may not be very worthwhile.

Is the use case more, therefore, that you want to alert the translation provider that the text probably won't be well translated by machine and should be prioritised for human translation or postediting?

Either way I'd reinforce Felix's point about the problems changing the translation enumeration. It would be a backward compatibility violation with ITS1.0, and a major one because there are several implementations using the existing yes/no enumeration.

The prioritisation of certain processes was actually a requirement we identified early on (coming from an open session we held at a MultilingualWeb workshop in Luxembourg): see:
http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#readiness


This might be a better route to meeting this use case.

cheers,
Dave




On 09/10/2012 14:29, Felix Sasaki wrote:
Hi Mārcis,
2012/10/9 Mārcis Pinnis <marcis.pinnis@tilde.lv>
Hi, all,

(replied inline)

Best regards,
Mārcis ;o)

From: Tadej Štajner [mailto:tadej.stajner@ijs.si]
Sent: Tuesday, October 09, 2012 3:02 PM
To: Felix Sasaki
Cc: Mārcis Pinnis; Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs

Subject: Re: [ISSUE-42] Wording for the tool information markup

Hi, all,
(reply inline)

On 09. 10. 2012 09:15, Felix Sasaki wrote:
Hi Mārcis,
2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
Hi Felix,

I believe that “processInfo” (if renamed from “toolInfo”) will not overlap with provenance (although I do not think that “process” is the right name; “annotatorInfo” would sound more reasonable). Provenance is something that is assigned to a term (a specific concept) by an authority and not by the annotation or an annotation tool/user. That is, a user could mark a term, but he would not be responsible for the provenance of the term, as that is assigned to the term in a term bank by someone with the rights to do so (or by the creator of the term). Also, provenance for terms is already given in a term bank, so we would not need to standardize something that can be referenced (following your thought of what can be referenced and what should be standardized). However, for automated processes it can be useful to know how trustworthy an annotation is. This can be done in two ways: 1) follow a term bank reference and check the provenance for terms that are linked to a term bank entry; 2) decide, based on the annotator, how trustworthy the term might be (for term candidates and terms not linked to a term bank entry).

I hope our understanding of what provenance means in this case does not differ (I am referring to term provenance)? If by provenance you meant something like the “annotation’s provenance”, then I agree that, by identifying the annotator, we will also add an annotation provenance. However, automated systems can benefit if the source of the content annotation can be identified (or at least traced...). What are your thoughts on this matter? How much do you want to ensure traceability in ITS?


I would like to keep the principle of disjunct data categories, and leave it to applications to interrelate provenance information for the content. W.r.t. traceability of ITS information, yes, I agree - that IMO would be the main use case for tool information. The question is whether traceability can be assured "only" via a URI, see
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html


 Mārcis, Tadej, David,  ... any thoughts?


As I understand, we're dealing with:
1) provenance of term itself
2) provenance of an instance annotation of the term in some text

1 is probably out of scope, 2 is something that we'd cover by the toolInfo/processInfo attribute. Maybe 1) is also interesting in some cases, but I would speculate that it's rarely something I'd want to inline in a document with an annotation.

Also, would 'agent' be a clearer term for 'tool info' or 'process info'?

-- Tadej

1 is covered in term banks (or ... at least should be) and probably is out of scope as I understand it. Actually this is a data category that, if necessary, should be resolved by applications (programs/users) following the references to the term entries in a term bank (if such are given), thus the annotation should not be redundant.
For 2, I think Tadej’s idea about “agentInfo” is more appropriate than “toolInfo” or “processInfo”.

Felix

About Translate, I meant the understanding from a machine user’s perspective. For a machine user (MT system) 1) and 2) may be equally important and it would be good if the machine user would be able to distinguish the two types within a document. If I understand locNote correctly, this category is not meant for machine users, but rather human translators.
I agree with your statements about locNote, and I understand the need to distinguish the two types in a document. What you describe as 2) could be achieved by the locale filter
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation

e.g.
<its:rules version="2.0"> <its:localeFilterRule selector="//img" localeFilterList=""/> </its:rules>
This expresses that all "img" elements are not part of the localization workflow. Would that fulfil your needs?

I agree, this would do the trick. However, won’t this corrupt the data for other purposes (for instance, if currencies in a table had to be converted (not translated) to a different locale's currency by some specialists)? That is, I think that re-using the locale filter for MT purposes might actually cause some other processes not to work... An easier solution, in my opinion, would be to make the Translate category enumerable (translate="keep-as-is" or translate="no"; translate="yes"; translate="ignore", with "ignore" indicating that a segment has to be ignored/skipped by a translation engine). Any thoughts on this?


I agree with your feedback about localeRule. However, overloading "translate" would cause a mismatch with other vocabularies that use a "translate" attribute: e.g. both DITA and HTML5 have a translate attribute, in no namespace or in a different namespace, with the same semantics as ITS "translate". Adding more values would create a misalignment.

To get a feeling about the importance of this: who would implement an additional value for "translate" (or the meaning of "keep-as-is" in a separate data category) - who would need that use case?

Felix


Best,

Felix
Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Thursday, October 04, 2012 6:40 PM
To: Mārcis Pinnis
Cc: Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
Subject: Re: [ISSUE-42] Wording for the tool information markup

Hi Mārcis,

your mail did not reach the list. Just FYI, I think you were subscribed to the list with
marcis.pinnis@Tilde.lv (with upper case "T" in "Tilde"). I changed that to marcis.pinnis@tilde.lv, so your next mail should reach the list. Some comments below.

2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
Dear Felix,

Thank you for the explanation. I see that the toolinfo can manage the identification of tools. But does ITS also require users (people) to be treated as tools?


We could rename "tool" to process - and would end up with provenance. But maybe that's sufficient.


That was not clear to me. Or, does ITS specify separate tags for identification of who/what added an annotation?

No, that's exactly the point: we don't have a way to specify "who created an annotation?". The purpose of "tool info" is just that. And it is - to use that nice word again - "orthogonal" to the data category annotation itself. That is, you want to relate it to its:term, but you don't want to repeat it all the time, and you don't want to make it mandatory.


I guess it is clear that a “termConfidence” is necessary. And the “term” tag is required (the termCandidate can be omitted, as it could potentially be redundant if a reference to the annotator or the authority of annotation is given).

On the Translate question maybe you can explain a bit more why, in your opinion, the 1) and 2) should be combined in a general meaning? They both describe data that has to be handled differently. The “Translate” category as I understand solves either 1) or 2) (and this depends on every implementation), but not both.


Yes, that was my point: we leave it to the implementation whether the implementation wants to handle 1) or 2). The main idea of ITS is to specify really atomic metadata items.

Your requirement to differentiate 1) vs. 2) could e.g. be handled by a localization note:

<its:locNoteRule selector="//h:img" locNote="Drop this in the workflow, don't give it to translator"/>

But you are probably looking for a machine readable way to achieve this?

Best,

Felix


Best regards,
Mārcis.

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Thursday, October 04, 2012 3:58 PM
To: Mārcis Pinnis
Cc: Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs

Subject: Re: [ISSUE-42] Wording for the tool information markup


2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
Dear Felix,

Having only the confidence to distinguish between an automatically identified term and a user-approved term is not enough, as various term annotation tools can have different confidence scores (they may also be in log form, depending on the implementation). Thus having a strict value “1” for user-approved/term-bank-based terms is not enough. In an ideal scenario, at least from my perspective, there should be a way to identify who (a system, which system? a user, who? an authority, which authority?) annotated each term (not just at document level, but also at individual term level) and what the confidence of the respective identifier given to the term candidate (or even a term) is.


Understand. That might bring us to "toolinfo" again. The solution that Yves mentioned at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html

would allow you to create identifiers for this complex type of information.


To make it a bit more simple, using only termConfidence to distinguish between user approved or trusted terms is not enough as the termConfidence is not reliable for such purposes.

A natural representation, in my opinion, would identify the “annotator” (using categories – term bank, user, automatic tool, authority), the term confidence and the ID of the “annotator” (in order to identify the annotator precisely).

Of course, for TermBank based terms there should be also a reference pointer so that more information could be identified.


Understand - the question mainly is: what needs to be standardized, and what could be a URI to that complex information.





Actually ... one question that is off topic here ... I tried following your discussions about the MT-related “Translate” data category and a question arose: do you distinguish between something that:

1)      has to be passed through a translation system, but should not be translated (should be kept as is, but is helpful for disambiguation of the translatable parts);

2)      has to be completely ignored and not even passed through a translation system (for instance, numbers in tables, encrypted images within HTML5, etc.).

From what I have understood (maybe I did not get the full picture) – the “Translate” tag is meant only for an MT system to tell it that something has to be kept as is, but some parts could be irrelevant to send through the MT systems, but that is not solved by the Translate tag.

"Translate" in fact is very general and doesn't distinguish between 1) and 2). E.g. IIRC, in Okapi it is used also to create pseudo translated text.

Best,

Felix


Best regards,
Mārcis Pinnis
Researcher
Tilde

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Thursday, October 04, 2012 2:54 PM
To: Tatiana Gornostay
Cc: Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis; Raivis Skadiņš; Andrejs Vasiļjevs

Subject: Re: [ISSUE-42] Wording for the tool information markup

Dear Tatiana, all,
2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
Dear Felix, Yves, Dear All,


W.r.t. the ongoing discussion on toolInfo and mtConfidence, I have in mind the following potential attributes proposed by Tilde in view of terminology use case, I mean, its-termInfoRef, its-termCandidate, and its-termConfidence and their values.

Would it also work to just add "termConfidence" to

http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation


we then could say: if something is a term, then the confidence is 1, that is
<span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
is equal to
<span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span> (ITS 2.0)
and a term candidate would be
<span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span> (ITS 2.0)
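
A consumer could read the proposed default roughly as in the following Python sketch. It assumes the attribute spelling used in the examples above (an unprefixed "termConfidence", which is only a proposal in this thread), and the function name `effective_term_confidence` is made up for illustration.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def effective_term_confidence(elem):
    """Return the confidence of a term annotation, or None if not a term.

    Assumption from the proposal above: a term without an explicit
    termConfidence is treated as having confidence 1.
    """
    if elem.get(f"{{{ITS_NS}}}term") != "yes":
        return None
    return float(elem.get("termConfidence", "1"))


plain = ET.fromstring(
    f'<span xmlns:its="{ITS_NS}" its:term="yes">motherboard</span>'
)
candidate = ET.fromstring(
    f'<span xmlns:its="{ITS_NS}" its:term="yes" termConfidence="0.9">'
    "motherboard</span>"
)
```

Here `plain` would be read with confidence 1 and `candidate` with 0.9; whether a fixed 1 is an adequate marker for approved terms is exactly what is questioned elsewhere in this thread.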

Felix

These are not represented in the current draft, and if we go this way then we will have to discuss and, probably, add them. I can remember that Tadej raised this question in Prague and we did not talk about it, unfortunately. On the other hand, as soon as we start the project we will have the opportunity and time to do it, and my colleagues will also join the discussion.



With best wishes,

Tatiana

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Wednesday, October 03, 2012 12:29 AM
To: Yves Savourel
Cc: public-multilingualweb-lt@w3.org

Subject: Re: [ISSUE-42] Wording for the tool information markup

Hi Yves, all,

no opinion on my side on the delimiter topic, sorry for bringing it up. A comment on the tool specific aspect below.
2012/10/2 Yves Savourel <ysavourel@enlaso.com>
> <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> xmlns:its="http://www.w3.org/2005/11/its">
>
> Would it make sense to use a different delimiter? "/" may conflict with "/" in paths.
Hmm... almost any ASCII delimiter may also be in the path. The first occurrence is the delimiter.
But I suppose '|' could be used instead. It just doesn't look as graceful for some reason.
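
The "first occurrence is the delimiter" rule can be sketched in Python as follows. The function name `split_tool_ref` is hypothetical, and the value format is the one from the quoted example, not a settled part of the draft.

```python
from typing import Tuple


def split_tool_ref(value: str) -> Tuple[str, str]:
    """Split 'dataCategory/URI' at the first '/' only.

    Later slashes belong to the URI, so they cause no ambiguity.
    """
    category, sep, uri = value.partition("/")
    if not sep or not category or not uri:
        raise ValueError(f"not a 'category/URI' reference: {value!r}")
    return category, uri


ref = split_tool_ref("mtConfidence/file:///tools.xml#T1")
```

This works as long as data category names themselves never contain "/"; the same logic would apply unchanged with '|' as the delimiter.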


> Do you need the "dataCategory" attribute? It seems the
> data category is made explicit via the reference mechanism in "its:toolRefs".
> Also, dropping the "dataCategory" attribute allows then to refer to
> the same tools from various data categories - e.g. OKAPI used for quality
> issue versus for creating translation metadata etc.
I'm not sure we can go from many data category instances to one tool information. And this is where I'm having trouble with tool information:

The mtConfidence needs to have a defined way to specify the engine used

Is there really a defined way? The current version of the draft at
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation

says:

"Some examples of values are:
A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to Japanese MT engine
A Domain as per the Section 6.9: Domain
A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical, etc."

To me that is the same as saying: you can use anything. Of course we can wrap the "anything" in a field saying "here is MT engine information". Is that what you mean?


, the Text analysis may need something else

I actually doubt that the text analysis "anything" will be more specific. My prediction is that there will be not more interop than saying "in this field there is data category specific information: ...".

So you could achieve that by changing your proposal like this



<its:processInfo>
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>

and allow for several addInfo elements in one "toolInfo". You won't gain a lot from these, but no less than with "FR-to-EN-General" inside "toolValue" at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
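
To illustrate how a consumer might look up the data-category-specific value for one tool, here is a small Python sketch. The element and attribute names are the ones proposed in this thread, not part of any published draft, and the helper `add_info_for` is hypothetical.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id

# Proposed markup from this thread, with a namespace declaration added.
PROCESS_INFO = f"""
<its:processInfo xmlns:its="{ITS}">
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>
""".strip()


def add_info_for(root, tool_id, data_category):
    """Return all toolAddInfo values of one tool for one data category."""
    for tool in root.iter(f"{{{ITS}}}toolInfo"):
        if tool.get(XML_ID) == tool_id:
            return [
                info.text
                for info in tool.findall(f"{{{ITS}}}toolAddInfo")
                if info.get("datacategory") == data_category
            ]
    return []


root = ET.fromstring(PROCESS_INFO)
t1_info = add_info_for(root, "T1", "mtconfidence")
```

The returned strings stay opaque to the consumer, which is the point made above: the wrapper only says that the value belongs to a given data category.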


Best,

Felix


, etc. It seems each data category will need one or two entries that mean different things depending on the data category. We can use a common element for this, but then we need to have one tool information per data category.

Maybe the examples people are working on (action items 239 to 243 for Arle, Phil, Declan and Tadej) will help in defining this.

Cheers
-yves



--
Felix Sasaki
DFKI / W3C Fellow





Received on Thursday, 11 October 2012 06:01:19 UTC