Re: [ISSUE-42] Wording for the tool information markup from Dave Lewis on 2012-10-09 (public-multilingualweb-lt@w3.org from October 2012)

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Wed, 10 Oct 2012 00:57:24 +0100
To: public-multilingualweb-lt@w3.org
Message-ID: <5074B9E4.4020108@cs.tcd.ie>
Hi Mārcis, Felix
I'm not sure I fully understand the use case you are addressing with 
these translation enumeration extensions.

I know from Declan that with Moses, you can handle no-translates just by 
marking the text as something to be translated as itself, so it still 
gets physically processed by the engine, but this is simpler than 
removing the text (with some loss of context). So annotations designed 
to prevent 'unnecessary' machine translations may not be very worthwhile.
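
As a rough illustration of that pass-through approach (a sketch only: it 
assumes Moses-style XML input markup in which an element's "translation" 
attribute supplies the output for the wrapped span; the tag name "np" and 
the function mark_passthrough are made up for this example and are not 
taken from Declan's setup):

```python
from xml.sax.saxutils import escape, quoteattr


def mark_passthrough(segments, tag="np"):
    """Join (text, translatable) pairs into one input line for the engine.

    Non-translatable spans stay in the sentence, so the engine keeps the
    context, but each one is wrapped so that its suggested translation
    is the source text itself.
    """
    parts = []
    for text, translatable in segments:
        if translatable:
            parts.append(escape(text))
        else:
            # Hypothetical tag name; the attribute value repeats the source.
            parts.append(
                f"<{tag} translation={quoteattr(text)}>{escape(text)}</{tag}>"
            )
    return " ".join(parts)


line = mark_passthrough(
    [("Click", True), ("Save As", False), ("to continue", True)]
)
print(line)
# Click <np translation="Save As">Save As</np> to continue
```

Whether the engine actually honours such markup depends on how its XML 
input handling is configured, so treat the output format as an assumption.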

Is the use case more, therefore, that you want to alert the translation 
provider that the text probably won't be well translated by machine and 
should be prioritised for human translation or postediting?

Either way I'd reinforce Felix's point about the problems with changing the 
translation enumeration. It would be a backward compatibility violation 
with ITS 1.0, and a major one because there are several implementations 
using the existing yes/no enumeration.
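
To make the compatibility concern concrete, here is a minimal sketch of 
how a strict ITS 1.0 consumer might read the attribute. The function name 
and the choice to raise an error are assumptions for illustration, not the 
behaviour of any particular implementation:

```python
# The two values defined for "translate" in ITS 1.0.
ITS10_TRANSLATE_VALUES = {"yes", "no"}


def is_translatable(value):
    """Return True for "yes", False for "no", and reject anything else."""
    if value not in ITS10_TRANSLATE_VALUES:
        # A proposed value such as "keep-as-is" or "ignore" ends up here.
        raise ValueError(f"unexpected translate value: {value!r}")
    return value == "yes"


for candidate in ("yes", "no", "keep-as-is", "ignore"):
    try:
        print(candidate, "->", is_translatable(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

A consumer written like this would fail on documents that use the 
extended values; a more lenient one might silently treat them as "yes" or 
"no", which is arguably worse.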

The prioritisation of certain processes was actually a requirement we 
identified early on (coming from an open session we held at a 
MultilingualWeb workshop in Luxembourg): see:
http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#readiness

This might be a better route to meeting this use case.

cheers,
Dave




On 09/10/2012 14:29, Felix Sasaki wrote:
> Hi Mārcis,
>
> 2012/10/9 Mārcis Pinnis <marcis.pinnis@tilde.lv 
> <mailto:marcis.pinnis@tilde.lv>>
>
>     Hi, all,
>
>     (replied inline)
>
>     Best regards,
>
>     Mārcis ;o)
>
>     *From:*Tadej Štajner [mailto:tadej.stajner@ijs.si
>     <mailto:tadej.stajner@ijs.si>]
>     *Sent:* Tuesday, October 09, 2012 3:02 PM
>     *To:* Felix Sasaki
>     *Cc:* Mārcis Pinnis; Tatiana Gornostay; Yves Savourel;
>     public-multilingualweb-lt@w3.org
>     <mailto:public-multilingualweb-lt@w3.org>; Raivis Skadiņš; Andrejs
>     Vasiļjevs
>
>
>     *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
>     Hi, all,
>
>     (reply inline)
>
>     On 09. 10. 2012 09:15, Felix Sasaki wrote:
>
>         Hi Mārcis,
>
>         2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv
>         <mailto:marcis.pinnis@tilde.lv>>
>
>         Hi Felix,
>
>         I believe that the “processInfo” (if renamed from “toolInfo”)
>         will not overlap with provenance (although, I do not think
>         that process is the right name – annotatorInfo would sound
>         more reasonable). Provenance is something that is assigned to
>         a term (a specific concept) by an authority and not the
>         annotation or an annotation tool/user. That is, a user could
>         mark a term, but he would not be responsible for the
>         provenance of the term as that is assigned to the term in a
>         term bank by someone with rights to do so (or the creator of
>         the term). Also, provenance for terms is already given in a
>         term bank, thus we would not need to standardize something
>         that can be referenced to (following your thought of what can
>         be referenced and what should be standardized). However, for
>         automated processes it can be useful to know, how trustworthy
>         an annotation is. This can be done in two ways – 1) follow a
>         term bank reference and check the provenance for terms that
>         are linked to a term bank entry; 2) decide based on the
>         annotator, how trustworthy the term might be (for term
>         candidates and terms not linked to a term bank entry).
>
>         I hope our understanding of what provenance in this case is
>         does not differ (I am referring to term provenance)?! In the
>         case if by provenance You meant something like the
>         “annotation’s provenance”, then I agree that, by identifying
>         the annotator, we will also add an annotation provenance.
>         However, automated systems can benefit if the source of the
>         content annotation can be identified (or at least traced...).
>         What are your thoughts in this matter? How much do you want to
>         ensure traceability in ITS?
>
>         I would like to keep the principle of disjunct data
>         categories, and leave it to applications to interrelate
>         provenance information for the content. Wrt traceability of
>         ITS information, yes, I agree - that IMO would be the main use
>         case for tool information. On the question whether traceability
>         can be assured "only" via a URI, see
>
>         http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
>          Mārcis, Tadej, David,  ... any thoughts?
>
>
>     As I understand, we're dealing with:
>     1) provenance of term itself
>     2) provenance of an instance annotation of the term in some text
>
>     1 is probably out of scope, 2 is something that we'd cover by the
>     toolInfo/processInfo attribute. Maybe 1) is also interesting in
>     some cases, but I would speculate that it's rarely something I'd
>     want to inline in a document with an annotation.
>
>     Also, would 'agent' be a clearer term for 'tool info' or 'process
>     info'?
>
>     -- Tadej
>
>     1 is covered in term banks (or ... at least should be) and
>     probably is out of scope as I understand it. Actually this is a
>     data category that, if necessary, should be resolved by
>     applications (programs/users) following the references to the term
>     entries in a term bank (if such are given), thus the annotation
>     should not be redundant.
>
>     For 2, I think Tadej’s idea about “agentInfo” is more appropriate
>     than “toolInfo” or “processInfo”.
>
>
>
>     Felix
>
>     About Translate, I meant the understanding from a machine user’s
>     perspective. For a machine user (MT system) 1) and 2) may be
>     equally important and it would be good if the machine user would
>     be able to distinguish the two types within a document. If I
>     understand locNote correctly, this category is not meant for
>     machine users, but rather human translators.
>
>     I agree with your statements about locNote, and I understand the
>     need to distinguish the two types in a document. What you describe
>     as 2) could be achieved by locale filter
>
>     http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation
>
>     e.g.
>
>     <its:rules version="2.0"> <its:localeFilterRule selector="//img"
>     localeFilterList=""/> </its:rules>
>
>     This expresses that all "img" elements are not part of the
>     localization workflow. Would that fulfil your needs?
>
>     I agree, this would do the trick. However, won’t this corrupt the
>     data for other purposes (for instance, if in a table currencies
>     would have to be converted (not translated) to a different locale
>     currency by some specialists)? That is, I think that re-using of
>     the locale filter for MT purposes might actually cause some other
>     processes not to work... An easier solution, in my opinion, would
>     be to make the Translate category enumerable
>     (translate=”keep-as-is” or translate=”no”; translate=”yes”;
>     translate=”ignore”, ignore being the indication that a segment
>     would have to be ignored/skipped by a translation engine). Any
>     thoughts on this?
>
>
>
> I agree with your feedback about localeRule. However, overloading 
> "translate" would cause a mismatch with other vocabularies that use a 
> "translate" attribute: e.g. both DITA and HTML5 have a translate 
> attribute in no or different namespace with the same semantics as ITS 
> "translate". Adding more values would create a misalignment.
>
> To get a feeling about the importance of this: who would implement an 
> additional value for "translate" (or the meaning of "keep-as-is" in a 
> separate data category) - who would need that use case?
>
> Felix
>
>     Best,
>
>     Felix
>
>         Best regards,
>
>         Mārcis ;o)
>
>         *From:*Felix Sasaki [mailto:fsasaki@w3.org
>         <mailto:fsasaki@w3.org>]
>         *Sent:* Thursday, October 04, 2012 6:40 PM
>
>
>         *To:* Mārcis Pinnis
>         *Cc:* Tatiana Gornostay; Yves Savourel;
>         public-multilingualweb-lt@w3.org
>         <mailto:public-multilingualweb-lt@w3.org>; Raivis Skadiņš;
>         Andrejs Vasiļjevs
>         *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
>         Hi Mārcis,
>
>         your mail did not reach the list. Just FYI, I think you were
>         subscribed to the list with
>
>         marcis.pinnis@Tilde.lv <mailto:marcis.pinnis@Tilde.lv> (with
>         upper case "T" in tilde). I changed that to
>         marcis.pinnis@tilde.lv <mailto:marcis.pinnis@tilde.lv>, so
>         your next mail should reach the list. Some comments below.
>
>         2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv
>         <mailto:marcis.pinnis@tilde.lv>>
>
>         Dear Felix,
>
>         Thank you for the explanation. I see that the toolinfo can
>         manage the identification of tools. But does ITS also require
>         users (people) to be treated as tools?
>
>         We could rename "tool" to process - and would end up with
>         provenance. But maybe that's sufficient.
>
>             That was not clear to me. Or, does ITS specify separate
>             tags for identification of who/what added an annotation?
>
>         No, that's exactly the point: we don't have a way to specify
>         "who created an annotation?". The purpose of "tool info" is
>         just that. And it is - to use that nice word again -
>         "orthogonal" to the data category annotation itself. That is,
>         you want to relate it to its:term, but you don't want to
>         repeat it all the time, and you don't want to make it mandatory.
>
>             I guess, it is clear that a “termConfidence” is necessary.
>             And the “term” tag is required (the termCandidate can be
>             omitted as that could potentially be redundant if a
>             reference of the annotator or the authority of annotation
>             is given).
>
>             On the Translate question maybe you can explain a bit more
>             why, in your opinion, the 1) and 2) should be combined in
>             a general meaning? They both describe data that has to be
>             handled differently. The “Translate” category as I
>             understand solves either 1) or 2) (and this depends on
>             every implementation), but not both.
>
>         Yes, that was my point: we leave it to the implementation
>         whether the implementation wants to handle 1) or 2). The main
>         idea of ITS is to specify really atomic metadata items.
>
>         Your requirement to differentiate 1) vs. 2) could e.g. be
>         handled by a localization note:
>
>         <its:locNoteRule selector="//h:img" locNote="Drop this in the
>         workflow, don't give it to translator"/>
>
>         But you are probably looking for a machine readable way to
>         achieve this?
>
>         Best,
>
>         Felix
>
>             Best regards,
>
>             Mārcis.
>
>             *From:*Felix Sasaki [mailto:fsasaki@w3.org
>             <mailto:fsasaki@w3.org>]
>             *Sent:* Thursday, October 04, 2012 3:58 PM
>             *To:* Mārcis Pinnis
>             *Cc:* Tatiana Gornostay; Yves Savourel;
>             public-multilingualweb-lt@w3.org
>             <mailto:public-multilingualweb-lt@w3.org>; Raivis Skadiņš;
>             Andrejs Vasiļjevs
>
>
>             *Subject:* Re: [ISSUE-42] Wording for the tool information
>             markup
>
>             2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv
>             <mailto:marcis.pinnis@tilde.lv>>
>
>             Dear Felix,
>
>             Having only the confidence distinguishing between an
>             automatically identified term and a user approved term is
>             not enough as various term annotation tools can have
>             different confidence scores (they may be also in log form
>             depending on the implementation). Thus having a strict
>             value “1” for user approved/ term-bank based terms is not
>             enough. In an ideal scenario, at least from my
>             perspective, there should be a way to identify who (a
>             system, which system, a user, who?, an authority, which
>             authority?) annotated each term (not just in document
>             level, but also in individual term level) and what is the
>             confidence of the respective identifier given to the term
>             candidate (or even a term).
>
>             Understand. That might bring us to "toolinfo" again. The
>             solution that Yves mentioned at
>
>             http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
>             would allow you to create identifiers for this complex
>             type of information.
>
>                 To make it a bit more simple, using only
>                 termConfidence to distinguish between user approved or
>                 trusted terms is not enough as the termConfidence is
>                 not reliable for such purposes.
>
>                 A natural representation, in my opinion, would
>                 identify the “annotator” (using categories – term
>                 bank, user, automatic tool, authority), the term
>                 confidence and the ID of the “annotator” (in order to
>                 identify the annotator precisely).
>
>                 Of course, for TermBank based terms there should be
>                 also a reference pointer so that more information
>                 could be identified.
>
>             Understand - the question mainly is: what needs to be
>             standardized, and what could be a URI to that complex
>             information.
>
>                 Actually ... one question that is *off topic* here
>                 ... I tried following your discussions about the MT
>                 related “Translate” data category and a question
>                 arose: do you distinguish between something that:
>
>                 1) has to be passed through a translation system, but
>                 should not be translated (should be kept as is, but is
>                 helpful for disambiguation of the translatable parts);
>
>                 2) has to be completely ignored and not even passed
>                 through a translation system (for instance, numbers in
>                 tables, encrypted images within HTML5, etc.).
>
>                 From what I have understood (maybe I did not get the
>                 full picture) – the “Translate” tag is meant only for
>                 an MT system to tell it that something has to be kept
>                 as is, but some parts could be irrelevant to send
>                 through the MT systems, but that is not solved by the
>                 Translate tag.
>
>             "Translate" in fact is very general and doesn't
>             distinguish between 1) and 2). E.g. IIRC, in Okapi it is
>             used also to create pseudo translated text.
>
>             Best,
>
>
>             Felix
>
>                 Best regards,
>
>                 Mārcis Pinnis
>
>                 Researcher
>
>                 Tilde
>
>                 *From:*Felix Sasaki [mailto:fsasaki@w3.org
>                 <mailto:fsasaki@w3.org>]
>                 *Sent:* Thursday, October 04, 2012 2:54 PM
>                 *To:* Tatiana Gornostay
>                 *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org
>                 <mailto:public-multilingualweb-lt@w3.org>; Mārcis
>                 Pinnis; Raivis Skadiņš; Andrejs Vasiļjevs
>
>
>                 *Subject:* Re: [ISSUE-42] Wording for the tool
>                 information markup
>
>                 Dear Tatiana, all,
>
>                 2012/10/3 Tatiana Gornostay
>                 <tatiana.gornostay@tilde.lv
>                 <mailto:tatiana.gornostay@tilde.lv>>
>
>                 Dear Felix, Yves, Dear All,
>
>                 W.r.t. the ongoing discussion on /toolInfo/ and
>                 /mtConfidence/, I have in mind the following potential
>                 attributes proposed by Tilde in view of terminology
>                 use case, I mean, /its-termInfoRef/,
>                 /its-termCandidate/, and /its-termConfidence/ and
>                 their values.
>
>                 Would it also work to just add "termConfidence" to
>
>                 http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
>                 we then could say: if something is a term then the
>                 confidence is 1, that is
>
>                 <span its:term="yes" its:termInfoRef="...">...</span>
>                 (ITS 1.0 or ITS 2.0)
>
>                 is equal to
>
>                 <span its:term="yes" its:termInfoRef="..."
>                 termConfidence="1">...</span> (ITS 2.0)
>
>                 and a term candidate would be
>
>                 <span its:term="yes" its:termInfoRef="..."
>                 termConfidence="0.9">...</span> (ITS 2.0)
>
>                 Felix
>
>                     These are not represented in the current draft
>                     and if we go this way then we will have to
>                     discuss and, probably, add them. I can remember
>                     that Tadej raised this question in Prague and we
>                     did not talk about it, unfortunately. On the other
>                     hand, as soon as we start the project we will have
>                     opportunity and time to do it and my colleagues
>                     will also join the discussion.
>
>                     With best wishes,
>
>                     Tatiana
>
>                     *From:*Felix Sasaki [mailto:fsasaki@w3.org
>                     <mailto:fsasaki@w3.org>]
>                     *Sent:* Wednesday, October 03, 2012 12:29 AM
>                     *To:* Yves Savourel
>                     *Cc:* public-multilingualweb-lt@w3.org
>                     <mailto:public-multilingualweb-lt@w3.org>
>
>
>                     *Subject:* Re: [ISSUE-42] Wording for the tool
>                     information markup
>
>                     Hi Yves, all,
>
>                     no opinion on my side on the delimiter topic,
>                     sorry for bringing it up. A comment on the tool
>                     specific aspect below.
>
>                     2012/10/2 Yves Savourel <ysavourel@enlaso.com
>                     <mailto:ysavourel@enlaso.com>>
>
>                     > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
>                     > xmlns:its="http://www.w3.org/2005/11/its">
>                     >
>
>                     > Would it make sense to use a different delimiter? "/" may conflict
>                     > with "/" in paths.
>
>                     Hmm... almost any ASCII delimiter may also be in
>                     the path. The first occurrence is the delimiter.
>                     But I suppose '|' could be used instead. It just
>                     doesn't look as graceful for some reason.
>
>
>
>                     > Do you need the "dataCategory" attribute? It seems the
>                     > data category is made explicit via the reference
>                     > mechanism in "its:toolRefs".
>                     > Also, dropping the "dataCategory" attribute
>                     > allows then to refer to the same tools from various
>                     > data categories - e.g. OKAPI used for quality
>                     > issue versus for creating translation metadata etc.
>
>                     I'm not sure we can go from many data category
>                     instances to one tool information. And this is
>                     where I'm having trouble with tool information:
>
>                     The mtConfidence need to have a defined way to
>                     specify the engine used
>
>                     Is there really a defined way? The current version
>                     of the draft at
>
>                     http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
>                     says:
>
>                     "Some examples of values are:
>
>                     A BCP 47 language tag with t-extension, e.g.
>                     ja-t-it for an Italian to Japanese MT engine
>
>                     A Domain as per the Section 6.9: Domain
>
>                     A privately structured string, eg.
>                     Domain:IT-Pair:IT-JA, IT-JA:Medical, etc."
>
>                     To me that is the same as saying: you can use
>                     anything. Of course we can wrap the "anything" in
>                     a field saying "here is MT engine information". Is
>                     that what you mean?
>
>                         , the Text analysis may need something else
>
>                     I actually doubt that the text analysis "anything"
>                     will be more specific. My prediction is that there
>                     will be not more interop than saying "in this
>                     field there is data category specific information:
>                     ...".
>
>                     So you could achieve that by changing your
>                     proposal like this
>
>                     <its:processInfo>
>                       <its:toolInfo xml:id="T1">
>                        <its:toolName>Bing Translator</its:toolName>
>                        <its:toolVersion>123</its:toolVersion>
>                        <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>                       </its:toolInfo>
>                       <its:toolInfo xml:id="T2">
>                        <its:toolName>myMT</its:toolName>
>                        <its:toolVersion>456</its:toolVersion>
>                        <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>                       </its:toolInfo>
>                     </its:processInfo>
>
>                     and allow for several addInfo elements in one
>                     "toolInfo". You won't gain a lot from these, but
>                     no less than with "FR-to-EN-General" inside
>                     "toolValue" at
>
>                     http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
>                     Best,
>
>                     Felix
>
>                         , etc. It seems each data category will need
>                         one or two entry that mean different things
>                         depending on the data category. We can use a
>                         common element for this, but then we need to
>                         have one tool information per data category.
>
>                         Maybe the examples people are working on
>                         (action items 239 to 243 for Arle, Phil,
>                         Declan and Tadej) will help in defining this.
>
>                         Cheers
>                         -yves
>
>
>
>                     -- 
>                     Felix Sasaki
>
>                     DFKI / W3C Fellow
>
Received on Tuesday, 9 October 2012 23:57:53 UTC