Re: [ISSUE-42] Wording for the tool information markup from Felix Sasaki on 2012-10-11 (public-multilingualweb-lt@w3.org from October 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 11 Oct 2012 14:49:26 +0200
To: Mārcis Pinnis <marcis.pinnis@tilde.lv>
Cc: Dave Lewis <dave.lewis@cs.tcd.ie>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <CAL58czq6RbtxeSHoWTo491OBTAF3=n3KVgPnr5dK1aydTR9zuA@mail.gmail.com>
Hi Mārcis, all,

2012/10/11 Mārcis Pinnis <marcis.pinnis@tilde.lv>

> Hi Dave,
>
> With the third option I mean the situation where the data contains
> embedded information (the format or tags do not actually matter), let’s
> say 5MB of encoded data, which should never be processed by a translation
> engine, as that would be a useless waste of computational resources (large
> amounts of such information also sometimes raise stability issues and
> require much more intensive development effort to make systems stable
> enough). If you do process it and say that it is useful context, but keep
> the translation as is, you actually ask the MT engine to deal with such
> possibly vast amounts of data and use it for contextual information. But
> it may not even contain any useful contextual information.
>
> In my opinion, when building a Web access MT system, I personally would
> divide all data into three groups: 1) translatable, 2) non-translatable
> with useful contextual information, 3) non-translatable with no useful
> contextual information (ignorable).
>
> The question is whether you want ITS to allow MT engines to identify
> the third category, or whether you think that it is not relevant to ITS.
> Nowadays, when formats change and get overfilled with embedded information,
> I think it would be useful to be able to distinguish between all three
> categories and not just the two. Any thoughts?
>

We may be running in circles a bit, but let me summarize the background: we
cannot change "translate". There are too many existing MT tools (e.g. online
MT systems), localization tools (without any MT), and formats
(HTML5, DITA, ...) that rely on just the two values "yes" and "no".

So we can continue to discuss "translate", but it cannot be changed for
the above reasons.

Now, your use case 3) could be realized with a combination of data
categories. The combination translate + localeFilter is probably a bad
choice, but how about translate (or not translate) + provenance? We will
soon have a draft of provenance, so maybe we can develop examples from
there.

The bottom line is that you don't want to hardwire such combinations of
data categories - the basic idea of ITS is that data categories are
"atomic" in the sense that each really conveys a minimum piece of
information, to be used in many different workflows (e.g. human
translation, MT, or no translation at all).
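
To make the "combination of atomic data categories" idea a bit more concrete, here is a minimal sketch, not a proposal for the spec. It assumes an MT front end that reads the existing its:translate attribute (yes/no, inherited by child elements) and, separately, a hypothetical ex:context="none" hint in a made-up namespace. That hint is only a stand-in for whatever second data category (e.g. provenance) would end up carrying the "no useful context" information. The three groups Mārcis describes are then derived by the application, without touching the values of "translate":

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
# Hypothetical namespace and attribute, NOT part of ITS: a placeholder for
# a second, independent data category that marks content as useless context.
EX_NS = "http://example.org/mt-hints"

SAMPLE = f"""
<doc xmlns:its="{ITS_NS}" xmlns:ex="{EX_NS}">
  <p>Translate me.</p>
  <code its:translate="no">printf()</code>
  <blob its:translate="no" ex:context="none">AAAA...</blob>
</doc>
"""


def classify(elem, inherited="yes"):
    """Yield (tag, group) pairs; its:translate is inherited by children."""
    translate = elem.get(f"{{{ITS_NS}}}translate", inherited)
    if translate == "yes":
        group = "translatable"
    elif elem.get(f"{{{EX_NS}}}context") == "none":
        group = "ignorable"      # do not even send to the MT engine
    else:
        group = "context-only"   # send, but keep as is
    yield elem.tag, group
    for child in elem:
        yield from classify(child, translate)


groups = dict(classify(ET.fromstring(SAMPLE)))
for tag, group in groups.items():
    print(tag, group)
```

A tool that only knows "translate" still sees plain yes/no here and keeps working; only an MT front end that also understands the second hint gets the third group.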

Best,

Felix





> Best regards,
>
> Mārcis ;o)
>
> *From:* Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> *Sent:* Wednesday, October 10, 2012 2:57 AM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis, Felix
> I'm not sure I fully understand the use case you are addressing with these
> translation enumeration extensions.
>
> I know from Declan that with Moses, you can handle no-translate content
> just by marking the text as something to be translated as itself, so it
> still gets physically processed by the engine, but this is simpler than
> removing the text (with some loss of context). So annotations designed to
> prevent 'unnecessary' machine translations may not be very worthwhile.
>
> Is the use case more, therefore, that you want to alert the translation
> provider that the text probably won't be well translated by machine and
> should be prioritised for human translation or postediting?
>
> Either way I'd reinforce Felix's point about the problems with changing
> the translation enumeration. It would be a backward compatibility violation
> with ITS 1.0, and a major one because there are several implementations
> using the existing yes/no enumeration.
>
> The prioritisation of certain processes was actually a requirement we
> identified early on (coming from an open session we held at a
> MultilingualWeb workshop in Luxembourg): see:
>
> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#readiness
>
> This might be a better route to meeting this use case.
>
> cheers,
> Dave
>
> On 09/10/2012 14:29, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/9 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi, all,
>
> (replied inline)
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Tadej Štajner [mailto:tadej.stajner@ijs.si]
> *Sent:* Tuesday, October 09, 2012 3:02 PM
> *To:* Felix Sasaki
> *Cc:* Mārcis Pinnis; Tatiana Gornostay; Yves Savourel;
> public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi, all,
>
> (reply inline)
>
> On 09. 10. 2012 09:15, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Felix,
>
> I believe that “processInfo” (if renamed from “toolInfo”) will not
> overlap with provenance (although I do not think that “process” is the
> right name; annotatorInfo would sound more reasonable). Provenance is
> something that is assigned to a term (a specific concept) by an authority
> and not by the annotation or an annotation tool/user. That is, a user could
> mark a term, but he would not be responsible for the provenance of the term,
> as that is assigned to the term in a term bank by someone with rights to do
> so (or the creator of the term). Also, provenance for terms is already given
> in a term bank, thus we would not need to standardize something that can be
> referenced (following your thought of what can be referenced and what
> should be standardized). However, for automated processes it can be useful
> to know how trustworthy an annotation is. This can be done in two ways:
> 1) follow a term bank reference and check the provenance for terms that are
> linked to a term bank entry; 2) decide, based on the annotator, how
> trustworthy the term might be (for term candidates and terms not linked to
> a term bank entry).
>
> I hope our understanding of what provenance is in this case does not
> differ (I am referring to term provenance)?! If by provenance you meant
> something like the “annotation’s provenance”, then I agree that, by
> identifying the annotator, we will also add an annotation provenance.
> However, automated systems can benefit if the source of the content
> annotation can be identified (or at least traced...). What are your
> thoughts on this matter? How much do you want to ensure traceability in ITS?
>
> I would like to keep the principle of disjunct data categories, and leave
> it to applications to interrelate provenance information for the content.
> W.r.t. traceability of ITS information, yes, I agree - that IMO would be the
> main use case for tool information. On the question of whether traceability
> can be assured "only" via a URI, see
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> Mārcis, Tadej, David, ... any thoughts?
>
> As I understand, we're dealing with:
> 1) provenance of term itself
> 2) provenance of an instance annotation of the term in some text
>
> 1 is probably out of scope, 2 is something that we'd cover by the
> toolInfo/processInfo attribute. Maybe 1) is also interesting in some cases,
> but I would speculate that it's rarely something I'd want to inline in a
> document with an annotation.
>
> Also, would 'agent' be a clearer term for 'tool info' or 'process info'?
>
> -- Tadej
>
> 1 is covered in term banks (or ... at least should be) and probably is out
> of scope as I understand it. Actually this is a data category that, if
> necessary, should be resolved by applications (programs/users) following
> the references to the term entries in a term bank (if such are given), thus
> the annotation should not be redundant.
>
> For 2, I think Tadej’s idea about “agentInfo” is more appropriate than
> “toolInfo” or “processInfo”.
>
> Felix
>
> About Translate, I meant the understanding from a machine user’s
> perspective. For a machine user (MT system) 1) and 2) may be equally
> important and it would be good if the machine user were able to
> distinguish the two types within a document. If I understand locNote
> correctly, this category is not meant for machine users, but rather for
> human translators.
>
> I agree with your statements about locNote, and I understand the need to
> distinguish the two types in a document. What you describe as 2) could be
> achieved by locale filter
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation
>
> e.g.
>
> <its:rules version="2.0"> <its:localeFilterRule selector="//img"
> localeFilterList=""/> </its:rules>
>
> This expresses that all "img" elements are not part of the localization
> workflow. Would that fulfil your needs?
>
> I agree, this would do the trick. However, won’t this corrupt the data for
> other purposes (for instance, if currencies in a table had to be
> converted (not translated) to a different locale's currency by some
> specialists)? That is, I think that re-using the locale filter for MT
> purposes might actually cause some other processes not to work... An easier
> solution, in my opinion, would be to make the Translate category enumerable
> (translate=”keep-as-is” or translate=”no”; translate=”yes”;
> translate=”ignore”, ignore being the indication that a segment would have
> to be ignored/skipped by a translation engine). Any thoughts on this?
>
> I agree with your feedback about localeRule. However, overloading
> "translate" would cause a mismatch with other vocabularies that use a
> "translate" attribute: e.g. both DITA and HTML5 have a translate attribute
> in no namespace or in a different namespace, with the same semantics as
> ITS "translate". Adding more values would create a misalignment.
>
> To get a feeling about the importance of this: who would implement an
> additional value for "translate" (or the meaning of "keep-as-is" in a
> separate data category) - who would need that use case?
>
> Felix
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 6:40 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis,
>
> your mail did not reach the list. Just FYI, I think you were subscribed to
> the list as marcis.pinnis@Tilde.lv (with upper case "T" in tilde) and would
> have needed to send it from that address. I changed that to
> marcis.pinnis@tilde.lv, so your next mail should reach the list. Some
> comments below.
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Thank you for the explanation. I see that the toolinfo can manage the
> identification of tools. But does ITS also require users (people) to be
> treated as tools?
>
> We could rename "tool" to process - and would end up with provenance. But
> maybe that's sufficient.
>
> That was not clear to me. Or, does ITS specify separate tags for
> identification of who/what added an annotation?
>
> No, that's exactly the point: we don't have a way to specify "who created
> an annotation?". The purpose of "tool info" is just that. And it is - to
> use that nice word again - "orthogonal" to the data category annotation
> itself. That is, you want to relate it to its:term, but you don't want to
> repeat it all the time, and you don't want to make it mandatory.
>
> I guess it is clear that a “termConfidence” is necessary. And the “term”
> tag is required (the termCandidate can be omitted as that could potentially
> be redundant if a reference of the annotator or the authority of annotation
> is given).
>
> On the Translate question, maybe you can explain a bit more why, in your
> opinion, 1) and 2) should be combined in a general meaning? They both
> describe data that has to be handled differently. The “Translate” category
> as I understand it solves either 1) or 2) (and this depends on every
> implementation), but not both.
>
> Yes, that was my point: we leave it to the implementation whether the
> implementation wants to handle 1) or 2). The main idea of ITS is to specify
> really atomic metadata items.
>
> Your requirement to differentiate 1) vs. 2) could e.g. be handled by a
> localization note:
>
> <its:locNoteRule selector="//h:img" locNote="Drop this in the workflow,
> don't give it to translator"/>
>
> But you are probably looking for a machine readable way to achieve this?
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis.
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 3:58 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Having only the confidence distinguish between an automatically
> identified term and a user approved term is not enough, as various term
> annotation tools can have different confidence scores (they may also be in
> log form depending on the implementation). Thus having a strict value “1”
> for user approved/term-bank based terms is not enough. In an ideal
> scenario, at least from my perspective, there should be a way to identify
> who (a system, which system, a user, who?, an authority, which authority?)
> annotated each term (not just at document level, but also at individual
> term level) and what is the confidence of the respective identifier given
> to the term candidate (or even a term).
>
> Understood. That might bring us to "toolinfo" again. The solution that
> Yves mentioned at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> would allow you to create identifiers for this complex type of
> information.
>
> To make it a bit simpler: using only termConfidence to distinguish
> between user approved or trusted terms is not enough, as the termConfidence
> is not reliable for such purposes.
>
> A natural representation, in my opinion, would identify the “annotator”
> (using categories: term bank, user, automatic tool, authority), the term
> confidence and the ID of the “annotator” (in order to identify the
> annotator precisely).
>
> Of course, for TermBank based terms there should also be a reference
> pointer so that more information could be identified.
>
> Understood - the question mainly is: what needs to be standardized, and
> what could be a URI to that complex information.
>
> Actually ... one question that is *off topic* here ... I tried
> following your discussions about the MT related “Translate” data category
> and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not be
> translated (should be kept as is, but is helpful for disambiguation of the
> translatable parts);
>
> 2) has to be completely ignored and not even passed through a
> translation system (for instance, numbers in tables, encrypted images
> within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture), the
> “Translate” tag is meant only for telling an MT system that something
> has to be kept as is. Some parts could be irrelevant to send through
> the MT system at all, but that is not solved by the Translate tag.
>
> "Translate" in fact is very general and doesn't distinguish between 1) and
> 2). E.g. IIRC, in Okapi it is used also to create pseudo translated text.
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis Pinnis
> Researcher
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in
> mind the following potential attributes proposed by Tilde in view of the
> terminology use case, I mean *its-termInfoRef*, *its-termCandidate*, and
> *its-termConfidence*, and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> We could then say: if something is a term, then the confidence is 1, that is
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span>
> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span>
> (ITS 2.0)
>
> Felix
>
> These are not represented in the current draft and if we go this way then
> we will have to discuss and, probably, add them. I can remember that Tadej
> raised this question in Prague and we did not talk about it, unfortunately.
> On the other hand, as soon as we start the project we will have the
> opportunity and time to do it and my colleagues will also join the
> discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A
> comment on the tool specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > xmlns:its="http://www.w3.org/2005/11/its">
>
> > Would it make sense to use a different delimiter? "/" may conflict with
> > "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first
> occurrence is the delimiter.
> But I suppose '|' could be used instead. It just doesn't look as graceful
> for some reason.
>
> > Do you need the "dataCategory" attribute? It seems the
> > data category is made explicit via the reference mechanism in
> > "its:toolRefs".
> > Also, dropping the "dataCategory" attribute then allows referring to
> > the same tools from various data categories - e.g. OKAPI used for quality
> > issue versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool
> information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
> Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
> etc."
>
> To me that is the same as saying: you can use anything. Of course we can
> wrap the "anything" in a field saying "here is MT engine information". Is
> that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific.
> My prediction is that there will be no more interop than saying "in this
> field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>  <its:toolInfo xml:id="T1">
>   <its:toolName>Bing Translator</its:toolName>
>   <its:toolVersion>123</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>  </its:toolInfo>
>  <its:toolInfo xml:id="T2">
>   <its:toolName>myMT</its:toolName>
>   <its:toolVersion>456</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>  </its:toolInfo>
> </its:processInfo>
>
>  ****
>
> and allow for several addInfo elements in one "toolInfo". You won't gain a
> lot from these, but not less as with "FR-to-EN-General" inside "toolValue"
> at****
>
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
> ****
>
>  ****
>
> Best,****
>
>  ****
>
> Felix****
>
>  ****
>
>  ****
>
> , etc. It seems each data category will need one or two entry that mean
> different things depending on the data category. We can use a common
> element for this, but then we need to have one tool information per data
> category.
>
> Maybe the examples people are working on (action items 239 to 243 for
> Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves****
>
> --
> Felix Sasaki
> DFKI / W3C Fellow



-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Thursday, 11 October 2012 13:09:41 UTC