Re: [ISSUE-42] Wording for the tool information markup

2012/10/15 Mārcis Pinnis <marcis.pinnis@tilde.lv>

> Hi Felix,
>
> Seems like provenance could do the trick. Although, as you said, this
> would “hardwire” the provenance and translate categories.
>


Yes, but only in a specific application, not in ITS 2.0 itself. That's not
the best solution, but better than changing "translate", IMO.
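
To illustrate what "in a specific application" could look like, here is a
rough sketch in Python. It assumes the application has already extracted two
facts per text unit: the ITS translate value and a flag, derived from
provenance or any other application-specific source, saying whether the unit
carries useful context. The names Unit, useful_context and classify are made
up for this mail and are not part of ITS 2.0; the sample strings are invented.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    text: str
    translate: bool       # ITS "translate": yes -> True, no -> False
    useful_context: bool  # hypothetical flag, e.g. derived from provenance


def classify(unit: Unit) -> str:
    """Map a unit to one of the three groups discussed in this thread."""
    if unit.translate:
        return "translatable"
    if unit.useful_context:
        return "context-only"  # pass to the engine, but keep as is
    return "ignorable"         # never send to the engine


# Invented sample data, one unit per group.
units = [
    Unit("Click the button", True, True),
    Unit("Tilde MT", False, True),
    Unit("iVBORw0KGgo...", False, False),
]
for u in units:
    print(classify(u), "<-", u.text)
```

ITS itself still only sees translate yes/no here; the third group exists
only inside the application.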

Best,

Felix


> But ... I guess, there are no better suggestions?!
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 11, 2012 3:49 PM
> *To:* Mārcis Pinnis
> *Cc:* Dave Lewis; public-multilingualweb-lt@w3.org
>
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis, all,
>
> 2012/10/11 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Dave,
>
> With the third option I mean the situation where you have some
> information embedded in the data (the format or tags do not actually
> matter), let’s say 5 MB of encoded data, which should never be processed
> by a translation engine, as that would be a useless waste of
> computational resources (large amounts of such information also
> sometimes raise stability issues and require much more intensive
> development effort to make systems stable enough). If you do process it
> and say that it is useful context, but keep the translation as is, you
> actually ask the MT engine to deal with such possibly vast amounts of data
> and use it for contextual information. But it may not even contain any
> useful contextual information.
>
> In my opinion, when building a Web access MT system, I personally would
> divide all data into three groups: 1) translatable, 2) non-translatable with
> useful contextual information, 3) non-translatable with no useful
> contextual information (ignorable).
>
> The question is whether you want ITS to allow MT engines to identify
> the third category, or whether you think that it is not relevant to ITS.
> Nowadays, when formats keep changing and get overfilled with embedded
> information, I think it would be useful to be able to distinguish between
> all three categories and not just the two. Any thoughts?
>
> We may run in circles a bit ... but let me summarize the background: We
> cannot change translate; there are too many existing MT tools (e.g. online
> MT systems), localization tools (without any MT), and formats
> (HTML5, DITA, ...) that rely on just two values, yes and no.
>
> So we can continue to discuss "translate", but it cannot be changed for
> the above reasons.
>
> Now, your use case 3) could be realized with a combination of data
> categories. The combination translate + localeFilter is probably a bad
> choice, but how about translate (or not translate) + provenance? We will
> soon have a draft of provenance, so maybe we can develop examples from
> there.
>
> The bottom line is that you don't want to hardwire such combinations of
> data categories. The basic idea of ITS is that data categories are
> "atomic" in the sense that they convey a minimum piece of information, to
> be used in many different workflows (e.g. human translation, MT, or no
> translation at all).
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> *Sent:* Wednesday, October 10, 2012 2:57 AM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis, Felix
> I'm not sure I fully understand the use case you are addressing with these
> translation enumeration extensions.
>
> I know from Declan that with Moses, you can handle no-translate text just
> by marking it as something to be translated as itself, so it still gets
> physically processed by the engine, but this is simpler than removing the
> text (with some loss of context). So annotations designed to prevent
> 'unnecessary' machine translations may not be very worthwhile.
>
> Is the use case more, therefore, that you want to alert the translation
> provider that the text probably won't be well translated by machine and
> should be prioritised for human translation or postediting?
>
> Either way I'd reinforce Felix's point about the problems with changing the
> translation enumeration. It would be a backward compatibility violation
> with ITS 1.0, and a major one because there are several implementations
> using the existing yes/no enumeration.
>
> The prioritisation of certain processes was actually a requirement we
> identified early on (coming from an open session we held at a
> MultilingualWeb workshop in Luxembourg): see:
>
> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#readiness
>
> This might be a better route to meeting this use case.
>
> cheers,
> Dave
>
>
>
>
> On 09/10/2012 14:29, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/9 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi, all,
>
> (replied inline)
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Tadej Štajner [mailto:tadej.stajner@ijs.si]
> *Sent:* Tuesday, October 09, 2012 3:02 PM
> *To:* Felix Sasaki
> *Cc:* Mārcis Pinnis; Tatiana Gornostay; Yves Savourel;
> public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi, all,
>
> (reply inline)
>
> On 09. 10. 2012 09:15, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Felix,
>
> I believe that the “processInfo” (if renamed from “toolInfo”) will not
> overlap with provenance (although I do not think that "process" is the
> right name; “annotatorInfo” would sound more reasonable). Provenance is
> something that is assigned to a term (a specific concept) by an authority
> and not by the annotation or an annotation tool/user. That is, a user
> could mark a term, but he would not be responsible for the provenance of
> the term, as that is assigned to the term in a term bank by someone with
> rights to do so (or by the creator of the term). Also, provenance for
> terms is already given in a term bank, thus we would not need to
> standardize something that can be referenced (following your thought of
> what can be referenced and what should be standardized). However, for
> automated processes it can be useful to know how trustworthy an
> annotation is. This can be done in two ways: 1) follow a term bank
> reference and check the provenance for terms that are linked to a term
> bank entry; 2) decide, based on the annotator, how trustworthy the term
> might be (for term candidates and terms not linked to a term bank entry).
>
> I hope our understanding of what provenance is in this case does not
> differ (I am referring to term provenance)?! If by provenance you meant
> something like the “annotation’s provenance”, then I agree that, by
> identifying the annotator, we will also add an annotation provenance.
> However, automated systems can benefit if the source of the content
> annotation can be identified (or at least traced...). What are your
> thoughts on this matter? How much do you want to ensure traceability in
> ITS?
>
> I would like to keep the principle of disjunct data categories, and leave
> it to applications to interrelate provenance information for the content.
> W.r.t. traceability of ITS information, yes, I agree - that IMO would be
> the main use case for tool information. For the question of whether
> traceability can be assured "only" via a URI, see
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> Mārcis, Tadej, David, ... any thoughts?
>
> As I understand, we're dealing with:
> 1) provenance of term itself
> 2) provenance of an instance annotation of the term in some text
>
> 1 is probably out of scope, 2 is something that we'd cover by the
> toolInfo/processInfo attribute. Maybe 1) is also interesting in some cases,
> but I would speculate that it's rarely something I'd want to inline in a
> document with an annotation.
>
> Also, would 'agent' be a clearer term for 'tool info' or 'process info'?
>
> -- Tadej
>
> 1 is covered in term banks (or ... at least should be) and is probably out
> of scope as I understand it. Actually, this is a data category that, if
> necessary, should be resolved by applications (programs/users) following
> the references to the term entries in a term bank (if such are given), so
> the annotation should not duplicate it.
>
> For 2, I think Tadej’s idea about “agentInfo” is more appropriate than
> “toolInfo” or “processInfo”.
>
> Felix
>
> About Translate, I meant the understanding from a machine user’s
> perspective. For a machine user (MT system), 1) and 2) may be equally
> important, and it would be good if the machine user were able to
> distinguish the two types within a document. If I understand locNote
> correctly, this category is not meant for machine users, but rather for
> human translators.
>
> I agree with your statements about locNote, and I understand the need to
> distinguish the two types in a document. What you describe as 2) could be
> achieved by locale filter
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation
>
> e.g.
>
> <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
>   <its:localeFilterRule selector="//img" localeFilterList=""/>
> </its:rules>
>
> This expresses that all "img" elements are not part of the localization
> workflow. Would that fulfil your needs?
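
A rough sketch of how a consumer might act on such a rule, assuming it reads
an empty localeFilterList as "no locale at all" and therefore drops the
matched elements before handing content on. This is only an illustration in
Python: the function name drop_elements is made up, the "//img" selector is
hard-coded as a plain tag match instead of being evaluated as XPath, and the
sample document is invented.

```python
import xml.etree.ElementTree as ET


def drop_elements(root: ET.Element, tag: str) -> int:
    """Remove all descendants named `tag`, keeping the text that follows them."""
    removed = 0
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag != tag:
                prev = child
                continue
            # ElementTree stores the text after an element on that element
            # ("tail"), so move it to the previous sibling or the parent.
            tail = child.tail or ""
            if prev is None:
                parent.text = (parent.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)
            removed += 1
    return removed


doc = ET.fromstring(
    '<body><p>See <img src="a.png"/> here.</p><img src="b.png"/></body>'
)
print(drop_elements(doc, "img"))
print(ET.tostring(doc, encoding="unicode"))
```

Note that the elements are gone for every later step in the workflow, not
just for MT.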
>
> I agree, this would do the trick. However, won’t this corrupt the data for
> other purposes (for instance, if currencies in a table had to be
> converted (not translated) to a different locale's currency by some
> specialists)? That is, I think that re-using the locale filter for MT
> purposes might actually cause some other processes not to work... An easier
> solution, in my opinion, would be to make the Translate category enumerable
> (translate="keep-as-is" or translate="no"; translate="yes";
> translate="ignore", with "ignore" indicating that a segment has
> to be ignored/skipped by a translation engine). Any thoughts on this?
>
> I agree with your feedback about localeRule. However, overloading
> "translate" would cause a mismatch with other vocabularies that use a
> "translate" attribute: e.g. both DITA and HTML5 have a translate attribute,
> in no namespace or in a different one, with the same semantics as ITS
> "translate". Adding more values would create a misalignment.
>
> To get a feeling for the importance of this: who would implement an
> additional value for "translate" (or the meaning of "keep-as-is" in a
> separate data category), and who would need that use case?
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 6:40 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis,
>
> your mail did not reach the list. Just FYI, I think you were subscribed to
> the list with marcis.pinnis@Tilde.lv (with an upper case "T" in "Tilde") and
> would have needed to send from exactly that address. I changed that to
> marcis.pinnis@tilde.lv, so your next mail should reach the list. Some
> comments below.
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Thank you for the explanation. I see that the toolinfo can manage the
> identification of tools. But does ITS also require users (people) to be
> treated as tools?
>
> We could rename "tool" to "process" - and would end up with provenance. But
> maybe that's sufficient.
>
> That was not clear to me. Or does ITS specify separate tags for
> identification of who/what added an annotation?
>
> No, that's exactly the point: we don't have a way to specify "who created
> an annotation?". The purpose of "tool info" is just that. And it is - to
> use that nice word again - "orthogonal" to the data category annotation
> itself. That is, you want to relate it to its:term, but you don't want to
> repeat it all the time, and you don't want to make it mandatory.
>
> I guess it is clear that a “termConfidence” is necessary. And the “term”
> tag is required (the termCandidate can be omitted, as it could potentially
> be redundant if a reference to the annotator or the authority of the
> annotation is given).
>
> On the Translate question, maybe you can explain a bit more why, in your
> opinion, 1) and 2) should be combined in a general meaning? They both
> describe data that has to be handled differently. The “Translate” category,
> as I understand it, solves either 1) or 2) (and this depends on each
> implementation), but not both.
>
> Yes, that was my point: we leave it to the implementation whether it
> wants to handle 1) or 2). The main idea of ITS is to specify
> really atomic metadata items.
>
> Your requirement to differentiate 1) vs. 2) could e.g. be handled by a
> localization note:
>
> <its:locNoteRule selector="//h:img" locNoteType="description">
>   <its:locNote>Drop this in the workflow, don't give it to translator</its:locNote>
> </its:locNoteRule>
>
> But you are probably looking for a machine readable way to achieve this?
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis.
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 3:58 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Having only the confidence to distinguish between an automatically
> identified term and a user-approved term is not enough, as various term
> annotation tools can have different confidence scores (they may also be in
> log form, depending on the implementation). Thus having a strict value “1”
> for user-approved / term-bank-based terms is not enough. In an ideal
> scenario, at least from my perspective, there should be a way to identify
> who (a system, which system? a user, who? an authority, which authority?)
> annotated each term (not just at document level, but also at individual
> term level) and what the confidence of the respective identifier given
> to the term candidate (or even a term) is.
>
> Understood. That might bring us to "toolinfo" again. The solution that
> Yves mentioned at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> would allow you to create identifiers for this complex type of
> information.
>
> To put it more simply, using only termConfidence to distinguish
> between user-approved or trusted terms is not enough, as termConfidence
> is not reliable for such purposes.
>
> A natural representation, in my opinion, would identify the “annotator”
> (using categories: term bank, user, automatic tool, authority), the term
> confidence and the ID of the “annotator” (in order to identify the
> annotator precisely).
>
> Of course, for term-bank-based terms there should also be a reference
> pointer so that more information could be identified.
>
> Understood - the question mainly is: what needs to be standardized, and
> what could be a URI to that complex information.
>
> Actually ... one question that is *off topic* here ... I tried
> following your discussions about the MT-related “Translate” data category
> and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not be
> translated (should be kept as is, but is helpful for disambiguation of the
> translatable parts);
>
> 2) has to be completely ignored and not even passed through a
> translation system (for instance, numbers in tables, encrypted images
> within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture), the
> “Translate” tag is meant only to tell an MT system that something
> has to be kept as is; some parts could be irrelevant to send through
> the MT system at all, but that is not solved by the Translate tag.
>
> "Translate" in fact is very general and doesn't distinguish between 1) and
> 2). E.g. IIRC, in Okapi it is also used to create pseudo-translated text.
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis Pinnis
>
> Researcher
>
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in
> mind the following potential attributes proposed by Tilde in view of the
> terminology use case, namely *its-termInfoRef*, *its-termCandidate*, and
> *its-termConfidence* and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> We could then say: if something is a term, then the confidence is 1, that is
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." its:termConfidence="1">...</span>
> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." its:termConfidence="0.9">...</span>
> (ITS 2.0)
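
A rough sketch in Python of how a consumer could read this, assuming the
attribute ends up as termConfidence in the ITS namespace and that a missing
value defaults to 1, as proposed above. The function name term_confidence
and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ITS = "http://www.w3.org/2005/11/its"


def term_confidence(elem: ET.Element) -> Optional[float]:
    """Return the confidence of a term annotation, or None if not a term."""
    if elem.get(f"{{{ITS}}}term") != "yes":
        return None
    # Assumption from this thread: a term without explicit confidence is 1.
    return float(elem.get(f"{{{ITS}}}termConfidence", "1"))


SAMPLE = (
    '<p xmlns:its="http://www.w3.org/2005/11/its">'
    '<span its:term="yes">MT</span> '
    '<span its:term="yes" its:termConfidence="0.9">engine</span> '
    "<span>text</span></p>"
)
for span in ET.fromstring(SAMPLE):
    print(span.text, term_confidence(span))
```

Whether a value below 1 means "term candidate" is then a decision for the
consuming application, not something this sketch settles.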
>
> Felix
>
> These are not represented in the current draft, and if we go this way then
> we will have to discuss and, probably, add them. I remember that Tadej
> raised this question in Prague and we did not talk about it, unfortunately.
> On the other hand, as soon as we start the project we will have the
> opportunity and time to do it, and my colleagues will also join the
> discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A
> comment on the tool-specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > xmlns:its="http://www.w3.org/2005/11/its">
> >
> > Would it make sense to use a different delimiter? "/" may conflict with
> > "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first
> occurrence is the delimiter.
> But I suppose '|' could be used instead. It just doesn't look as graceful
> for some reason.
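
The "first occurrence is the delimiter" rule is easy to implement; a small
sketch in Python, where the function name parse_tool_ref is made up and the
value format is the one from the example above:

```python
from typing import Tuple


def parse_tool_ref(value: str, delimiter: str = "/") -> Tuple[str, str]:
    """Split 'dataCategory<delimiter>URI' at the first delimiter only."""
    category, sep, uri = value.partition(delimiter)
    if not sep or not category or not uri:
        raise ValueError(f"malformed tool reference: {value!r}")
    return category, uri


print(parse_tool_ref("mtConfidence/file:///tools.xml#T1"))
# -> ('mtConfidence', 'file:///tools.xml#T1')
```

Later "/" characters stay part of the URI, so the choice between "/" and
"|" is about readability rather than parsing.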
>
>
>
> > Do you need the "dataCategory" attribute? It seems the
> > data category is made explicit via the reference mechanism in
> > "its:toolRefs".
> > Also, dropping the "dataCategory" attribute then allows referring to
> > the same tools from various data categories - e.g. OKAPI used for quality
> > issue versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool
> information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
> Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
> etc."
>
> To me that is the same as saying: you can use anything. Of course we can
> wrap the "anything" in a field saying "here is MT engine information". Is
> that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific.
> My prediction is that there will be no more interop than saying "in this
> field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>  <its:toolInfo xml:id="T1">
>   <its:toolName>Bing Translator</its:toolName>
>   <its:toolVersion>123</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>  </its:toolInfo>
>  <its:toolInfo xml:id="T2">
>   <its:toolName>myMT</its:toolName>
>   <its:toolVersion>456</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>  </its:toolInfo>
> </its:processInfo>
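
To show how a consumer might resolve such a block, here is a rough sketch in
Python. It assumes the proposed element and attribute names exactly as
written in this mail (processInfo, toolInfo, toolAddInfo, datacategory),
none of which are settled, and the function name add_info is made up.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

PROCESS_INFO = """
<its:processInfo xmlns:its="http://www.w3.org/2005/11/its">
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>
"""


def add_info(root: ET.Element, tool_id: str, datacategory: str) -> list:
    """Return all toolAddInfo values of one tool for one data category."""
    ns = {"its": ITS}
    for tool in root.findall("its:toolInfo", ns):
        if tool.get(XML_ID) != tool_id:
            continue
        return [
            info.text
            for info in tool.findall("its:toolAddInfo", ns)
            if info.get("datacategory") == datacategory
        ]
    return []  # unknown tool id


root = ET.fromstring(PROCESS_INFO)
print(add_info(root, "T1", "mtconfidence"))
print(add_info(root, "T2", "mtconfidence"))
```

The returned strings are still opaque to the consumer, which is the interop
limit described above.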
>
> and allow for several addInfo elements in one "toolInfo". You won't gain a
> lot from these, but no less than with "FR-to-EN-General" inside "toolValue"
> at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
> Best,
>
> Felix
>
> , etc. It seems each data category will need one or two entries that mean
> different things depending on the data category. We can use a common
> element for this, but then we need to have one tool information per data
> category.
>
> Maybe the examples people are working on (action items 239 to 243 for
> Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves
>
> --
> Felix Sasaki
>
> DFKI / W3C Fellow



-- 
Felix Sasaki
DFKI / W3C Fellow

Received on Tuesday, 16 October 2012 07:47:40 UTC