- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 16 Oct 2012 09:47:14 +0200
- To: Mārcis Pinnis <marcis.pinnis@tilde.lv>
- Cc: Dave Lewis <dave.lewis@cs.tcd.ie>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <CAL58czqNH-kXHnA3YM8Z4kor7iJgYKxsvJoRYuHWsug6JQMyYQ@mail.gmail.com>
2012/10/15 Mārcis Pinnis <marcis.pinnis@tilde.lv>

> Hi Felix,
>
> Seems like provenance could do the trick. Although, as you said, this would “hardwire” the provenance and translate categories.

Yes, but only in a specific application, not in ITS 2.0 itself. That's not the best solution, but better than changing "translate" IMO.

Best,

Felix

> But ... I guess, there are no better suggestions?!
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 11, 2012 3:49 PM
> *To:* Mārcis Pinnis
> *Cc:* Dave Lewis; public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis, all,
>
> 2012/10/11 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Dave,
>
> With the third option I mean the situation when you have, for instance, embedded in the data (what format or what tags does not actually matter) some information (let’s say 5MB of encoded data) which should never be processed with a translation engine, as that would be a useless waste of computational resources (large amounts of such information also sometimes raise stability issues ... and require much more intensive development efforts to make systems stable enough). If you do process it and say that it is useful context, but keep the translation as is, you actually ask the MT engine to deal with such maybe vast amounts of data and use it for contextual information. But ...
> it may not even contain any useful contextual information.
>
> In my opinion, when building a Web access MT system, I personally would divide all data in three groups: 1) translatable, 2) non-translatable with useful contextual information, 3) non-translatable with no useful contextual information (ignorable).
>
> The question is whether you want ITS to allow MT engines to identify the third category, or you think that it is not relevant to ITS? Nowadays, when formats get changed and overfilled with embedded information, I think it would be useful to be able to distinguish between all three categories and not just the two. Any thoughts?
>
> We may run in circles a bit ... but let me summarize the background: We cannot change translate; there are too many existing MT tools (e.g. online MT systems), localization tools (without any MT), and formats (HTML5, DITA, ...) that rely on just two values, yes and no.
>
> So we can continue to discuss "translate", but it cannot be changed for the above reasons.
>
> Now, your use case 3) could be realized with a combination of data categories. The combination translate + localeFilter is probably a bad choice, but how about translate (or not translate) + provenance? We soon will have a draft of provenance, so maybe we can develop examples from there.
>
> The bottom line is that you don't want to hardwire such combinations of data categories - the basic idea of ITS is that data categories are "atomic" in the sense of: really convey a minimum piece of information, to be used in many different workflows (both e.g.
> human translation, MT, or no translation at all).
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> *Sent:* Wednesday, October 10, 2012 2:57 AM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis, Felix
>
> I'm not sure I fully understand the use case you are addressing with these translation enumeration extensions.
>
> I know from Declan that with Moses, you can handle no translates just by marking the text as something to be translated as itself, so it still gets physically processed by the engine, but this is simpler than removing the text (with some loss of context). So annotations designed to prevent 'unnecessary' machine translations may not be very worthwhile.
>
> Is the use case more, therefore, that you want to alert the translation provider that the text probably won't be well translated by machine and should be prioritised for human translation or postediting?
>
> Either way I'd reinforce Felix's point about the problems with changing the translation enumeration. It would be a backward compatibility violation with ITS 1.0, and a major one because there are several implementations using the existing yes/no enumeration.
>
> The prioritisation of certain processes was actually a requirement we identified early on (coming from an open session we held at a MultilingualWeb workshop in Luxembourg): see:
>
> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#readiness
>
> This might be a better route to meeting this use case.
>
> cheers,
> Dave
>
> On 09/10/2012 14:29, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/9 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi, all,
>
> (replied inline)
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Tadej Štajner [mailto:tadej.stajner@ijs.si]
> *Sent:* Tuesday, October 09, 2012 3:02 PM
> *To:* Felix Sasaki
> *Cc:* Mārcis Pinnis; Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi, all,
>
> (reply inline)
>
> On 09. 10. 2012 09:15, Felix Sasaki wrote:
>
> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Felix,
>
> I believe that the “processInfo” (if renamed from “toolInfo”) will not overlap with provenance (although, I do not think that process is the right name – annotatorInfo would sound more reasonable). Provenance is something that is assigned to a term (a specific concept) by an authority and not the annotation or an annotation tool/user. That is, a user could mark a term, but he would not be responsible for the provenance of the term as that is assigned to the term in a term bank by someone with rights to do so (or the creator of the term). Also, provenance for terms is already given in a term bank, thus we would not need to standardize something that can be referenced to (following your thought of what can be referenced and what should be standardized). However, for automated processes it can be useful to know how trustworthy an annotation is.
> This can be done in two ways – 1) follow a term bank reference and check the provenance for terms that are linked to a term bank entry; 2) decide, based on the annotator, how trustworthy the term might be (for term candidates and terms not linked to a term bank entry).
>
> I hope our understanding of what provenance in this case is does not differ (I am referring to term provenance)?! If by provenance you meant something like the “annotation’s provenance”, then I agree that, by identifying the annotator, we will also add an annotation provenance. However, automated systems can benefit if the source of the content annotation can be identified (or at least traced...). What are your thoughts in this matter? How much do you want to ensure traceability in ITS?
>
> I would like to keep the principle of disjunct data categories, and leave it to applications to interrelate provenance information for the content. Wrt traceability of ITS information, yes, I agree - that IMO would be the main use case for tool information. The question is whether traceability can be assured "only" via a URI, see
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> Mārcis, Tadej, David, ... any thoughts?
>
> As I understand, we're dealing with:
> 1) provenance of term itself
> 2) provenance of an instance annotation of the term in some text
>
> 1 is probably out of scope, 2 is something that we'd cover by the toolInfo/processInfo attribute. Maybe 1) is also interesting in some cases, but I would speculate that it's rarely something I'd want to inline in a document with an annotation.
>
> Also, would 'agent' be a clearer term for 'tool info' or 'process info'?
>
> -- Tadej
>
> 1 is covered in term banks (or ... at least should be) and probably is out of scope as I understand it.
> Actually this is a data category that, if necessary, should be resolved by applications (programs/users) following the references to the term entries in a term bank (if such are given), thus the annotation should not be redundant.
>
> For 2, I think Tadej’s idea about “agentInfo” is more appropriate than “toolInfo” or “processInfo”.
>
> Felix
>
> About Translate, I meant the understanding from a machine user’s perspective. For a machine user (MT system) 1) and 2) may be equally important and it would be good if the machine user would be able to distinguish the two types within a document. If I understand locNote correctly, this category is not meant for machine users, but rather human translators.
>
> I agree with your statements about locNote, and I understand the need to distinguish the two types in a document. What you describe as 2) could be achieved by locale filter
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation
>
> e.g.
>
> <its:rules version="2.0">
>   <its:localeFilterRule selector="//img" localeFilterList=""/>
> </its:rules>
>
> This expresses that all "img" elements are not part of the localization workflow. Would that fulfil your needs?
>
> I agree, this would do the trick. However, won’t this corrupt the data for other purposes (for instance, if in a table currencies would have to be converted (not translated) to a different locale currency by some specialists)? That is, I think that re-using of the locale filter for MT purposes might actually cause some other processes not to work... An easier solution, in my opinion, would be to make the Translate category enumerable (translate="keep-as-is" or translate="no"; translate="yes"; translate="ignore", ignore being the indication that a segment would have to be ignored/skipped by a translation engine).
> Any thoughts on this?
>
> I agree with your feedback about localeRule. However, overloading "translate" would cause a mismatch with other vocabularies that use a "translate" attribute: e.g. both DITA and HTML5 have a translate attribute in no or different namespace with the same semantics as ITS "translate". Adding more values would create a misalignment.
>
> To get a feeling about the importance of this: who would implement an additional value for "translate" (or the meaning of "keep-as-is" in a separate data category) - who would need that use case?
>
> Felix
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 6:40 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis,
>
> your mail did not reach the list. Just FYI, I think you were subscribed to the list with need to send it with marcis.pinnis@Tilde.lv (with upper case "T" in tilde.) I changed that to marcis.pinnis@tilde.lv, so your next mail should reach the list. Some comments below.
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Thank you for the explanation. I see that the toolinfo can manage the identification of tools. But does ITS also require users (people) to be treated as tools?
>
> We could rename "tool" to process - and would end up with provenance. But maybe that's sufficient.
>
> That was not clear to me.
> Or, does ITS specify separate tags for identification of who/what added an annotation?
>
> No, that's exactly the point: we don't have a way to specify "who created an annotation?". The purpose of "tool info" is just that. And it is - to use that nice word again - "orthogonal" to the data category annotation itself. That is, you want to relate it to its:term, but you don't want to repeat it all the time, and you don't want to make it mandatory.
>
> I guess, it is clear that a “termConfidence” is necessary. And the “term” tag is required (the termCandidate can be omitted as that could potentially be redundant if a reference of the annotator or the authority of annotation is given).
>
> On the Translate question maybe you can explain a bit more why, in your opinion, the 1) and 2) should be combined in a general meaning? They both describe data that has to be handled differently. The “Translate” category as I understand solves either 1) or 2) (and this depends on every implementation), but not both.
>
> Yes, that was my point: we leave it to the implementation whether the implementation wants to handle 1) or 2). The main idea of ITS is to specify really atomic metadata items.
>
> Your requirement to differentiate 1) vs. 2) could e.g.
> be handled by a localization note:
>
> <its:locNoteRule selector="//h:img" locNote="Drop this in the workflow, don't give it to translator"/>
>
> But you are probably looking for a machine readable way to achieve this?
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis.
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 3:58 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Having only the confidence distinguishing between an automatically identified term and a user approved term is not enough as various term annotation tools can have different confidence scores (they may be also in log form depending on the implementation). Thus having a strict value “1” for user approved / term-bank based terms is not enough. In an ideal scenario, at least from my perspective, there should be a way to identify who (a system, which system, a user, who?, an authority, which authority?) annotated each term (not just in document level, but also in individual term level) and what is the confidence of the respective identifier given to the term candidate (or even a term).
>
> Understand. That might bring us to "toolinfo" again. The solution that Yves mentioned at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> would allow you to create identifiers for this complex type of information.
>
> To make it a bit more simple, using only termConfidence to distinguish between user approved or trusted terms is not enough as the termConfidence is not reliable for such purposes.
>
> A natural representation, in my opinion, would identify the “annotator” (using categories – term bank, user, automatic tool, authority), the term confidence and the ID of the “annotator” (in order to identify the annotator precisely).
>
> Of course, for TermBank based terms there should be also a reference pointer so that more information could be identified.
>
> Understand - the question mainly is: what needs to be standardized, and what could be a URI to that complex information.
>
> Actually ... one question that is *off topic* here ... I tried following your discussions about the MT related “Translate” data category and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not be translated (should be kept as is, but is helpful for disambiguation of the translatable parts);
>
> 2) has to be completely ignored and not even passed through a translation system (for instance, numbers in tables, encrypted images within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture) – the “Translate” tag is meant only for an MT system to tell it that something has to be kept as is, but some parts could be irrelevant to send through the MT systems, and that is not solved by the Translate tag.
>
> "Translate" in fact is very general and doesn't distinguish between 1) and 2). E.g. IIRC, in Okapi it is used also to create pseudo translated text.
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis Pinnis
> Researcher
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in mind the following potential attributes proposed by Tilde in view of the terminology use case, I mean, *its-termInfoRef*, *its-termCandidate*, and *its-termConfidence* and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> we then could say: if something is a term, then the confidence is 1, that is
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span> (ITS 2.0)
>
> Felix
>
> These are not represented in the current draft and if we go this way then we will have to discuss and, probably, add them. I can remember that Tadej raised this question in Prague and we did not talk about it, unfortunately.
> On the other hand, as soon as we start the project we will have opportunity and time to do it and my colleagues will also join the discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A comment on the tool specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > xmlns:its="http://www.w3.org/2005/11/its">
> >
> > Would it make sense to use a different delimiter? "/" may conflict with "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first occurrence is the delimiter.
> But I suppose '|' could be used instead. It just doesn't look as graceful for some reason.
>
> > Do you need the "dataCategory" attribute? It seems the data category is made explicit via the reference mechanism in "its:toolRefs".
> > Also, dropping the "dataCategory" attribute allows then to refer to the same tools from various data categories - e.g. OKAPI used for quality issue versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g.
> ja-t-it for an Italian to Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical, etc."
>
> To me that is the same as saying: you can use anything. Of course we can wrap the "anything" in a field saying "here is MT engine information". Is that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific. My prediction is that there will be not more interop than saying "in this field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>   <its:toolInfo xml:id="T1">
>     <its:toolName>Bing Translator</its:toolName>
>     <its:toolVersion>123</its:toolVersion>
>     <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>   </its:toolInfo>
>   <its:toolInfo xml:id="T2">
>     <its:toolName>myMT</its:toolName>
>     <its:toolVersion>456</its:toolVersion>
>     <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>   </its:toolInfo>
> </its:processInfo>
>
> and allow for several addInfo elements in one "toolInfo". You won't gain a lot from these, but not less than with "FR-to-EN-General" inside "toolValue" at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
> Best,
>
> Felix
>
> , etc. It seems each data category will need one or two entries that mean different things depending on the data category. We can use a common element for this, but then we need to have one tool information per data category.
>
> Maybe the examples people are working on (action items 239 to 243 for Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves
>
> --
> Felix Sasaki
> DFKI / W3C Fellow

--
Felix Sasaki
DFKI / W3C Fellow
Received on Tuesday, 16 October 2012 07:47:40 UTC
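
A note on the delimiter question quoted in the thread: the rule Yves describes ("the first occurrence is the delimiter") can be sketched in a few lines. This is only an illustration under stated assumptions, not part of any ITS 2.0 draft: the function name `parse_tool_ref` is hypothetical, and the only input format assumed is the `dataCategory/URI` shape of the quoted `its:toolRefs` example.

```python
def parse_tool_ref(value: str) -> tuple[str, str]:
    """Split one toolRefs entry into (data category, tool URI).

    Hypothetical helper: it applies the "first occurrence is the
    delimiter" rule from the thread, so any later '/' stays in the URI.
    """
    category, sep, uri = value.partition("/")
    if not sep or not category or not uri:
        raise ValueError(f"not a '<dataCategory>/<URI>' entry: {value!r}")
    return category, uri


# The entry quoted in the thread: only the first '/' separates the parts.
category, uri = parse_tool_ref("mtConfidence/file:///tools.xml#T1")
print(category)  # mtConfidence
print(uri)       # file:///tools.xml#T1
```

Splitting on the first occurrence works here because a data category name contains no "/", so a different delimiter such as '|' would change the look of the value but not the parsing logic.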