- From: Tadej Štajner <tadej.stajner@ijs.si>
- Date: Tue, 09 Oct 2012 14:01:40 +0200
- To: Felix Sasaki <fsasaki@w3.org>
- CC: Mārcis Pinnis <marcis.pinnis@tilde.lv>, Tatiana Gornostay <tatiana.gornostay@tilde.lv>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Raivis Skadiņš <Raivis.Skadins@tilde.lv>, Andrejs Vasiļjevs <Andrejs@tilde.lv>
- Message-ID: <50741224.6060108@ijs.si>
Hi, all,

(reply inline)

On 09. 10. 2012 09:15, Felix Sasaki wrote:
> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Hi Felix,
>
> I believe that the “processInfo” (if renamed from “toolInfo”) will not overlap with provenance (although I do not think that process is the right name – annotatorInfo would sound more reasonable). Provenance is something that is assigned to a term (a specific concept) by an authority and not by the annotation or an annotation tool/user. That is, a user could mark a term, but he would not be responsible for the provenance of the term, as that is assigned to the term in a term bank by someone with rights to do so (or by the creator of the term). Also, provenance for terms is already given in a term bank, thus we would not need to standardize something that can be referenced (following your thought of what can be referenced and what should be standardized). However, for automated processes it can be useful to know how trustworthy an annotation is. This can be done in two ways:
>
> 1) follow a term bank reference and check the provenance for terms that are linked to a term bank entry;
> 2) decide based on the annotator how trustworthy the term might be (for term candidates and terms not linked to a term bank entry).
>
> I hope our understanding of what provenance is in this case does not differ (I am referring to term provenance)?! If by provenance you meant something like the “annotation’s provenance”, then I agree that, by identifying the annotator, we will also add an annotation provenance. However, automated systems can benefit if the source of the content annotation can be identified (or at least traced...). What are your thoughts in this matter? How much do you want to ensure traceability in ITS?
>
> I would like to keep the principle of disjunct data categories, and leave it to applications to interrelate provenance information for the content. With regard to traceability of ITS information, yes, I agree - that IMO would be the main use case for tool information. On the question of whether traceability can be assured "only" via a URI, see
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> Mārcis, Tadej, David, ... any thoughts?
>

As I understand, we're dealing with:

1) provenance of the term itself
2) provenance of an instance annotation of the term in some text

1) is probably out of scope; 2) is something that we'd cover with the toolInfo/processInfo attribute. Maybe 1) is also interesting in some cases, but I would speculate that it's rarely something I'd want to inline in a document with an annotation.

Also, would 'agent' be a clearer term for 'tool info' or 'process info'?

-- Tadej

> Felix
>
> About Translate, I meant the understanding from a machine user’s perspective. For a machine user (MT system) 1) and 2) may be equally important and it would be good if the machine user were able to distinguish the two types within a document. If I understand locNote correctly, this category is not meant for machine users, but rather for human translators.
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 6:40 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Mārcis,
>
> your mail did not reach the list. Just FYI, I think you were subscribed to the list with, and need to send from, marcis.pinnis@Tilde.lv (with upper case "T" in tilde).
> I changed that to marcis.pinnis@tilde.lv, so your next mail should reach the list. Some comments below.
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Thank you for the explanation. I see that the toolinfo can manage the identification of tools. But does ITS also require users (people) to be treated as tools?
>
> We could rename "tool" to process - and would end up with provenance. But maybe that's sufficient.
>
> That was not clear to me. Or, does ITS specify separate tags for identification of who/what added an annotation?
>
> No, that's exactly the point: we don't have a way to specify "who created an annotation?". The purpose of "tool info" is just that. And it is - to use that nice word again - "orthogonal" to the data category annotation itself. That is, you want to relate it to its:term, but you don't want to repeat it all the time, and you don't want to make it mandatory.
>
> I guess it is clear that a “termConfidence” is necessary. And the “term” tag is required (the termCandidate can be omitted as that could potentially be redundant if a reference of the annotator or the authority of annotation is given).
>
> On the Translate question, maybe you can explain a bit more why, in your opinion, 1) and 2) should be combined in a general meaning? They both describe data that has to be handled differently. The “Translate” category as I understand it solves either 1) or 2) (and this depends on every implementation), but not both.
>
> Yes, that was my point: we leave it to the implementation whether the implementation wants to handle 1) or 2). The main idea of ITS is to specify really atomic metadata items.
>
> Your requirement to differentiate 1) vs. 2) could e.g.
> be handled by a localization note:
>
> <its:locNoteRule selector="//h:img" locNote="Drop this in the workflow, don't give it to translator"/>
>
> But you are probably looking for a machine readable way to achieve this?
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis.
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 3:58 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Having only the confidence distinguishing between an automatically identified term and a user approved term is not enough, as various term annotation tools can have different confidence scores (they may also be in log form depending on the implementation). Thus having a strict value “1” for user approved / term-bank based terms is not enough. In an ideal scenario, at least from my perspective, there should be a way to identify who (a system, which system, a user, who?, an authority, which authority?) annotated each term (not just at document level, but also at individual term level) and what the confidence of the respective identifier given to the term candidate (or even a term) is.
>
> Understand. That might bring us to "toolinfo" again. The solution that Yves mentioned at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> would allow you to create identifiers for this complex type of information.
>
> To make it a bit more simple, using only termConfidence to distinguish between user approved or trusted terms is not enough as the termConfidence is not reliable for such purposes.
>
> A natural representation, in my opinion, would identify the “annotator” (using categories – term bank, user, automatic tool, authority), the term confidence and the ID of the “annotator” (in order to identify the annotator precisely).
>
> Of course, for TermBank based terms there should also be a reference pointer so that more information could be identified.
>
> Understand - the question mainly is: what needs to be standardized, and what could be a URI to that complex information.
>
> Actually ... one question that is *off topic* here ... I tried following your discussions about the MT related “Translate” data category and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not be translated (should be kept as is, but is helpful for disambiguation of the translatable parts);
>
> 2) has to be completely ignored and not even passed through a translation system (for instance, numbers in tables, encrypted images within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture) – the “Translate” tag is meant only for an MT system to tell it that something has to be kept as is, but some parts could be irrelevant to send through the MT systems, and that is not solved by the Translate tag.
>
> "Translate" in fact is very general and doesn't distinguish between 1) and 2). E.g. IIRC, in Okapi it is used also to create pseudo translated text.
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis Pinnis
> Researcher
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis; Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on /toolInfo/ and /mtConfidence/, I have in mind the following potential attributes proposed by Tilde in view of the terminology use case, I mean, /its-termInfoRef/, /its-termCandidate/, and /its-termConfidence/ and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> we could then say: if something is a term, then the confidence is 1, that is
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span> (ITS 2.0)
>
> Felix
>
> These are not represented in the current draft and if we go this way then we will have to discuss and, probably, add them. I can remember that Tadej raised this question in Prague and we did not talk about it, unfortunately. On the other hand, as soon as we start the project we will have the opportunity and time to do it and my colleagues will also join the discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A comment on the tool specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > >      xmlns:its="http://www.w3.org/2005/11/its">
> >
> > Would it make sense to use a different delimiter? "/" may conflict with "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first occurrence is the delimiter. But I suppose '|' could be used instead. It just doesn't look as graceful for some reason.
>
> > Do you need the "dataCategory" attribute? It seems the data category is made explicit via the reference mechanism in "its:toolRefs".
> > Also, dropping the "dataCategory" attribute then allows referring to the same tools from various data categories - e.g. OKAPI used for quality issue versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical, etc."
>
> To me that is the same as saying: you can use anything. Of course we can wrap the "anything" in a field saying "here is MT engine information". Is that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific. My prediction is that there will be no more interop than saying "in this field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>   <its:toolInfo xml:id="T1">
>     <its:toolName>Bing Translator</its:toolName>
>     <its:toolVersion>123</its:toolVersion>
>     <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>   </its:toolInfo>
>   <its:toolInfo xml:id="T2">
>     <its:toolName>myMT</its:toolName>
>     <its:toolVersion>456</its:toolVersion>
>     <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>   </its:toolInfo>
> </its:processInfo>
>
> and allow for several addInfo elements in one "toolInfo". You won't gain a lot from these, but no less than with "FR-to-EN-General" inside "toolValue" at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
> Best,
>
> Felix
>
> , etc. It seems each data category will need one or two entries that mean different things depending on the data category. We can use a common element for this, but then we need to have one tool information per data category.
>
> Maybe the examples people are working on (action items 239 to 243 for Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
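
For anyone following the its:toolRefs discussion quoted above, here is a minimal Python sketch of the "first occurrence is the delimiter" rule and of looking up the referenced toolInfo entry. It is only an illustration under stated assumptions: the function names split_tool_ref and lookup_tool are made up for this note, the markup is the discussion-stage proposal from this thread (written with well-formed closing tags) rather than a normative ITS schema, the case-insensitive match on "datacategory" is my own choice, and the sketch only uses the fragment after "#" without fetching the referenced file.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The processInfo proposal from this thread, written with well-formed closing
# tags. Discussion-stage markup only, not a normative ITS schema.
PROCESS_INFO = """\
<its:processInfo xmlns:its="http://www.w3.org/2005/11/its">
  <its:toolInfo xml:id="T1">
    <its:toolName>Bing Translator</its:toolName>
    <its:toolVersion>123</its:toolVersion>
    <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
  </its:toolInfo>
  <its:toolInfo xml:id="T2">
    <its:toolName>myMT</its:toolName>
    <its:toolVersion>456</its:toolVersion>
    <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
  </its:toolInfo>
</its:processInfo>
"""


def split_tool_ref(value):
    """Split 'dataCategory/URI' at the first '/' only (hypothetical helper)."""
    category, sep, uri = value.partition("/")
    if not (sep and category and uri):
        raise ValueError(f"not a 'dataCategory/URI' pair: {value!r}")
    return category, uri


def lookup_tool(process_info_xml, tool_ref):
    """Return details of the toolInfo whose xml:id matches the URI fragment.

    Assumption: only the fragment after '#' is used; the file part of the
    URI is not dereferenced.
    """
    category, uri = split_tool_ref(tool_ref)
    fragment = uri.partition("#")[2]
    root = ET.fromstring(process_info_xml)
    for tool in root.iter(f"{{{ITS_NS}}}toolInfo"):
        if tool.get(XML_ID) != fragment:
            continue
        # Keep only the addInfo entries meant for the referencing data category.
        add_info = [
            el.text
            for el in tool.findall(f"{{{ITS_NS}}}toolAddInfo")
            if el.get("datacategory", "").lower() == category.lower()
        ]
        return {
            "category": category,
            "name": tool.findtext(f"{{{ITS_NS}}}toolName"),
            "version": tool.findtext(f"{{{ITS_NS}}}toolVersion"),
            "add_info": add_info,
        }
    return None


if __name__ == "__main__":
    print(lookup_tool(PROCESS_INFO, "mtConfidence/file:///tools.xml#T1"))
```

Splitting with partition keeps every later "/" inside the URI, which is why the "/" characters in file:///tools.xml do not conflict with the delimiter. A different delimiter such as '|' would only change the argument passed to partition.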
Received on Tuesday, 9 October 2012 12:03:00 UTC