- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 9 Oct 2012 09:33:49 +0200
- To: public-multilingualweb-lt@w3.org
- Message-ID: <CAL58czqTLLUX=by_-SmDNNmHzmYXvzfVPMhVpd4GM3zwdpUeqQ@mail.gmail.com>
P.S. (sorry, I had missed a topic; different subject here):

2012/10/9 Felix Sasaki <fsasaki@w3.org>

> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
>> Hi Felix,
>>
>> I believe that the “processInfo” (if renamed from “toolInfo”) will not
>> overlap with provenance (although I do not think that “process” is the
>> right name – “annotatorInfo” would sound more reasonable). Provenance is
>> something that is assigned to a term (a specific concept) by an authority
>> and not by the annotation or an annotation tool/user. That is, a user
>> could mark a term, but he would not be responsible for the provenance of
>> the term, as that is assigned to the term in a term bank by someone with
>> rights to do so (or by the creator of the term). Also, provenance for
>> terms is already given in a term bank, thus we would not need to
>> standardize something that can be referenced (following your thought of
>> what can be referenced and what should be standardized). However, for
>> automated processes it can be useful to know how trustworthy an
>> annotation is. This can be done in two ways:
>> 1) follow a term bank reference and check the provenance for terms that
>> are linked to a term bank entry;
>> 2) decide, based on the annotator, how trustworthy the term might be (for
>> term candidates and terms not linked to a term bank entry).
>>
>> I hope our understanding of what provenance is in this case does not
>> differ (I am referring to term provenance)?! If by provenance you meant
>> something like the “annotation’s provenance”, then I agree that, by
>> identifying the annotator, we will also add an annotation provenance.
>> However, automated systems can benefit if the source of the content
>> annotation can be identified (or at least traced...). What are your
>> thoughts on this matter? How much do you want to ensure traceability in
>> ITS?
>
> I would like to keep the principle of disjunct data categories, and leave
> it to applications to interrelate provenance information for the content.
> Wrt traceability of ITS information, yes, I agree - that IMO would be the
> main use case for tool information. The question is whether traceability
> can be assured "only" via a URI, see
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> Mārcis, Tadej, David, ... any thoughts?
>
> Felix
>
>> About Translate, I meant the understanding from a machine user’s
>> perspective. For a machine user (MT system) 1) and 2) may be equally
>> important, and it would be good if the machine user were able to
>> distinguish the two types within a document. If I understand locNote
>> correctly, this category is not meant for machine users, but rather for
>> human translators.

I agree with your statements about locNote, and I understand the need to
distinguish the two types in a document. What you describe as 2) could be
achieved by locale filter

http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation

e.g.

<its:rules version="2.0">
  <its:localeFilterRule selector="//img" localeFilterList=""/>
</its:rules>

This expresses that all "img" elements are not part of the localization
workflow. Would that fulfil your needs?

Best,

Felix

>> Best regards,
>>
>> Mārcis ;o)
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 6:40 PM
>> *To:* Mārcis Pinnis
>> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Hi Mārcis,
>>
>> your mail did not reach the list.
>> Just FYI, I think you were subscribed to the list with
>> marcis.pinnis@Tilde.lv (with upper case "T" in tilde). I changed that to
>> marcis.pinnis@tilde.lv, so your next mail should reach the list. Some
>> comments below.
>>
>> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>>
>> Dear Felix,
>>
>> Thank you for the explanation. I see that the toolinfo can manage the
>> identification of tools. But does ITS also require users (people) to be
>> treated as tools?
>>
>> We could rename "tool" to process - and would end up with provenance. But
>> maybe that's sufficient.
>>
>> That was not clear to me. Or, does ITS specify separate tags for
>> identification of who/what added an annotation?
>>
>> No, that's exactly the point: we don't have a way to specify "who created
>> an annotation?". The purpose of "tool info" is just that. And it is - to
>> use that nice word again - "orthogonal" to the data category annotation
>> itself. That is, you want to relate it to its:term, but you don't want to
>> repeat it all the time, and you don't want to make it mandatory.
>>
>> I guess it is clear that a “termConfidence” is necessary. And the “term”
>> tag is required (the termCandidate can be omitted, as that could
>> potentially be redundant if a reference to the annotator or the authority
>> of annotation is given).
>>
>> On the Translate question, maybe you can explain a bit more why, in your
>> opinion, 1) and 2) should be combined in a general meaning? They both
>> describe data that has to be handled differently. The “Translate”
>> category, as I understand it, solves either 1) or 2) (and this depends on
>> every implementation), but not both.
>>
>> Yes, that was my point: we leave it to the implementation whether the
>> implementation wants to handle 1) or 2).
>> The main idea of ITS is to specify really atomic metadata items.
>>
>> Your requirement to differentiate 1) vs. 2) could e.g. be handled by a
>> localization note:
>>
>> <its:locNoteRule selector="//h:img" locNote="Drop this in the workflow,
>> don't give it to translator"/>
>>
>> But you are probably looking for a machine readable way to achieve this?
>>
>> Best,
>>
>> Felix
>>
>> Best regards,
>>
>> Mārcis.
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 3:58 PM
>> *To:* Mārcis Pinnis
>> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>>
>> Dear Felix,
>>
>> Having only the confidence to distinguish between an automatically
>> identified term and a user approved term is not enough, as various term
>> annotation tools can have different confidence scores (they may also be
>> in log form, depending on the implementation). Thus having a strict value
>> “1” for user approved / term-bank based terms is not enough. In an ideal
>> scenario, at least from my perspective, there should be a way to identify
>> who (a system, which system, a user, who?, an authority, which
>> authority?) annotated each term (not just at document level, but also at
>> individual term level) and what is the confidence of the respective
>> identifier given to the term candidate (or even a term).
>>
>> Understood. That might bring us to "toolinfo" again.
>> The solution that Yves mentioned at
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>>
>> would allow you to create identifiers for this complex type of
>> information.
>>
>> To make it a bit simpler: using only termConfidence to distinguish
>> between user approved or trusted terms is not enough, as the
>> termConfidence is not reliable for such purposes.
>>
>> A natural representation, in my opinion, would identify the “annotator”
>> (using categories – term bank, user, automatic tool, authority), the term
>> confidence and the ID of the “annotator” (in order to identify the
>> annotator precisely).
>>
>> Of course, for TermBank based terms there should also be a reference
>> pointer so that more information could be identified.
>>
>> Understood - the question mainly is: what needs to be standardized, and
>> what could be a URI to that complex information.
>>
>> Actually ... one question that is *out of topic* here ...
>> I tried following your discussions about the MT related “Translate” data
>> category and a question arose: do you distinguish between something that:
>>
>> 1) has to be passed through a translation system, but should not be
>> translated (should be kept as is, but is helpful for disambiguation of
>> the translatable parts);
>>
>> 2) has to be completely ignored and not even passed through a translation
>> system (for instance, numbers in tables, encrypted images within HTML5,
>> etc.).
>>
>> From what I have understood (maybe I did not get the full picture) – the
>> “Translate” tag is meant only for an MT system to tell it that something
>> has to be kept as is, but some parts could be irrelevant to send through
>> the MT systems, but that is not solved by the Translate tag.
>>
>> "Translate" in fact is very general and doesn't distinguish between 1)
>> and 2). E.g. IIRC, in Okapi it is used also to create pseudo translated
>> text.
>>
>> Best,
>>
>> Felix
>>
>> Best regards,
>>
>> Mārcis Pinnis
>> Researcher
>> Tilde
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 2:54 PM
>> *To:* Tatiana Gornostay
>> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Dear Tatiana, all,
>>
>> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>>
>> Dear Felix, Yves, Dear All,
>>
>> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in
>> mind the following potential attributes proposed by Tilde in view of
>> terminology use case, I mean, *its-termInfoRef*, *its-termCandidate*, and
>> *its-termConfidence* and their values.
>>
>> Would it also work to just add "termConfidence" to
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>>
>> we then could say: if something is a term, then the confidence is 1,
>> that is
>>
>> <span its:term="yes" its:termInfoRef="...">...</span>
>> (ITS 1.0 or ITS 2.0)
>>
>> is equal to
>>
>> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span>
>> (ITS 2.0)
>>
>> and a term candidate would be
>>
>> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span>
>> (ITS 2.0)
>>
>> Felix
>>
>> These are not represented in the current draft, and if we go this way
>> then we will have to discuss and, probably, add them. I can remember that
>> Tadej raised this question in Prague and we did not talk about it,
>> unfortunately. On the other hand, as soon as we start the project we will
>> have opportunity and time to do it, and my colleagues will also join the
>> discussion.
>>
>> With best wishes,
>>
>> Tatiana
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Wednesday, October 03, 2012 12:29 AM
>> *To:* Yves Savourel
>> *Cc:* public-multilingualweb-lt@w3.org
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Hi Yves, all,
>>
>> no opinion on my side on the delimiter topic, sorry for bringing it up. A
>> comment on the tool specific aspect below.
>>
>> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>>
>> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
>> > xmlns:its="http://www.w3.org/2005/11/its">
>>
>> > Would it make sense to use a different delimiter? "/" may conflict with
>> > "/" in paths.
>>
>> Hmm... almost any ASCII delimiter may also be in the path. The first
>> occurrence is the delimiter.
>> But I suppose '|' could be used instead.
>> It just doesn't look as graceful for some reason.
>>
>> > Do you need the "dataCategory" attribute? It seems the
>> > data category is made explicit via the reference mechanism in
>> > "its:toolRefs".
>> > Also, dropping the "dataCategory" attribute then allows referring to
>> > the same tools from various data categories - e.g. OKAPI used for
>> > quality issue versus for creating translation metadata etc.
>>
>> I'm not sure we can go from many data category instances to one tool
>> information. And this is where I'm having trouble with tool information:
>>
>> The mtConfidence needs to have a defined way to specify the engine used
>>
>> Is there really a defined way? The current version of the draft at
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>>
>> says:
>>
>> "Some examples of values are:
>>
>> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
>> Japanese MT engine
>>
>> A Domain as per the Section 6.9: Domain
>>
>> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
>> etc."
>>
>> To me that is the same as saying: you can use anything. Of course we can
>> wrap the "anything" in a field saying "here is MT engine information". Is
>> that what you mean?
>>
>> , the Text analysis may need something else
>>
>> I actually doubt that the text analysis "anything" will be more specific.
>> My prediction is that there will be no more interop than saying "in this
>> field there is data category specific information: ...".
>>
>> So you could achieve that by changing your proposal like this
>>
>> <its:processInfo>
>>   <its:toolInfo xml:id="T1">
>>     <its:toolName>Bing Translator</its:toolName>
>>     <its:toolVersion>123</its:toolVersion>
>>     <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>>   </its:toolInfo>
>>   <its:toolInfo xml:id="T2">
>>     <its:toolName>myMT</its:toolName>
>>     <its:toolVersion>456</its:toolVersion>
>>     <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>>   </its:toolInfo>
>> </its:processInfo>
>>
>> and allow for several addInfo elements in one "toolInfo". You won't gain
>> a lot from these, but not less than with "FR-to-EN-General" inside
>> "toolValue" at
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>>
>> Best,
>>
>> Felix
>>
>> , etc. It seems each data category will need one or two entries that mean
>> different things depending on the data category. We can use a common
>> element for this, but then we need to have one tool information per data
>> category.
>>
>> Maybe the examples people are working on (action items 239 to 243 for
>> Arle, Phil, Declan and Tadej) will help in defining this.
>>
>> Cheers
>> -yves
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>
> --
> Felix Sasaki
> DFKI / W3C Fellow

--
Felix Sasaki
DFKI / W3C Fellow
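
A note on the delimiter question quoted in the thread: the rule that "the first occurrence is the delimiter" can be sketched in a few lines of Python. This is only an illustration under assumptions, not something defined in the ITS 2.0 draft: the function name split_tool_ref is hypothetical, and the "category, delimiter, reference" shape of an its:toolRefs entry is taken solely from the example value quoted in the thread.

```python
from typing import Tuple


def split_tool_ref(entry: str, delimiter: str = "/") -> Tuple[str, str]:
    """Split one toolRefs entry into (data category, tool reference).

    Hypothetical helper: only the FIRST occurrence of the delimiter
    separates the two parts, so later occurrences inside the reference
    (for example the slashes of a file: URI) are left untouched.
    """
    category, found, reference = entry.partition(delimiter)
    if not found or not category or not reference:
        raise ValueError(
            f"not a 'category{delimiter}reference' entry: {entry!r}"
        )
    return category, reference


if __name__ == "__main__":
    # Value taken from the example quoted in the thread.
    print(split_tool_ref("mtConfidence/file:///tools.xml#T1"))
    # Same rule with the alternative '|' delimiter that was suggested.
    print(split_tool_ref("mtConfidence|file:///tools.xml#T1", delimiter="|"))
```

Both calls give the same pair, the data category name and the unmodified reference, which suggests that the choice between "/" and "|" is mostly about readability as long as the data category name itself never contains the delimiter.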
Received on Tuesday, 9 October 2012 07:34:18 UTC