"Translate" data category (Re: [ISSUE-42] Wording for the tool information markup) from Felix Sasaki on 2012-10-09 (public-multilingualweb-lt@w3.org from October 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 9 Oct 2012 09:33:49 +0200
To: public-multilingualweb-lt@w3.org
Message-ID: <CAL58czqTLLUX=by_-SmDNNmHzmYXvzfVPMhVpd4GM3zwdpUeqQ@mail.gmail.com>
P.S. (sorry, had missed a topic, different subject here):

2012/10/9 Felix Sasaki <fsasaki@w3.org>

> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
>> Hi Felix,
>>
>> I believe that the “processInfo” (if renamed from “toolInfo”) will not
>> overlap with provenance (although, I do not think that process is the right
>> name – annotatorInfo would sound more reasonable). Provenance is something
>> that is assigned to a term (a specific concept) by an authority and not the
>> annotation or an annotation tool/user. That is, a user could mark a term,
>> but he would not be responsible for the provenance of the term as that is
>> assigned to the term in a term bank by someone with rights to do so (or the
>> creator of the term). Also, provenance for terms is already given in a term
>> bank, thus we would not need to standardize something that can be
>> referenced to (following your thought of what can be referenced and what
>> should be standardized). However, for automated processes it can be useful
>> to know, how trustworthy an annotation is. This can be done in two ways –
>> 1) follow a term bank reference and check the provenance for terms that are
>> linked to a term bank entry; 2) decide based on the annotator, how
>> trustworthy the term might be (for term candidates and terms not linked to
>> a term bank entry).
>>
>> I hope our understanding of what provenance means in this case does not
>> differ (I am referring to term provenance). If by provenance you meant
>> something like the “annotation’s provenance”, then I agree that,
>> by identifying the annotator, we will also add an annotation provenance.
>> However, automated systems can benefit if the source of the content
>> annotation can be identified (or at least traced...). What are your
>> thoughts on this matter? How much do you want to ensure traceability in ITS?
>>
>
>
> I would like to keep the principle of disjunct data categories, and leave
> it to applications to interrelate provenance information for the content.
> With respect to traceability of ITS information, yes, I agree - that IMO
> would be the main use case for tool information. On the question of whether
> traceability can be assured "only" via a URI, see
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
>  Mārcis, Tadej, David,  ... any thoughts?
>
> Felix
>
>>
>> About Translate, I meant the understanding from a machine user’s
>> perspective. For a machine user (MT system) 1) and 2) may be equally
>> important and it would be good if the machine user would be able to
>> distinguish the two types within a document. If I understand locNote
>> correctly, this category is not meant for machine users, but rather human
>> translators.
>>
>

I agree with your statements about locNote, and I understand the need to
distinguish the two types in a document. What you describe as 2) could be
achieved by locale filter
http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#LocaleFilter-implementation
e.g.
<its:rules version="2.0"> <its:localeFilterRule selector="//img"
localeFilterList=""/> </its:rules>
This expresses that all "img" elements are not part of the localization
workflow. Would that fulfil your needs?
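
To make the idea concrete, here is a minimal sketch of how a consumer might apply such a rule. It is only an illustration, not the processing defined in the draft: the function name excluded_elements and the sample HTML are made up, an empty localeFilterList is read as "not part of the workflow for any locale" as described above, and Python's ElementTree supports only a subset of XPath, so "//img" is rewritten to ".//img".

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def excluded_elements(doc_root, rules_root):
    """Return the elements selected by rules whose localeFilterList is empty.

    Assumption for this sketch: an empty list means the selected content
    is dropped from the localization workflow for every locale.
    """
    excluded = []
    for rule in rules_root.iter(f"{{{ITS_NS}}}localeFilterRule"):
        if rule.get("localeFilterList", "").strip() != "":
            continue  # rule names specific locales; not handled here
        selector = rule.get("selector")
        # ElementTree cannot evaluate "//img" on an element; use ".//img".
        path = "." + selector if selector.startswith("//") else selector
        excluded.extend(doc_root.findall(path))
    return excluded


rules = ET.fromstring(
    '<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">'
    '<its:localeFilterRule selector="//img" localeFilterList=""/>'
    "</its:rules>"
)
# Hypothetical document, used only to exercise the rule.
doc = ET.fromstring(
    '<html><body><p>Hello <img src="a.png"/> world</p>'
    '<img src="b.png"/></body></html>'
)
dropped = excluded_elements(doc, rules)
print([e.get("src") for e in dropped])  # ['a.png', 'b.png']
```

A real implementation would need a full XPath engine and the complete rule precedence of ITS; the point here is only that the exclusion is machine readable, unlike a localization note.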

Best,

Felix



>> Best regards,
>>
>> Mārcis ;o)
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 6:40 PM
>>
>> *To:* Mārcis Pinnis
>> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Hi Mārcis,
>>
>> your mail did not reach the list. Just FYI, I think you were subscribed
>> to the list as marcis.pinnis@Tilde.lv (with upper case "T" in "Tilde"), so
>> you would have needed to send from that address. I changed that to
>> marcis.pinnis@tilde.lv, so your next mail should reach the list. Some
>> comments below.
>>
>> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>>
>> Dear Felix,
>>
>> Thank you for the explanation. I see that the toolinfo can manage the
>> identification of tools. But does ITS also require users (people) to be
>> treated as tools?
>>
>> We could rename "tool" to process - and would end up with provenance. But
>> maybe that's sufficient.
>>
>> That was not clear to me. Or, does ITS specify separate tags for
>> identification of who/what added an annotation?
>>
>> No, that's exactly the point: we don't have a way to specify "who created
>> an annotation?". The purpose of "tool info" is just that. And it is - to
>> use that nice word again - "orthogonal" to the data category annotation
>> itself. That is, you want to relate it to its:term, but you don't want to
>> repeat it all the time, and you don't want to make it mandatory.
>>
>> I guess, it is clear that a “termConfidence” is necessary. And the “term”
>> tag is required (the termCandidate can be omitted as that could potentially
>> be redundant if a reference of the annotator or the authority of annotation
>> is given).
>>
>> On the Translate question maybe you can explain a bit more why, in your
>> opinion, the 1) and 2) should be combined in a general meaning? They both
>> describe data that has to be handled differently. The “Translate” category
>> as I understand solves either 1) or 2) (and this depends on every
>> implementation), but not both.
>>
>> Yes, that was my point: we leave it to the implementation whether the
>> implementation wants to handle 1) or 2). The main idea of ITS is to specify
>> really atomic metadata items.
>>
>> Your requirement to differentiate 1) vs. 2) could e.g. be handled by a
>> localization note:
>>
>> <its:locNoteRule selector="//h:img" locNote="Drop this in the workflow,
>> don't give it to translator"/>
>>
>> But you are probably looking for a machine readable way to achieve this?
>>
>> Best,
>>
>> Felix
>>
>> Best regards,
>>
>> Mārcis.
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 3:58 PM
>> *To:* Mārcis Pinnis
>> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>>
>> Dear Felix,
>>
>> Having only the confidence distinguishing between an automatically
>> identified term and a user approved term is not enough as various term
>> annotation tools can have different confidence scores (they may be also in
>> log form depending on the implementation). Thus having a strict value “1”
>> for user approved/ term-bank based terms is not enough. In an ideal
>> scenario, at least from my perspective, there should be a way to identify
>> who (a system, which system? a user, who? an authority, which authority?)
>> annotated each term (not just at document level, but also at individual
>> term level) and what is the confidence of the respective identifier given
>> to the term candidate (or even a term).
>>
>> Understand. That might bring us to "toolinfo" again. The solution that
>> Yves mentioned at
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>>
>> would allow you to create identifiers for this complex type of
>> information.
>>
>> To make it a bit more simple, using only termConfidence to distinguish
>> between user approved or trusted terms is not enough as the termConfidence
>> is not reliable for such purposes.
>>
>> A natural representation, in my opinion, would identify the “annotator”
>> (using categories – term bank, user, automatic tool, authority), the term
>> confidence and the ID of the “annotator” (in order to identify the
>> annotator precisely).
>>
>> Of course, for TermBank based terms there should be also a reference
>> pointer so that more information could be identified.
>>
>> Understand - the question mainly is: what needs to be standardized, and
>> what could be a URI to that complex information.
>>
>> Actually ... one question that is *off topic* here ... I tried
>> following your discussions about the MT related “Translate” data category
>> and a question arose: do you distinguish between something that:
>>
>> 1)      has to be passed through a translation system, but should not be
>> translated (should be kept as is, but is helpful for disambiguation of the
>> translatable parts);
>>
>> 2)      has to be completely ignored and not even passed through a
>> translation system (for instance, numbers in tables, encrypted images
>> within HTML5, etc.).
>>
>> From what I have understood (maybe I did not get the full picture) – the
>> “Translate” tag is meant only for an MT system to tell it that something
>> has to be kept as is, but some parts could be irrelevant to send through
>> the MT systems, and that is not solved by the Translate tag.
>>
>> "Translate" in fact is very general and doesn't distinguish between 1)
>> and 2). E.g. IIRC, in Okapi it is used also to create pseudo translated
>> text.
>>
>> Best,
>>
>> Felix
>>
>> Best regards,
>>
>> Mārcis Pinnis
>> Researcher
>> Tilde
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Thursday, October 04, 2012 2:54 PM
>> *To:* Tatiana Gornostay
>> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
>> Raivis Skadiņš; Andrejs Vasiļjevs
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Dear Tatiana, all,
>>
>> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>>
>> Dear Felix, Yves, Dear All,
>>
>> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have
>> in mind the following potential attributes proposed by Tilde in view of
>> the terminology use case, namely *its-termInfoRef*, *its-termCandidate*,
>> and *its-termConfidence* and their values.
>>
>> Would it also work to just add "termConfidence" to
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>>
>> we then could say: if something is a term, then the confidence is 1, that is
>>
>> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>>
>> is equal to
>>
>> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span>
>> (ITS 2.0)
>>
>> and a term candidate would be
>>
>> <span its:term="yes" its:termInfoRef="..."
>> termConfidence="0.9">...</span> (ITS 2.0)
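
The equivalence proposed above (a term without an explicit confidence counts as confidence 1) could be read by a consumer roughly as follows. This is a sketch under assumptions: the helper name term_confidence is invented, the termInfoRef values "#t1" and "#t2" are placeholders, and termConfidence is read as an unprefixed attribute exactly as written in the examples above.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def term_confidence(element):
    """Return None for non-terms, else the confidence (default 1.0)."""
    if element.get(f"{{{ITS_NS}}}term") != "yes":
        return None
    # Proposed default: a term without termConfidence has confidence 1.
    return float(element.get("termConfidence", "1"))


# Hypothetical sample; the termInfoRef targets are placeholders.
sample = ET.fromstring(
    '<p xmlns:its="http://www.w3.org/2005/11/its">'
    '<span its:term="yes" its:termInfoRef="#t1">approved term</span>'
    '<span its:term="yes" its:termInfoRef="#t2" termConfidence="0.9">'
    "term candidate</span>"
    "<span>plain text</span>"
    "</p>"
)
scores = [term_confidence(span) for span in sample.findall("span")]
print(scores)  # [1.0, 0.9, None]
```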
>>
>> Felix
>>
>> These are not represented in the current draft and if we go this way
>> then we will have to discuss and, probably, add them. I can remember that
>> Tadej raised this question in Prague and we did not talk about it,
>> unfortunately. On the other hand, as soon as we start the project we will
>> have the opportunity and time to do it and my colleagues will also join the
>> discussion.
>>
>> With best wishes,
>>
>> Tatiana
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Wednesday, October 03, 2012 12:29 AM
>> *To:* Yves Savourel
>> *Cc:* public-multilingualweb-lt@w3.org
>> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>>
>> Hi Yves, all,
>>
>> no opinion on my side on the delimiter topic, sorry for bringing it up. A
>> comment on the tool specific aspect below.
>>
>> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>>
>> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
>> > xmlns:its="http://www.w3.org/2005/11/its">
>> >
>> > Would it make sense to use a different delimiter? "/" may conflict with
>> > "/" in paths.
>>
>> Hmm... almost any ASCII delimiter may also be in the path. The first
>> occurrence is the delimiter.
>> But I suppose '|' could be used instead. It just doesn't look as graceful
>> for some reason.
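
The "first occurrence is the delimiter" reading can be sketched in a few lines. The function name parse_tool_ref is hypothetical, and the value format (data category, delimiter, URI) is taken from the its:toolRefs example quoted above rather than from any agreed syntax.

```python
def parse_tool_ref(value, delimiter="/"):
    """Split a toolRefs-style value at the FIRST occurrence of the delimiter.

    Everything after the first delimiter is kept intact, so further "/"
    characters inside the URI do not matter.
    """
    data_category, sep, reference = value.partition(delimiter)
    if not sep or not data_category or not reference:
        raise ValueError(
            f"expected '<dataCategory>{delimiter}<URI>', got {value!r}"
        )
    return data_category, reference


category, reference = parse_tool_ref("mtConfidence/file:///tools.xml#T1")
print(category)   # mtConfidence
print(reference)  # file:///tools.xml#T1

# The same value with '|' as the delimiter parses identically.
alt = parse_tool_ref("mtConfidence|file:///tools.xml#T1", delimiter="|")
```

Either delimiter works as long as the data category name itself never contains it; the choice is mostly about readability.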
>>
>>
>>
>> > Do you need the "dataCategory" attribute? It seems the
>> > data category is made explicit via the reference mechanism in
>> > "its:toolRefs".
>> > Also, dropping the "dataCategory" attribute allows then to refer to
>> > the same tools from various data categories - e.g. OKAPI used for
>> > quality issue versus for creating translation metadata etc.
>>
>> I'm not sure we can go from many data category instances to one tool
>> information. And this is where I'm having trouble with tool information:
>>
>> The mtConfidence needs to have a defined way to specify the engine used
>>
>> Is there really a defined way? The current version of the draft at
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>>
>> says:
>>
>> "Some examples of values are:
>>
>> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
>> Japanese MT engine
>>
>> A Domain as per the Section 6.9: Domain
>>
>> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
>> etc."
>>
>> To me that is the same as saying: you can use anything. Of course we can
>> wrap the "anything" in a field saying "here is MT engine information". Is
>> that what you mean?
>>
>> , the Text analysis may need something else
>>
>> I actually doubt that the text analysis "anything" will be more specific.
>> My prediction is that there will be no more interop than saying "in this
>> field there is data category specific information: ...".
>>
>> So you could achieve that by changing your proposal like this
>>
>> <its:processInfo>
>>
>>  <its:toolInfo xml:id="T1">
>>   <its:toolName>Bing Translator</its:toolName>
>>   <its:toolVersion>123</its:toolVersion>
>>   <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>>  </its:toolInfo>
>>
>>  <its:toolInfo xml:id="T2">
>>   <its:toolName>myMT</its:toolName>
>>   <its:toolVersion>456</its:toolVersion>
>>   <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>>  </its:toolInfo>
>>
>> </its:processInfo>
>>
>> and allow for several addInfo elements in one "toolInfo". You won't gain
>> a lot from these, but no less than with "FR-to-EN-General" inside
>> "toolValue" at
>>
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
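
For illustration only, this is roughly how a consumer could read the proposed markup, with the closing tags written out and several toolAddInfo elements allowed per toolInfo. None of these element names are in the draft; they are the ones proposed in this thread, and the function name load_tools and the resulting dictionary layout are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET

ITS = "{http://www.w3.org/2005/11/its}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The proposal from this thread, made well-formed for parsing.
PROCESS_INFO = """
<its:processInfo xmlns:its="http://www.w3.org/2005/11/its">
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>
"""


def load_tools(xml_text):
    """Map each toolInfo xml:id to its name, version and add-info values."""
    tools = {}
    for tool in ET.fromstring(xml_text).iter(ITS + "toolInfo"):
        add_info = {}
        # Several toolAddInfo elements may share one toolInfo.
        for info in tool.findall(ITS + "toolAddInfo"):
            add_info.setdefault(info.get("datacategory"), []).append(info.text)
        tools[tool.get(XML_ID)] = {
            "name": tool.findtext(ITS + "toolName"),
            "version": tool.findtext(ITS + "toolVersion"),
            "add_info": add_info,
        }
    return tools


tools = load_tools(PROCESS_INFO)
print(tools["T1"]["name"])      # Bing Translator
print(tools["T2"]["add_info"])  # {'mtconfidence': ['Domain:IT-Pair:IT-JA']}
```

The add-info values stay opaque strings here, which matches the point made above: the field only says "data category specific information follows".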
>>
>> Best,
>>
>> Felix
>>
>> , etc. It seems each data category will need one or two entries that mean
>> different things depending on the data category. We can use a common
>> element for this, but then we need to have one tool information per data
>> category.
>>
>> Maybe the examples people are working on (action items 239 to 243 for
>> Arle, Phil, Declan and Tadej) will help in defining this.
>>
>> Cheers
>> -yves
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>
>
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Tuesday, 9 October 2012 07:34:18 UTC