Re: [ISSUE-42] Wording for the tool information markup from Tadej Štajner on 2012-10-09 (public-multilingualweb-lt@w3.org from October 2012)

From: Tadej Štajner <tadej.stajner@ijs.si>
Date: Tue, 09 Oct 2012 14:01:40 +0200
To: Felix Sasaki <fsasaki@w3.org>
CC: Mārcis Pinnis <marcis.pinnis@tilde.lv>, Tatiana Gornostay <tatiana.gornostay@tilde.lv>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Raivis Skadiņš <Raivis.Skadins@tilde.lv>, Andrejs Vasiļjevs <Andrejs@tilde.lv>
Message-ID: <50741224.6060108@ijs.si>
Hi, all,
(reply inline)

On 09. 10. 2012 09:15, Felix Sasaki wrote:
> Hi Mārcis,
>
> 2012/10/8 Mārcis Pinnis <marcis.pinnis@tilde.lv 
> <mailto:marcis.pinnis@tilde.lv>>
>
>     Hi Felix,
>
>     I believe that “processInfo” (if renamed from “toolInfo”) will
>     not overlap with provenance (although I do not think that “process”
>     is the right name; “annotatorInfo” would sound more reasonable).
>     Provenance is something that is assigned to a term (a specific
>     concept) by an authority, not by the annotation or an annotation
>     tool/user. That is, a user could mark a term, but he would not be
>     responsible for the provenance of the term, as that is assigned to
>     the term in a term bank by someone with the rights to do so (or by
>     the creator of the term). Also, provenance for terms is already given
>     in a term bank, so we would not need to standardize something
>     that can be referenced (following your thought on what can be
>     referenced and what should be standardized). However, for
>     automated processes it can be useful to know how trustworthy an
>     annotation is. This can be done in two ways: 1) follow a term
>     bank reference and check the provenance for terms that are linked
>     to a term bank entry; 2) decide, based on the annotator, how
>     trustworthy the term might be (for term candidates and terms not
>     linked to a term bank entry).
>
>     I hope our understanding of what provenance means in this case does
>     not differ (I am referring to term provenance). If by
>     provenance you meant something like the “annotation’s provenance”,
>     then I agree that, by identifying the annotator, we will also add
>     an annotation provenance. However, automated systems can benefit
>     if the source of the content annotation can be identified (or at
>     least traced...). What are your thoughts on this matter? How much
>     do you want to ensure traceability in ITS?
>
>
>
> I would like to keep the principle of disjunct data categories, and 
> leave it to applications to interrelate provenance information for the 
> content. W.r.t. traceability of ITS information, yes, I agree - that 
> IMO would be the main use case for tool information. On the question of 
> whether traceability can be assured "only" via a URI, see
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
>  Mārcis, Tadej, David,  ... any thoughts?
>

As I understand it, we're dealing with:
1) provenance of the term itself
2) provenance of an instance annotation of the term in some text

1) is probably out of scope; 2) is something that we'd cover with the 
toolInfo/processInfo attribute. Maybe 1) is also interesting in some 
cases, but I would speculate that it's rarely something I'd want to 
inline in a document with an annotation.

Also, would 'agent' be a clearer term for 'tool info' or 'process info'?
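
To make 2) a bit more concrete: below is a minimal Python sketch (not
taken from any draft) of how a consumer could read the data category and
the tool reference out of the its:toolRefs value from Yves's example
quoted further down. It assumes that the first "/" is the delimiter, as
suggested in that discussion; the function name split_tool_ref is made up
for illustration only.

```python
from typing import Tuple


def split_tool_ref(value: str) -> Tuple[str, str]:
    """Split a 'dataCategory/URI' string at the first '/' only.

    Assumption: the first '/' separates the data category from the
    reference; any later '/' belongs to the URI itself.
    """
    data_category, sep, tool_uri = value.partition("/")
    if not sep or not data_category or not tool_uri:
        raise ValueError("expected 'dataCategory/URI', got %r" % value)
    return data_category, tool_uri


# Value taken from the its:toolRefs example quoted in this thread.
category, uri = split_tool_ref("mtConfidence/file:///tools.xml#T1")
print(category)  # mtConfidence
print(uri)       # file:///tools.xml#T1
```

Because only the first "/" is treated as the delimiter, the slashes in
the file:// reference are left alone, which is the behaviour described
in the quoted delimiter discussion.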

-- Tadej


> Felix
>
>     About Translate, I meant the understanding from a machine user’s
>     perspective. For a machine user (MT system) 1) and 2) may be
>     equally important and it would be good if the machine user would
>     be able to distinguish the two types within a document. If I
>     understand locNote correctly, this category is not meant for
>     machine users, but rather human translators.
>
>     Best regards,
>
>     Mārcis ;o)
>
>     *From:*Felix Sasaki [mailto:fsasaki@w3.org <mailto:fsasaki@w3.org>]
>     *Sent:* Thursday, October 04, 2012 6:40 PM
>
>
>     *To:* Mārcis Pinnis
>     *Cc:* Tatiana Gornostay; Yves Savourel;
>     public-multilingualweb-lt@w3.org
>     <mailto:public-multilingualweb-lt@w3.org>; Raivis Skadiņš; Andrejs
>     Vasiļjevs
>     *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
>     Hi Mārcis,
>
>     your mail did not reach the list. Just FYI, I think you were
>     subscribed to the list with marcis.pinnis@Tilde.lv (with upper
>     case "T" in "Tilde"), so mail had to be sent from exactly that
>     address. I changed that to marcis.pinnis@tilde.lv, so your next
>     mail should reach the list. Some comments below.
>
>     2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv
>     <mailto:marcis.pinnis@tilde.lv>>
>
>     Dear Felix,
>
>     Thank you for the explanation. I see that the toolinfo can manage
>     the identification of tools. But does ITS also require users
>     (people) to be treated as tools?
>
>     We could rename "tool" to process - and would end up with
>     provenance. But maybe that's sufficient.
>
>         That was not clear to me. Or, does ITS specify separate tags
>         for identification of who/what added an annotation?
>
>     No, that's exactly the point: we don't have a way to specify "who
>     created an annotation?". The purpose of "tool info" is just that.
>     And it is - to use that nice word again - "orthogonal" to the data
>     category annotation itself. That is, you want to relate it to
>     its:term, but you don't want to repeat it all the time, and you
>     don't want to make it mandatory.
>
>         I guess it is clear that a “termConfidence” is necessary. And
>         the “term” tag is required (the termCandidate can be omitted,
>         as that could potentially be redundant if a reference to the
>         annotator or the authority of the annotation is given).
>
>         On the Translate question maybe you can explain a bit more
>         why, in your opinion, the 1) and 2) should be combined in a
>         general meaning? They both describe data that has to be
>         handled differently. The “Translate” category as I understand
>         solves either 1) or 2) (and this depends on every
>         implementation), but not both.
>
>     Yes, that was my point: we leave it to the implementation whether
>     the implementation wants to handle 1) or 2). The main idea of ITS
>     is to specify really atomic metadata items.
>
>     Your requirement to differentiate 1) vs. 2) could e.g. be handled
>     by a localization note:
>
>     <its:locNoteRule selector="//h:img" locNote="Drop this in the
>     workflow, don't give it to translator"/>
>
>     But you are probably looking for a machine readable way to achieve
>     this?
>
>     Best,
>
>     Felix
>
>         Best regards,
>
>         Mārcis.
>
>         *From:*Felix Sasaki [mailto:fsasaki@w3.org
>         <mailto:fsasaki@w3.org>]
>         *Sent:* Thursday, October 04, 2012 3:58 PM
>         *To:* Mārcis Pinnis
>         *Cc:* Tatiana Gornostay; Yves Savourel;
>         public-multilingualweb-lt@w3.org
>         <mailto:public-multilingualweb-lt@w3.org>; Raivis Skadiņš;
>         Andrejs Vasiļjevs
>
>
>         *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
>         2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv
>         <mailto:marcis.pinnis@tilde.lv>>
>
>         Dear Felix,
>
>         Having only the confidence to distinguish between an
>         automatically identified term and a user-approved term is not
>         enough, as various term annotation tools can have different
>         confidence scores (they may also be in log form, depending on
>         the implementation). Thus having a strict value “1” for
>         user-approved / term-bank-based terms is not enough. In an ideal
>         scenario, at least from my perspective, there should be a way
>         to identify who (a system, which system? a user, who? an
>         authority, which authority?) annotated each term (not just at
>         document level, but also at individual term level) and what
>         the confidence of the respective identifier given to the term
>         candidate (or even a term) is.
>
>         Understand. That might bring us to "toolinfo" again. The
>         solution that Yves mentioned at
>
>         http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
>         would allow you to create identifiers for this complex type of
>         information.
>
>             To make it a bit more simple, using only termConfidence to
>             distinguish between user approved or trusted terms is not
>             enough as the termConfidence is not reliable for such
>             purposes.
>
>             A natural representation, in my opinion, would identify
>             the “annotator” (using categories – term bank, user,
>             automatic tool, authority), the term confidence and the ID
>             of the “annotator” (in order to identify the annotator
>             precisely).
>
>             Of course, for TermBank based terms there should be also a
>             reference pointer so that more information could be
>             identified.
>
>         Understand - the question mainly is: what needs to be
>         standardized, and what could be a URI to that complex information.
>
>             Actually ... one question that is *off topic* here ... I
>             tried following your discussions about the MT related
>             “Translate” data category and a question arose: do you
>             distinguish between something that:
>
>             1) has to be passed through a translation system, but
>             should not be translated (should be kept as is, but is
>             helpful for disambiguation of the translatable parts);
>
>             2) has to be completely ignored and not even passed through
>             a translation system (for instance, numbers in tables,
>             encrypted images within HTML5, etc.).
>
>             From what I have understood (maybe I did not get the full
>             picture) – the “Translate” tag is meant only for an MT
>             system to tell it that something has to be kept as is, but
>             some parts could be irrelevant to send through the MT
>             systems, but that is not solved by the Translate tag.
>
>         "Translate" in fact is very general and doesn't distinguish
>         between 1) and 2). E.g. IIRC, in Okapi it is used also to
>         create pseudo translated text.
>
>         Best,
>
>
>         Felix
>
>             Best regards,
>
>             Mārcis Pinnis
>
>             Researcher
>
>             Tilde
>
>             *From:*Felix Sasaki [mailto:fsasaki@w3.org
>             <mailto:fsasaki@w3.org>]
>             *Sent:* Thursday, October 04, 2012 2:54 PM
>             *To:* Tatiana Gornostay
>             *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org
>             <mailto:public-multilingualweb-lt@w3.org>; Mārcis Pinnis;
>             Raivis Skadiņš; Andrejs Vasiļjevs
>
>
>             *Subject:* Re: [ISSUE-42] Wording for the tool information
>             markup
>
>             Dear Tatiana, all,
>
>             2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv
>             <mailto:tatiana.gornostay@tilde.lv>>
>
>             Dear Felix, Yves, Dear All,
>
>             W.r.t. the ongoing discussion on /toolInfo/ and
>             /mtConfidence/, I have in mind the following potential
>             attributes proposed by Tilde in view of terminology use
>             case, I mean, /its-termInfoRef/, /its-termCandidate/, and
>             /its-termConfidence/ and their values.
>
>             Would it also work to just add "termConfidence" to
>
>             http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
>             we could then say: if something is a term, then the
>             confidence is 1, that is
>
>             <span its:term="yes" its:termInfoRef="...">...</span> (ITS
>             1.0 or ITS 2.0)
>
>             is equal to
>
>             <span its:term="yes" its:termInfoRef="..."
>             termConfidence="1">...</span> (ITS 2.0)
>
>             and a term candidate would be
>
>             <span its:term="yes" its:termInfoRef="..."
>             termConfidence="0.9">...</span> (ITS 2.0)
>
>             Felix
>
>                 These are not represented in the current draft, and if
>                 we go this way then we will have to discuss and,
>                 probably, add them. I remember that Tadej raised
>                 this question in Prague and we did not talk about it,
>                 unfortunately. On the other hand, as soon as we start
>                 the project we will have the opportunity and time to do
>                 it, and my colleagues will also join the discussion.
>
>                 With best wishes,
>
>                 Tatiana
>
>                 *From:*Felix Sasaki [mailto:fsasaki@w3.org
>                 <mailto:fsasaki@w3.org>]
>                 *Sent:* Wednesday, October 03, 2012 12:29 AM
>                 *To:* Yves Savourel
>                 *Cc:* public-multilingualweb-lt@w3.org
>                 <mailto:public-multilingualweb-lt@w3.org>
>
>
>                 *Subject:* Re: [ISSUE-42] Wording for the tool
>                 information markup
>
>                 Hi Yves, all,
>
>                 no opinion on my side on the delimiter topic, sorry
>                 for bringing it up. A comment on the tool specific
>                 aspect below.
>
>                 2012/10/2 Yves Savourel <ysavourel@enlaso.com
>                 <mailto:ysavourel@enlaso.com>>
>
>                 > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
>                 > xmlns:its="http://www.w3.org/2005/11/its">
>                 >
>
>                 > Would it make sense to use a different delimiter? "/" may conflict with
>                 > "/" in paths.
>
>                 Hmm... almost any ASCII delimiter may also be in the
>                 path. The first occurrence is the delimiter.
>                 But I suppose '|' could be used instead. It just
>                 doesn't look as graceful for some reason.
>
>
>
>                 > Do you need the "dataCategory" attribute? It seems the
>                 > data category is made explicit via the reference
>                 > mechanism in "its:toolRefs".
>                 > Also, dropping the "dataCategory" attribute then allows
>                 > referring to the same tools from various data categories,
>                 > e.g. OKAPI used for quality issues versus for creating
>                 > translation metadata etc.
>
>                 I'm not sure we can go from many data category
>                 instances to one tool information. And this is where
>                 I'm having trouble with tool information:
>
>                 The mtConfidence needs to have a defined way to specify
>                 the engine used
>
>                 Is there really a defined way? The current version of
>                 the draft at
>
>                 http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
>                 says:
>
>                 "Some examples of values are:
>
>                 A BCP 47 language tag with t-extension, e.g. ja-t-it
>                 for an Italian to Japanese MT engine
>
>                 A Domain as per the Section 6.9: Domain
>
>                 A privately structured string, eg.
>                 Domain:IT-Pair:IT-JA, IT-JA:Medical, etc."
>
>                 To me that is the same as saying: you can use
>                 anything. Of course we can wrap the "anything" in a
>                 field saying "here is MT engine information". Is that
>                 what you mean?
>
>                     , the Text analysis may need something else
>
>                 I actually doubt that the text analysis "anything"
>                 will be more specific. My prediction is that there
>                 will be no more interop than saying "in this field
>                 there is data category specific information: ...".
>
>                 So you could achieve that by changing your proposal
>                 like this
>
>                 <its:processInfo>
>                   <its:toolInfo xml:id="T1">
>                    <its:toolName>Bing Translator</its:toolName>
>                    <its:toolVersion>123</its:toolVersion>
>                    <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>                   </its:toolInfo>
>                   <its:toolInfo xml:id="T2">
>                    <its:toolName>myMT</its:toolName>
>                    <its:toolVersion>456</its:toolVersion>
>                    <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>                   </its:toolInfo>
>                 </its:processInfo>
>
>                 and allow for several addInfo elements in one
>                 "toolInfo". You won't gain a lot from these, but no
>                 less than with "FR-to-EN-General" inside "toolValue" at
>
>                 http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
>                 Best,
>
>                 Felix
>
>                     , etc. It seems each data category will need one
>                     or two entries that mean different things depending
>                     on the data category. We can use a common element
>                     for this, but then we need to have one tool
>                     information per data category.
>
>                     Maybe the examples people are working on (action
>                     items 239 to 243 for Arle, Phil, Declan and Tadej)
>                     will help in defining this.
>
>                     Cheers
>                     -yves
>
>
>
>                 -- 
>                 Felix Sasaki
>
>                 DFKI / W3C Fellow
>
Received on Tuesday, 9 October 2012 12:03:00 UTC