Re: [ISSUE-42] Wording for the tool information markup

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 4 Oct 2012 17:40:28 +0200
Message-ID: <CAL58czrqTaAtwdsrJqJmUK2iVE0XcB9BM+=bcg9i-7QsAVgiGg@mail.gmail.com>
To: Mārcis Pinnis <marcis.pinnis@tilde.lv>
Cc: Tatiana Gornostay <tatiana.gornostay@tilde.lv>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Raivis Skadiņš <Raivis.Skadins@tilde.lv>, Andrejs Vasiļjevs <Andrejs@tilde.lv>
Hi Mārcis,

your mail did not reach the list. Just FYI, I think you were subscribed to
the list with marcis.pinnis@Tilde.lv (with upper case "T" in "Tilde"), so you
would have needed to send it from that address. I changed that to
marcis.pinnis@tilde.lv, so your next mail should reach the list. Some
comments below.


2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>

> Dear Felix,
>
> Thank you for the explanation. I see that the toolinfo can manage the
> identification of tools. But does ITS also require users (people) to be
> treated as tools?
>


We could rename "tool" to process - and would end up with provenance. But
maybe that's sufficient.



> That was not clear to me. Or, does ITS specify separate tags for
> identification of who/what added an annotation?
>

No, that's exactly the point: we don't have a way to specify "who created
an annotation?". The purpose of "tool info" is just that. And it is - to
use that nice word again - "orthogonal" to the data category annotation
itself. That is, you want to relate it to its:term, but you don't want to
repeat it all the time, and you don't want to make it mandatory.


> I guess it is clear that a “termConfidence” is necessary. And the “term”
> tag is required (the termCandidate can be omitted, as it could potentially
> be redundant if a reference to the annotator or the authority of annotation
> is given).
>
> On the Translate question maybe you can explain a bit more why, in your
> opinion, the 1) and 2) should be combined in a general meaning? They both
> describe data that has to be handled differently. The “Translate” category
> as I understand solves either 1) or 2) (and this depends on every
> implementation), but not both.
>


Yes, that was my point: we leave it to the implementation whether it
wants to handle 1) or 2). The main idea of ITS is to specify
really atomic metadata items.

Your requirement to differentiate 1) vs. 2) could e.g. be handled by a
localization note:

<its:locNoteRule selector="//h:img" locNote="Drop this in the workflow,
don't give it to translator"/>

But you are probably looking for a machine readable way to achieve this?

Best,

Felix


> Best regards,
>
> Mārcis.
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 3:58 PM
> *To:* Mārcis Pinnis
> *Cc:* Tatiana Gornostay; Yves Savourel; public-multilingualweb-lt@w3.org;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> 2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>
>
> Dear Felix,
>
> Having only the confidence to distinguish between an automatically
> identified term and a user-approved term is not enough, as various term
> annotation tools can have different confidence scores (they may also be in
> log form, depending on the implementation). Thus having a strict value “1”
> for user-approved or term-bank-based terms is not enough. In an ideal
> scenario, at least from my perspective, there should be a way to identify
> who (a system, which system? a user, who? an authority, which authority?)
> annotated each term (not just at document level, but also at individual
> term level) and what the confidence of the respective identifier given
> to the term candidate (or even a term) is.
>
> Understood. That might bring us to "toolinfo" again. The solution that
> Yves mentioned at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
>
> would allow you to create identifiers for this complex type of
> information.
>
> To make it a bit simpler: using only termConfidence to distinguish
> user-approved or trusted terms is not enough, as the termConfidence
> is not reliable for such purposes.
>
> A natural representation, in my opinion, would identify the “annotator”
> (using categories – term bank, user, automatic tool, authority), the term
> confidence and the ID of the “annotator” (in order to identify the
> annotator precisely).
>
> Of course, for TermBank-based terms there should also be a reference
> pointer so that more information could be identified.
>
> Understood - the question mainly is: what needs to be standardized, and
> what could be a URI to that complex information.
>
> Actually ... one question that is *off topic* here ... I tried
> following your discussions about the MT-related “Translate” data category
> and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not be
> translated (should be kept as is, but is helpful for disambiguation of the
> translatable parts);
>
> 2) has to be completely ignored and not even passed through a
> translation system (for instance, numbers in tables, encrypted images
> within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture), the
> “Translate” tag is meant only to tell an MT system that something
> has to be kept as is; some parts could be irrelevant to send through
> the MT system at all, but that is not solved by the Translate tag.
>
> "Translate" in fact is very general and doesn't distinguish between 1) and
> 2). E.g. IIRC, in Okapi it is also used to create pseudo-translated text.
>
> Best,
>
> Felix
>
> Best regards,
>
> Mārcis Pinnis
> Researcher
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
> Raivis Skadiņš; Andrejs Vasiļjevs
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in
> mind the following potential attributes proposed by Tilde in view of the
> terminology use case, namely *its-termInfoRef*, *its-termCandidate*, and
> *its-termConfidence*, and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> We could then say: if something is a term, then the confidence is 1. That is,
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span>
> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span>
> (ITS 2.0)
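
The equivalence proposed above can be sketched in Python as follows. This is only an illustration: `effective_term_confidence` is a hypothetical helper, the unprefixed `termConfidence` attribute name is copied from the snippets above, and the default of 1 is the proposal under discussion in this thread rather than settled ITS behaviour.

```python
from typing import Mapping, Optional


def effective_term_confidence(attrs: Mapping[str, str]) -> Optional[float]:
    """Return the confidence for a span's attributes, or None if it is not a term."""
    if attrs.get("its:term") != "yes":
        return None
    # Proposal from this thread (an assumption, not final ITS 2.0):
    # a term without termConfidence is treated as confidence 1.
    return float(attrs.get("termConfidence", "1"))


confirmed = {"its:term": "yes", "its:termInfoRef": "..."}
candidate = {"its:term": "yes", "its:termInfoRef": "...", "termConfidence": "0.9"}

print(effective_term_confidence(confirmed))  # 1.0
print(effective_term_confidence(candidate))  # 0.9
```

Under this reading a consumer would treat any value below 1 as a term candidate, which is exactly the point the thread goes on to question.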
>
> Felix
>
> These are not represented in the current draft, and if we go this way then
> we will have to discuss and, probably, add them. I remember that Tadej
> raised this question in Prague and we did not talk about it, unfortunately.
> On the other hand, as soon as we start the project we will have the opportunity
> and time to do it, and my colleagues will also join the discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A
> comment on the tool specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > xmlns:its="http://www.w3.org/2005/11/its">
>
> > Would it make sense to use a different delimiter? "/" may conflict with
> > "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first
> occurrence is the delimiter.
> But I suppose '|' could be used instead. It just doesn't look as graceful
> for some reason.
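
A minimal sketch of the "first occurrence is the delimiter" rule described above, in Python. The function name `split_tool_ref` is hypothetical, and the `category/URI` value format is the one proposed in this thread, not a finalized ITS syntax.

```python
from typing import Tuple


def split_tool_ref(entry: str, delimiter: str = "/") -> Tuple[str, str]:
    """Split one toolRefs entry into (data_category, tool_uri).

    Only the first occurrence of the delimiter separates the two parts,
    so later occurrences inside the URI are left untouched.
    """
    category, sep, uri = entry.partition(delimiter)
    if not sep or not category or not uri:
        raise ValueError("malformed toolRefs entry: %r" % entry)
    return category, uri


print(split_tool_ref("mtConfidence/file:///tools.xml#T1"))
# ('mtConfidence', 'file:///tools.xml#T1')
```

The same function handles the alternative delimiter by passing `delimiter="|"`; either way the split is unambiguous as long as the data category name itself never contains the delimiter.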
>
> > Do you need the "dataCategory" attribute? It seems the
> > data category is made explicit via the reference mechanism in
> > "its:toolRefs".
> > Also, dropping the "dataCategory" attribute then allows referring to
> > the same tools from various data categories - e.g. OKAPI used for quality
> > issues versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool
> information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
> Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
> etc."
>
> To me that is the same as saying: you can use anything. Of course we can
> wrap the "anything" in a field saying "here is MT engine information". Is
> that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific.
> My prediction is that there will be no more interop than saying "in this
> field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>  <its:toolInfo xml:id="T1">
>   <its:toolName>Bing Translator</its:toolName>
>   <its:toolVersion>123</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>  </its:toolInfo>
>  <its:toolInfo xml:id="T2">
>   <its:toolName>myMT</its:toolName>
>   <its:toolVersion>456</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>  </its:toolInfo>
> </its:processInfo>
>
> and allow for several toolAddInfo elements in one "toolInfo". You won't gain a
> lot from these, but no less than with "FR-to-EN-General" inside "toolValue"
> at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
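
A rough sketch of how a consumer might read such a block, using Python's standard `xml.etree.ElementTree`. The `processInfo`, `toolInfo` and `toolAddInfo` names are the proposal from this thread, not finalized ITS markup; the `xmlns:its` declaration is added here only so that the fragment parses on its own, and `tool_add_info` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ITS_NS = {"its": "http://www.w3.org/2005/11/its"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

PROCESS_INFO = """\
<its:processInfo xmlns:its="http://www.w3.org/2005/11/its">
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>
"""


def tool_add_info(xml_text, tool_id, data_category):
    """Return all toolAddInfo values of one tool for one data category."""
    root = ET.fromstring(xml_text)
    for tool in root.findall("its:toolInfo", ITS_NS):
        if tool.get(XML_ID) != tool_id:
            continue
        # Several toolAddInfo elements per toolInfo are allowed in this sketch.
        return [
            info.text or ""
            for info in tool.findall("its:toolAddInfo", ITS_NS)
            if info.get("datacategory") == data_category
        ]
    return []


print(tool_add_info(PROCESS_INFO, "T2", "mtconfidence"))
# ['Domain:IT-Pair:IT-JA']
```

The returned strings stay opaque to the consumer, which matches the point made above: the field only says that data-category-specific information is present, not what it means.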
>
> Best,
>
> Felix
>
> , etc. It seems each data category will need one or two entries that mean
> different things depending on the data category. We can use a common
> element for this, but then we need to have one tool information per data
> category.
>
> Maybe the examples people are working on (action items 239 to 243 for
> Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves
>
> --
> Felix Sasaki
> DFKI / W3C Fellow



-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Thursday, 4 October 2012 15:40:54 UTC
