
Re: [ISSUE-42] Wording for the tool information markup

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 4 Oct 2012 14:57:45 +0200
Message-ID: <CAL58czr78Nk=7B5R1P9V0CRV0B_FGsOODg63Jj7GnfqfBFT_kw@mail.gmail.com>
To: Mārcis Pinnis <marcis.pinnis@tilde.lv>
Cc: Tatiana Gornostay <tatiana.gornostay@tilde.lv>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Raivis Skadiņš <Raivis.Skadins@tilde.lv>, Andrejs Vasiļjevs <Andrejs@tilde.lv>
2012/10/4 Mārcis Pinnis <marcis.pinnis@tilde.lv>

> Dear Felix,
>
> Having only the confidence to distinguish between an automatically
> identified term and a user-approved term is not enough, as different term
> annotation tools can have different confidence scores (they may also be in
> log form, depending on the implementation). Thus a strict value of “1”
> for user-approved or term-bank-based terms is not enough. In an ideal
> scenario, at least from my perspective, there should be a way to identify
> who (a system: which system? a user: who? an authority: which authority?)
> annotated each term (not just at document level, but also at the individual
> term level) and what confidence the respective annotator gave
> to the term candidate (or even to a term).
>


Understood. That might bring us back to "toolinfo". The solution that Yves
mentioned at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0035.html
would allow you to create identifiers for this complex type of information.


> To put it more simply, using only termConfidence to identify
> user-approved or trusted terms is not enough, as termConfidence
> is not reliable for such purposes.
>
>
> A natural representation, in my opinion, would identify the “annotator”
> (using categories – term bank, user, automatic tool, authority), the term
> confidence and the ID of the “annotator” (in order to identify the
> annotator precisely).
>
> Of course, for TermBank based terms there should be also a reference
> pointer so that more information could be identified.
>


Understood. The main question is: what needs to be standardized, and
what could be a URI pointing to that complex information.
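
To make the discussion concrete, here is a minimal sketch of the record
Mārcis describes: an annotator category (term bank, user, automatic tool,
authority), an annotator ID, a confidence, and an optional term bank
reference. The names TermAnnotation, AnnotatorKind and is_trusted, and the
example identifiers, are assumptions for illustration only; none of them is
part of an ITS draft.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AnnotatorKind(Enum):
    # The four categories listed in the quoted message.
    TERM_BANK = "term bank"
    USER = "user"
    AUTOMATIC_TOOL = "automatic tool"
    AUTHORITY = "authority"


@dataclass(frozen=True)
class TermAnnotation:
    text: str
    annotator_kind: AnnotatorKind
    annotator_id: str                    # hypothetical identifier, e.g. a URI
    confidence: Optional[float] = None   # tool-specific scale; may be absent
    term_info_ref: Optional[str] = None  # pointer into a term bank, if any

    def is_trusted(self) -> bool:
        # Trust is derived from who annotated, not from the score.
        return self.annotator_kind is not AnnotatorKind.AUTOMATIC_TOOL


# A log-scale score from a tool and an unscored user approval can coexist.
auto = TermAnnotation("motherboard", AnnotatorKind.AUTOMATIC_TOOL,
                      "tool:example-extractor", confidence=-0.3)
user = TermAnnotation("motherboard", AnnotatorKind.USER, "user:example")
```

The point of the sketch is only that the "who" and the "how confident" are
separate fields, so a consumer never has to guess trust from the number.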



> Actually ... one question that is *off topic* here ... I tried
> following your discussions about the MT-related “Translate” data category
> and a question arose: do you distinguish between something that:
>
> 1) has to be passed through a translation system, but should not
> be translated (should be kept as is, but is helpful for disambiguation of
> the translatable parts);
>
> 2) has to be completely ignored and not even passed through a
> translation system (for instance, numbers in tables, encrypted images
> within HTML5, etc.).
>
> From what I have understood (maybe I did not get the full picture) – the
> “Translate” tag is meant only for an MT system to tell it that something
> has to be kept as is, but some parts could be irrelevant to send through
> the MT systems, but that is not solved by the Translate tag.
>

"Translate" in fact is very general and doesn't distinguish between 1) and
2). E.g. IIRC, in Okapi it is used also to create pseudo translated text.

Best,

Felix


> Best regards,
>
> Mārcis Pinnis
> Researcher
> Tilde
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Thursday, October 04, 2012 2:54 PM
> *To:* Tatiana Gornostay
> *Cc:* Yves Savourel; public-multilingualweb-lt@w3.org; Mārcis Pinnis;
> Raivis Skadiņš; Andrejs Vasiļjevs
>
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Dear Tatiana, all,
>
> 2012/10/3 Tatiana Gornostay <tatiana.gornostay@tilde.lv>
>
> Dear Felix, Yves, Dear All,
>
> W.r.t. the ongoing discussion on *toolInfo* and *mtConfidence*, I have in
> mind the following potential attributes proposed by Tilde in view of the
> terminology use case, namely *its-termInfoRef*, *its-termCandidate*, and
> *its-termConfidence*, and their values.
>
> Would it also work to just add "termConfidence" to
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#terminology-implementation
>
> We could then say: if something is a term, then the confidence is 1. That is,
>
> <span its:term="yes" its:termInfoRef="...">...</span> (ITS 1.0 or ITS 2.0)
>
> is equal to
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="1">...</span>
> (ITS 2.0)
>
> and a term candidate would be
>
> <span its:term="yes" its:termInfoRef="..." termConfidence="0.9">...</span>
> (ITS 2.0)
>
> Felix
>
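
The defaulting rule in the three spans above (a term with no explicit value
counts as confidence 1) can be sketched as follows. This is only an
illustration: the function name term_confidence and the "#t1"/"#t2"
references are made up, and the attribute is read unprefixed because that
is how it is written in the examples, not because any draft defines it so.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def term_confidence(element):
    """Return the confidence for a term element, or None if it is not a term."""
    if element.get(f"{{{ITS_NS}}}term") != "yes":
        return None
    # A term without an explicit value is treated as confidence 1.
    return float(element.get("termConfidence", "1"))


doc = ET.fromstring(
    f'<p xmlns:its="{ITS_NS}">'
    '<span its:term="yes" its:termInfoRef="#t1">approved</span>'
    '<span its:term="yes" its:termInfoRef="#t2" termConfidence="0.9">candidate</span>'
    '<span>plain</span>'
    '</p>'
)
values = [term_confidence(span) for span in doc.findall("span")]
```
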
> These are not represented in the current draft, and if we go this way then
> we will have to discuss and, probably, add them. I remember that Tadej
> raised this question in Prague and we did not talk about it, unfortunately.
> On the other hand, as soon as we start the project we will have the
> opportunity and time to do it, and my colleagues will also join the
> discussion.
>
> With best wishes,
>
> Tatiana
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Wednesday, October 03, 2012 12:29 AM
> *To:* Yves Savourel
> *Cc:* public-multilingualweb-lt@w3.org
> *Subject:* Re: [ISSUE-42] Wording for the tool information markup
>
> Hi Yves, all,
>
> no opinion on my side on the delimiter topic, sorry for bringing it up. A
> comment on the tool-specific aspect below.
>
> 2012/10/2 Yves Savourel <ysavourel@enlaso.com>
>
> > <doc its:toolRefs="mtConfidence/file:///tools.xml#T1"
> > xmlns:its="http://www.w3.org/2005/11/its">
>
> > Would it make sense to use a different delimiter? "/" may conflict with
> > "/" in paths.
>
> Hmm... almost any ASCII delimiter may also be in the path. The first
> occurrence is the delimiter.
> But I suppose '|' could be used instead. It just doesn't look as graceful
> for some reason.
>
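
A small sketch of the "first occurrence is the delimiter" rule, using the
its:toolRefs value from the example. The helper name split_tool_ref is an
assumption; the thread does not define any processing API.

```python
from urllib.parse import urldefrag


def split_tool_ref(value):
    """Split 'dataCategory/URI' at the first '/' only."""
    data_category, sep, uri = value.partition("/")
    if not sep or not data_category or not uri:
        raise ValueError(f"expected 'dataCategory/URI', got {value!r}")
    return data_category, uri


category, uri = split_tool_ref("mtConfidence/file:///tools.xml#T1")
# The later slashes stay inside the URI; the fragment names the tool entry.
document, tool_id = urldefrag(uri)
```

Because only the first "/" is consumed, slashes in the path do not conflict
with the delimiter.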
>
>
> > Do you need the "dataCategory" attribute? It seems the
> > data category is made explicit via the reference mechanism in
> > "its:toolRefs".
> > Also, dropping the "dataCategory" attribute then allows referring to
> > the same tools from various data categories - e.g. OKAPI used for quality
> > issue versus for creating translation metadata etc.
>
> I'm not sure we can go from many data category instances to one tool
> information. And this is where I'm having trouble with tool information:
>
> The mtConfidence needs to have a defined way to specify the engine used
>
> Is there really a defined way? The current version of the draft at
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#mtconfidence-implementation
>
> says:
>
> "Some examples of values are:
>
> A BCP 47 language tag with t-extension, e.g. ja-t-it for an Italian to
> Japanese MT engine
>
> A Domain as per the Section 6.9: Domain
>
> A privately structured string, eg. Domain:IT-Pair:IT-JA, IT-JA:Medical,
> etc."
>
> To me that is the same as saying: you can use anything. Of course we can
> wrap the "anything" in a field saying "here is MT engine information". Is
> that what you mean?
>
> , the Text analysis may need something else
>
> I actually doubt that the text analysis "anything" will be more specific.
> My prediction is that there will be no more interop than saying "in this
> field there is data category specific information: ...".
>
> So you could achieve that by changing your proposal like this
>
> <its:processInfo>
>  <its:toolInfo xml:id="T1">
>   <its:toolName>Bing Translator</its:toolName>
>   <its:toolVersion>123</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
>  </its:toolInfo>
>  <its:toolInfo xml:id="T2">
>   <its:toolName>myMT</its:toolName>
>   <its:toolVersion>456</its:toolVersion>
>   <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
>  </its:toolInfo>
> </its:processInfo>
>
> and allow for several toolAddInfo elements in one "toolInfo". You won't
> gain a lot from these, but no less than with "FR-to-EN-General" inside
> "toolValue" at
>
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Oct/0000.html
>
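
As a rough sketch of how a consumer might read such a structure, assuming a
well-formed version of the proposed markup with proper closing tags: the
function below looks up one tool by its xml:id and returns its additional
information for one data category. The element and attribute names follow
the proposal in this thread, not a published schema, and add_info is a
hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"its": "http://www.w3.org/2005/11/its"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

TOOLS = """\
<its:processInfo xmlns:its="http://www.w3.org/2005/11/its">
 <its:toolInfo xml:id="T1">
  <its:toolName>Bing Translator</its:toolName>
  <its:toolVersion>123</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">ja-t-it</its:toolAddInfo>
 </its:toolInfo>
 <its:toolInfo xml:id="T2">
  <its:toolName>myMT</its:toolName>
  <its:toolVersion>456</its:toolVersion>
  <its:toolAddInfo datacategory="mtconfidence">Domain:IT-Pair:IT-JA</its:toolAddInfo>
 </its:toolInfo>
</its:processInfo>
"""


def add_info(root, tool_id, data_category):
    """Return the toolAddInfo strings of one tool for one data category."""
    for tool in root.findall("its:toolInfo", NS):
        if tool.get(XML_ID) == tool_id:
            return [
                info.text
                for info in tool.findall("its:toolAddInfo", NS)
                if info.get("datacategory") == data_category
            ]
    return []  # unknown tool id


root = ET.fromstring(TOOLS)
engine_info = add_info(root, "T2", "mtconfidence")
```

Note that the returned strings remain opaque: the code can find the field,
but it cannot interpret "ja-t-it" versus "Domain:IT-Pair:IT-JA" without
tool-specific knowledge, which is the interoperability limit discussed here.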
> Best,
>
> Felix
>
> , etc. It seems each data category will need one or two entries that mean
> different things depending on the data category. We can use a common
> element for this, but then we need to have one tool information per data
> category.
>
> Maybe the examples people are working on (action items 239 to 243 for
> Arle, Phil, Declan and Tadej) will help in defining this.
>
> Cheers
> -yves
>
> --
> Felix Sasaki
> DFKI / W3C Fellow



-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Thursday, 4 October 2012 12:58:16 UTC
