RE: [xliff] ITS scope with sm/em from Yves Savourel on 2014-10-09 (public-i18n-its-ig@w3.org from October 2014)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Thu, 9 Oct 2014 13:16:01 -0600
To: "'Felix Sasaki'" <fsasaki@w3.org>
CC: "XLIFF Main List" <xliff@lists.oasis-open.org>, "'public-i18n-its-ig'" <public-i18n-its-ig@w3.org>
Message-ID: <00c401cfe3f5$7685da10$63918e30$@enlaso.com>
> Sorry, I don't get this.
> Do you have small examples (e.g. one for translate, one for text analysis)
> of the difference?

My understanding is that to work with an ITS processor we would change each span marked by <sm/>/<em/> to a set of <mrk>/</mrk>.
At least this is what I read from your algorithm (not from Fredrik's option).

For example:

<sm id='1' translate='no'/>French <pc id='2'>Canadian<em startRef='1'/> hockey</pc>.

Would be changed to:

<mrk id='1' translate='no'>French </mrk><pc id='2'><mrk id='m1' translate='no'>Canadian</mrk> hockey</pc>.

This would get you the same properties as the original for each character of the content.
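
To make the per-character claim concrete, here is a minimal Python sketch (an illustration only, not part of the mapping). It computes a translate flag for every character in both forms and compares the results. The <source> wrapper, the function names, and the simplification that only translate='no' is handled (no nested translate='yes' override) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Both fragments are wrapped in a hypothetical <source> element so they parse.
ORIGINAL = ("<source><sm id='1' translate='no'/>French <pc id='2'>Canadian"
            "<em startRef='1'/> hockey</pc>.</source>")
SPLIT = ("<source><mrk id='1' translate='no'>French </mrk><pc id='2'>"
         "<mrk id='m1' translate='no'>Canadian</mrk> hockey</pc>.</source>")


def flags_sm_em(xml):
    """(char, translatable) pairs, treating <sm/>...<em/> as a span."""
    out = []
    open_spans = {}  # sm id -> translate value of the currently open spans

    def emit(text):
        if text:
            translatable = "no" not in open_spans.values()
            out.extend((ch, translatable) for ch in text)

    def walk(elem):
        if elem.tag == "sm":
            open_spans[elem.get("id")] = elem.get("translate")
        elif elem.tag == "em":
            open_spans.pop(elem.get("startRef"), None)
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)  # text following the child, in document order

    walk(ET.fromstring(xml))
    return out


def flags_mrk(xml):
    """(char, translatable) pairs, using ordinary element inheritance."""
    out = []

    def walk(elem, translatable):
        if elem.tag == "mrk" and elem.get("translate") is not None:
            translatable = elem.get("translate") != "no"
        if elem.text:
            out.extend((ch, translatable) for ch in elem.text)
        for child in elem:
            walk(child, translatable)
            if child.tail:
                out.extend((ch, translatable) for ch in child.tail)

    walk(ET.fromstring(xml), True)
    return out


same = flags_sm_em(ORIGINAL) == flags_mrk(SPLIT)
protected = "".join(ch for ch, ok in flags_mrk(SPLIT) if not ok)
```

Under these assumptions both forms flag exactly the characters of "French Canadian" as non-translatable, so the split loses nothing for a data category like translate.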

But you can't do this for a data category where the content is related in a meaningful way to the attributes of the data category.
For example:

<sm id='1' type='term' ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'/>French <pc id='2'>Canadian<em startRef='1'/> hockey</pc>.

Cannot be split into two instances:

<mrk id='1' type='term' ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'>French </mrk><pc id='2'><mrk id='3' type='term' ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'>Canadian</mrk> hockey</pc>.

That is: the term is not "French " or "Canadian"; it is "French Canadian".
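
The difference can be sketched in Python as follows (again only an illustration). The code lists the text covered by each term annotation in the two forms. The <source> wrapper and the function names are assumptions made for the example, and the attribute quoting is normalized so the fragments parse.

```python
import xml.etree.ElementTree as ET

REF = "http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois"
ORIGINAL_TERM = ("<source><sm id='1' type='term' ref='" + REF + "'/>French "
                 "<pc id='2'>Canadian<em startRef='1'/> hockey</pc>.</source>")
SPLIT_TERM = ("<source><mrk id='1' type='term' ref='" + REF + "'>French </mrk>"
              "<pc id='2'><mrk id='3' type='term' ref='" + REF + "'>Canadian"
              "</mrk> hockey</pc>.</source>")


def term_spans_sm_em(xml):
    """Text covered by each type='term' <sm/>...<em/> pair, in document order."""
    spans = {}        # sm id -> list of text pieces inside the span
    open_ids = set()  # ids of spans that are currently open

    def emit(text):
        if text:
            for sm_id in open_ids:
                spans[sm_id].append(text)

    def walk(elem):
        if elem.tag == "sm" and elem.get("type") == "term":
            spans[elem.get("id")] = []
            open_ids.add(elem.get("id"))
        elif elem.tag == "em":
            open_ids.discard(elem.get("startRef"))
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)

    walk(ET.fromstring(xml))
    return ["".join(parts) for parts in spans.values()]


def term_spans_mrk(xml):
    """Text covered by each type='term' <mrk> element, in document order."""
    root = ET.fromstring(xml)
    return ["".join(m.itertext()) for m in root.iter("mrk")
            if m.get("type") == "term"]


one_term = term_spans_sm_em(ORIGINAL_TERM)   # a single annotation
two_terms = term_spans_mrk(SPLIT_TERM)       # two separate annotations
```

With the <sm/>/<em/> form there is one annotation covering "French Canadian"; after the split there are two annotations, one on "French " and one on "Canadian", each pointing to the same ref.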

But maybe I didn't get your algorithm correctly and it doesn't result in multiple <mrk>/</mrk> for a single <sm/>/<em/>.

Looking at Fredrik's note: there are various ways to reduce the <sm/>/<em/> in favor of well-formed <mrk> elements (joining all
segments, prioritizing well-formedness of annotations over inline codes, etc.), but I'm not sure it can be an absolute solution: the
markup may originate from the XLIFF tool and care little about well-formedness, and one can always have overlapping annotations.

But maybe this is a restriction we can live with.

-ys



-----Original Message-----
From: Felix Sasaki [mailto:felix@sasakiatcf.com] 
Sent: Thursday, October 9, 2014 12:20 PM
To: Yves Savourel
Cc: XLIFF Main List; public-i18n-its-ig
Subject: Re: [xliff] ITS scope with sm/em


On 09.10.2014 at 14:18, Yves Savourel <ysavourel@enlaso.com> wrote:

> Yes, something like the MT Confidence value is different, but that
> conversion can be described in the mapping itself (if I recall correctly).
> So an ITS processor has nothing 'special' to do: it just applies the rules.
> 
> I suppose we could have additional pre-processing steps for a case
> like <sm>/<em>. But that means you can't really use a 'pure' ITS processor
> to look at an XLIFF file because it would not know how to do the pre-processing.
> But that is probably acceptable, especially if we provide generic ways to do the transformation.
> 
> This said, I'm not 100% sure you can transform <sm>/<em> into
> <mrk>/</mrk> for all data categories: it would be OK for things like translate, domain, etc.
> But info like Terminology, Text Analysis, and LQI makes sense only when set on the content as a whole.

Sorry, I don't get this. Do you have small examples (e.g. one for translate, one for text analysis) of the difference?

- Felix

> 
> -ys
> 
> 
> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] 
> On Behalf Of Felix Sasaki
> Sent: Thursday, October 9, 2014 5:53 AM
> To: Yves Savourel
> Cc: Dr. David Filip; Estreen, Fredrik; XLIFF Main List; 
> public-i18n-its-ig
> Subject: Re: [xliff] ITS scope with sm/em
> 
> Hi Yves, all,
> 
> Understood. Though: aren't there also other parts of the XLIFF/ITS
> mapping that require special handling from an ITS 1.0 or 2.0 processor?
> E.g. MT confidence
> https://www.w3.org/International/its/wiki/XLIFF_2.0_Mapping#MT_Confidence_.28.3D.3D.3D.3D.3D.3D.3D.3D.3D.3DTO_REVIEW.29
> which requires a computation of values. In the other thread you just listed the types of processors:
> 
> "- An XLIFF Extractor aware of both ITS and the ITS module for any data coming from the original source document.
> - An XLIFF Modifier aware of the ITS Module for data generated during the life time of the XLIFF document.
> - An XLIFF Merger aware of both the ITS Module and the ITS syntax if any of that data is merged back into the translated document."
> 
> Couldn't we require in the mapping specification that, before a
> general ITS processor uses XLIFF+ITS content, it has to do the preprocessing
> described in this thread, the one for MT confidence, etc.?
> 
> Cheers,
> 
> Felix
> 
> On 09.10.2014 at 13:33, Yves Savourel <ysavourel@enlaso.com> wrote:
> 
>> Hi all,
>> 
>> Thanks for the input Fredrik and Felix.
>> 
>> I'm not worried about the XLIFF implementation of those cases: we
>> have had working code for those for a long time (a good use
>> case is mrk with translate='yes|no').
>> 
>> I was thinking more about the ITS aspect of it.
>> 
>> From an ITS viewpoint, in something like this: <sm id='1'
>> itx:domain='travel'/>...<em startRef='1'/> the scope of the domain
>> is an empty content (the content of <sm/>). There is nothing in ITS that allows using distinct elements to annotate a span.
>> 
>> Because, while on the XLIFF side the processing expectation is to
>> treat the content between a given <sm/> and its corresponding
>> <em/> as a span, on the ITS side there are no semantics for such a construct.
>> 
>> Cheers,
>> -ys
>> 
>> 
>> From: Dr. David Filip [mailto:David.Filip@ul.ie]
>> Sent: Thursday, October 9, 2014 5:08 AM
>> To: Felix Sasaki
>> Cc: Estreen, Fredrik; Yves Savourel; XLIFF Main List; 
>> public-i18n-its-ig
>> Subject: Re: [xliff] ITS scope with sm/em
>> 
>> Felix, I like the algorithmic approach that is open to different implementations.
>> 
>> After all ITS is a set of abstract categories that should not be restricted to hierarchical structured formats.
>> 
>> Now to your proposed algorithm.
>> 
>> Unlike native codes, annotations MUST have the opening and closing tag in the same unit.
>> So you will always be creating <mrk> nodes from <sm/> tags if you consider the whole <unit> content, which is the point.
>> 
>> Cheers
>> dF
>> 
>> 
>> Dr. David Filip
>> =======================
>> OASIS XLIFF TC Secretary, Editor, and Liaison Officer LRC | CNGL | 
>> CSIS University of Limerick, Ireland
>> telephone: +353-6120-2781
>> cellphone: +353-86-0222-158
>> facsimile: +353-6120-2734
>> http://www.cngl.ie/profile/?i=452
>> mailto: david.filip@ul.ie
>> 
>> On Thu, Oct 9, 2014 at 3:49 AM, Felix Sasaki <felix@sasakiatcf.com> wrote:
>> I agree with Fredrik. Processing of overlapping hierarchies is a task
>> that cannot be solved in general, and discarding
>> non-hierarchical structures is a good strategy for XML / HTML content.
>> 
>> 
>> If people don't want to specify an XSLT conversion we could also define the conversion process in an algorithmic way like this:
>> 
>> 0) set current content to whole content to be processed.
>> 1) is there an s tag in current content?
>>       Then output text before s tag and do 2)
>>       else just output all text in current content.
>> 2) has the s tag an e tag with corresponding id?
>>       Then create a mrk node
>>       set the content between s and e to new current content
>>       do 1)
>>       else discard s and go to 1)
>> 3) output rest of text
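
One literal reading of the steps above can be sketched in Python as below. This is a hedged illustration and not the XSLT the thread has in mind: it works on the inline content as a plain string with regular expressions, and the function names, the recursion, and the choice to copy the <sm/> attributes onto the new <mrk> are assumptions made for the example.

```python
import re

# An <sm .../> start marker; group 1 holds its attributes.
SM = re.compile(r"<sm\s+([^>]*?)/>")
# The id attribute inside those attributes; group 2 holds the value.
ID = re.compile(r"""\bid=(['"])(.*?)\1""")


def find_em(sm_id, text):
    """Return the match for <em startRef='sm_id'/> in text, or None."""
    pattern = r"<em\s+startRef=(['\"])" + re.escape(sm_id) + r"\1\s*/>"
    return re.search(pattern, text)


def sm_to_mrk(content):
    """Rewrite matched <sm/>...<em/> pairs as <mrk>...</mrk>."""
    sm = SM.search(content)
    if sm is None:
        return content                      # step 1, else: output all text
    before = content[:sm.start()]           # step 1: text before the s tag
    attrs = sm.group(1).strip()
    rest = content[sm.end():]

    id_match = ID.search(attrs)
    em = find_em(id_match.group(2), rest) if id_match else None
    if em is None:
        return before + sm_to_mrk(rest)     # step 2, else: discard the s tag

    inner = rest[:em.start()]               # step 2: content between s and e
    after = rest[em.end():]                 # step 3: rest of the text
    return (before + "<mrk " + attrs + ">" + sm_to_mrk(inner) + "</mrk>"
            + sm_to_mrk(after))


result = sm_to_mrk("a <sm id='1' translate='no'/>b<em startRef='1'/> c")
```

In this sketch each matched pair becomes exactly one <mrk>, and an <sm/> without a matching <em/> is dropped. Note that when a span crosses the boundary of another inline element such as <pc>, this literal reading emits a <mrk> that overlaps that element, so the output is not well-formed; some extra step (splitting the span, or giving priority to the annotation) would still be needed there.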
>> 
>> and say: you can implement this as XSLT (example given) or in
>> different programming languages. That would have the benefit of keeping
>> the door open to future non-XML, API-focused XLIFF.
>> 
>> - Felix
>> 
>> On 08.10.2014 at 18:41, Estreen, Fredrik <Fredrik.Estreen@lionbridge.com> wrote:
>> 
>>> Hi Yves,
>>> 
>>>> Hi all,
>>>> 
>>>> Looking at the ITS mapping: In many cases we put the ITS information 
>>>> on a marker (<mrk> element).
>>>> 
>>>> But such an element can be represented by <sm/>...<em/> when it's 
>>>> overlapping another element.
>>>> In that case the normal ITS scope mechanism can't work because it 
>>>> applies to the empty content of <sm/>, not the content between 
>>>> <sm/> and the corresponding <em/>.
>>>> 
>>>> We can have provision for this in the XLIFF module. But I'm not 
>>>> sure it's doable in the ITS rules, especially with inheritance when 
>>>> there are nested annotations.
>>> 
>>> This is an interesting problem and I doubt it is solvable in a
>>> general way without additional steps. It might be solvable when
>>> the <sm/> and <em/> are in the same segment, but I doubt it is in the case
>>> where they start and end in different segments (i.e. different sibling trees).
>>> 
>>> One potentially workable solution would be to apply an XSLT transform on the XLIFF that merges all segments in each unit,
>>> discards any non-ITS-carrying markers (to reduce the risk of overlapping
>>> markers), and finally normalizes the remaining markers to the
>>> <mrk></mrk> spanning form. Since ITS information will likely be coming
>>> from and going to an XML source, there should not be any overlapping
>>> markers at that stage, as they would be difficult to represent in the source format.
>>> It is not guaranteed, but we could declare that ill-formed. ITS global rules could then be
>>> evaluated against the transformed version. Admittedly not the most beautiful solution, but
>>> I think it could work.
>>> 
>>>> I vaguely recall that such a topic was discussed at some point in the ITS-WG, no?
>>>> Does anyone recall the outcome?
>>>> 
>>>> Cheers,
>>>> -ys
>>> 
>>> Regards,
>>> Fredrik Estreen
>>> 
>>> --------------------------------------------------------------------
>>> To unsubscribe from this mail list, you must leave the OASIS TC
>>> that generates this mail.  Follow this link to all your TCs in OASIS
>>> at:
>>> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
>>> 
Received on Thursday, 9 October 2014 19:16:29 UTC